kblin / ncbi-genome-download

Scripts to download genomes from the NCBI FTP servers
Apache License 2.0

Human readable producing non-unique folders #63

Open tseemann opened 6 years ago

tseemann commented 6 years ago

The strain column isn't unique. Might need to detect this and append the GCF_ number to the strain to discriminate?

/home/tseemann/tmp/B.cereus/human_readable/refseq/bacteria/Bacillus/cereus/E33L/GCF_000011625.1_ASM1162v1_genomic.fna.gz

/home/tseemann/tmp/B.cereus/human_readable/refseq/bacteria/Bacillus/cereus/E33L/GCF_000833045.1_ASM83304v1_genomic.fna.gz

kblin commented 6 years ago

Thanks for the report!

kblin commented 4 years ago

Hi @tseemann, finally having some time to look at this. I'm beginning to feel like this works as intended ™️. If there are multiple assemblies for a strain, the strain dir will have multiple files. The way I understand your report is that this isn't what you are expecting.

From your perspective, what is the benefit of having two strain_assembly_id folders, rather than two files in a strain folder?

tseemann commented 4 years ago

These are not different assemblies of the same thing though?

https://www.ncbi.nlm.nih.gov/assembly/GCF_000011625.1/ https://www.ncbi.nlm.nih.gov/assembly/GCF_000833045.1/

They are different biosamples. They just happen to have the same "strain" name, but this is not an enforced unique field. They could have come originally from the same freezer stock but been passaged since? Some labs use such generic strain IDs that clashes happen all the time.
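
For illustration, the disambiguation suggested at the top of this thread (detect a clashing strain name and append the GCF_ accession) could look roughly like the sketch below. This is not code from ncbi-genome-download: the function name `strain_dir_names`, the `(strain, accession)` input shape, and the `strain_accession` naming scheme are all assumptions made for the example, and the third entry is a made-up placeholder.

```python
from collections import Counter


def strain_dir_names(entries):
    """Map each assembly accession to a strain directory name.

    `entries` is an iterable of (strain, accession) pairs.
    Hypothetical helper, not part of ncbi-genome-download.
    """
    entries = list(entries)
    # Count how many assemblies share each strain name.
    counts = Counter(strain for strain, _ in entries)

    names = {}
    for strain, accession in entries:
        if counts[strain] > 1:
            # Clash: append the accession so each biosample gets its own folder.
            names[accession] = f"{strain}_{accession}"
        else:
            # Unique strain name: keep the plain strain folder.
            names[accession] = strain
    return names


entries = [
    ("E33L", "GCF_000011625.1"),           # from the report above
    ("E33L", "GCF_000833045.1"),           # from the report above
    ("hypothetical_strain", "GCF_000000000.1"),  # placeholder, not a real record
]

for accession, name in strain_dir_names(entries).items():
    print(accession, "->", name)
```

With this scheme only the clashing E33L entries gain a suffix, so existing paths for uniquely named strains would stay unchanged.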