kblin / ncbi-genome-download

Scripts to download genomes from the NCBI FTP servers
Apache License 2.0

problems downloading representative genome #99

Open jotech opened 5 years ago

jotech commented 5 years ago

I'm trying to download the RefSeq representative genome for Vibrio lentus, as listed here: https://www.ncbi.nlm.nih.gov/genome/?term=vibrio+lentus%5Borgn%5D. It also has the corresponding RefSeq category: http://tiny.cc/hncrdz

But when I try to download the genomes, I get an error:

ncbi-genome-download --dry-run -R representative --taxid 136468 bacteria
ncbi-genome-download --dry-run -R representative --genus "Vibrio lentus" bacteria
ERROR: No downloads matched your filter. Please check your options.

Besides this, ncbi-genome-download --dry-run --taxid 136468 bacteria shows me all 87 available genomes, but I'm looking for the representative only. What am I missing?

jrjhealey commented 5 years ago

This appears to be a problem with the use of -R representative, but more specifically with the actual data for that entry...

If you query the assembly_summary file for that genome (wget ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt):

$ grep "GCF_001691195.1" assembly_summary_refseq.txt
GCF_001691195.1 PRJNA224116     SAMN04867935    MAKA00000000.1  na      136468  136468  Vibrio lentus   strain=5F79             latest  Scaffold        Major   Full    2016/07/21      ASM169119v1     Massachusetts Institute of Technology     GCA_001691195.1 identical       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/691/195/GCF_001691195.1_ASM169119v1

If we compare this with an entry that is tagged as representative, there's an na in the column where representative genome would be expected. I was able to download the genome by providing the accession directly with -A GCA_001691195.1.

$ grep -i "representative" assembly_summary_refseq.txt | head -1
GCF_000001765.3 PRJNA18793      SAMN00779672    AADE00000000.1  representative genome   46245   7237    Drosophila pseudoobscura pseudoobscura  strain=MV2-25           latest  Chromosome      Major   Full    2013/04/11      Dpse_3.0  Baylor College of Medicine      GCA_000001765.2 identical       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/765/GCF_000001765.3_Dpse_3.0

Perhaps Kai knows differently, but this appears to be an issue with the actual NCBI records?

Closer inspection of the summary file shows that all 87 genomes have na in that column. In that case, I don't think this is something this tool will be able to help you with.
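For anyone who wants to repeat that check, here is a minimal Python sketch that tallies the category column for one species. It assumes the file is tab-separated and that the category and species taxid sit in the fifth and seventh columns, as the two rows quoted above suggest; the function name and the column constants are illustrative, not part of ncbi-genome-download.

```python
from collections import Counter

# 0-based column positions, inferred from the rows quoted in this thread.
# This is an assumption: check it against the header line of your copy of the file.
REFSEQ_CATEGORY_COL = 4
SPECIES_TAXID_COL = 6


def category_counts(lines, species_taxid):
    """Count the category values of all rows that belong to one species taxid."""
    counts = Counter()
    wanted = str(species_taxid)
    for line in lines:
        if line.startswith("#"):
            continue  # header and comment lines
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) <= SPECIES_TAXID_COL:
            continue  # blank or truncated line
        if fields[SPECIES_TAXID_COL] == wanted:
            counts[fields[REFSEQ_CATEGORY_COL]] += 1
    return counts


# The Vibrio lentus row from above, abridged to its first eight fields.
demo_rows = [
    "GCF_001691195.1\tPRJNA224116\tSAMN04867935\tMAKA00000000.1\tna\t136468\t136468\tVibrio lentus\n",
]
print(category_counts(demo_rows, 136468))
```

To run it against the real file, pass an open file handle instead of the list, for example `category_counts(open("assembly_summary_refseq.txt", encoding="utf-8"), 136468)`. If every row for the species lands under na, the -R representative filter has nothing to match.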

jotech commented 5 years ago

Thanks for your answer!

The missing representative tag is really strange because it is actually there in the source assembly report:

curl -s ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/691/195/GCF_001691195.1_ASM169119v1/GCF_001691195.1_ASM169119v1_assembly_report.txt | grep "RefSeq category"
# RefSeq category: Representative Genome

It seems there are inconsistent assembly reports?
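A small Python sketch of that comparison, in case it helps when reporting this to NCBI. It only assumes what is visible in this thread: the report carries a "# RefSeq category:" header line, and the summary file uses na for a missing value. Both function names are made up for illustration.

```python
import re


def report_refseq_category(report_text):
    """Return the value of the '# RefSeq category:' header line, or None if absent."""
    match = re.search(r"^#[ \t]*RefSeq category:(.*)$", report_text, flags=re.MULTILINE)
    if match is None:
        return None
    value = match.group(1).strip()
    return value or None


def categories_agree(summary_value, report_value):
    """Compare both values case-insensitively; 'na' or a missing header never agrees."""
    if report_value is None:
        return False
    summary = summary_value.strip().lower()
    if summary == "na":
        return False
    return summary == report_value.strip().lower()


# The header line returned by the curl command above.
report_text = "# RefSeq category: Representative Genome\n"
report_value = report_refseq_category(report_text)
print(report_value, categories_agree("na", report_value))
```

With the na from the summary file and the value from the report, the two sources disagree for this assembly, which is the inconsistency described above.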

jrjhealey commented 5 years ago

That’s certainly how I would interpret that. There may be a good reason for the na values in the assembly summary, but if there is, I don’t know what it is!

I think it would be worth contacting NCBI over this though in case it is a mistake.

kblin commented 5 years ago

I concur with @jrjhealey, I think this is just an issue on the NCBI side of things.