kblin / ncbi-genome-download

Scripts to download genomes from the NCBI FTP servers
Apache License 2.0

Check if accessions have CDS? #203

Open ElinorSterner opened 1 year ago

ElinorSterner commented 1 year ago

Hello, I want to check whether GCA accessions that I pulled from GenBank have CDSs before filtering further and deciding whether to download them. I used the options --formats cds-fasta to only look at CDS files and -n to check rather than download. However, with -n it returns all the GCAs I input, not just the ones with CDS files.

I want it to check whether a CDS file exists without downloading yet. Is there a way to do this?

Thanks, Elinor

taylorreiter commented 1 year ago

I just had this same question! I ended up doing it outside of ncbi-genome-download. I'm 95% sure my solution is correct :D

The genbank or refseq annotation_hashes.txt file has columns named "Features hash" and "Protein names hash". When the value in either of those columns is "D41D8CD98F00B204E9800998ECF8427E" (the MD5 hash of empty input), that indicates the file does not exist. Note that the annotation_hashes.txt file only exists under subsets of refseq/genbank.

Example URL, e.g. for all GenBank plant genomes: https://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/annotation_hashes.txt
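A rough Python sketch of the check described above, in case it is useful. It assumes the file is tab-separated, that its header line starts with "#", and that the first column holds the assembly accession; the function names are made up for illustration and are not part of ncbi-genome-download. Treating a non-empty "Features hash" as "has a CDS file" is the inference from this comment, not something confirmed by NCBI documentation.

```python
import csv
import urllib.request

# MD5 of empty input, upper-cased as quoted in this thread.
EMPTY_MD5 = "D41D8CD98F00B204E9800998ECF8427E"


def accessions_with_annotation(text, hash_column="Features hash"):
    """Return accessions whose value in `hash_column` is not the empty-input MD5.

    Assumes a tab-separated table with a '#'-prefixed header line and the
    assembly accession in the first column (assumptions, see above).
    """
    lines = text.splitlines()
    if not lines:
        return []
    # Drop the leading '#' from the header and locate the column of interest.
    header = lines[0].lstrip("#").strip().split("\t")
    idx = header.index(hash_column)
    keep = []
    for row in csv.reader(lines[1:], delimiter="\t"):
        if len(row) <= idx:
            continue  # skip blank or truncated lines
        if row[idx].strip().upper() != EMPTY_MD5:
            keep.append(row[0].strip())
    return keep


def fetch_text(url):
    """Download a text file over HTTPS and return its contents as a string."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    url = "https://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/annotation_hashes.txt"
    annotated = set(accessions_with_annotation(fetch_text(url)))
    # Intersect `annotated` with your own list of GCA accessions before downloading.
    print(len(annotated), "accessions with a non-empty features hash")
```

The resulting set can then be written to a file and used as the accession list for the actual download.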