kblin / ncbi-genome-download

Scripts to download genomes from the NCBI FTP servers
Apache License 2.0

'Too few arguments' Error while trying to download genomes with specific taxIDs #79

Closed jananiravi closed 5 years ago

jananiravi commented 5 years ago

Hello, and thanks for making bulk genome download from NCBI so much easier! I am looking to download multiple genomes of the species Mycobacterium tuberculosis (taxid: 1773) from RefSeq/GenBank so that I can construct a pangenome with Roary (https://github.com/sanger-pathogens/Roary).

I have looked into issues #66, #65, #55, and #14 for clarification, but I remain confused as to why I see the 'too few arguments' error when I run ncbi-genome-download with the following options:

ncbi-genome-download -s 'refseq' -F 'gff' -l 'complete' -t "1773,83332" -R 'all' -o mtb-genomes/ -H -n -m METADATA_TABLE -v

I have tried various combinations of these options, and with individual taxIDs too (with both -t and -T). I have also tried the default output folder and dropping the -v, -H, -n (test), and -m options. I always see the following error:

usage: ncbi-genome-download [-h] [-s {refseq,genbank}] [-F FILE_FORMAT]
                            [-l ASSEMBLY_LEVEL] [-g GENUS] [-T SPECIES_TAXID]
                            [-t TAXID] [-A ASSEMBLY_ACCESSIONS]
                            [-R {all,reference,representative}] [-o OUTPUT]
                            [-H] [-u URI] [-p N] [-r N] [-m METADATA_TABLE]
                            [-n] [-N] [-v] [-d] [-V]
                            group
ncbi-genome-download: error: too few arguments

Environment: macOS Mojave; ncbi-genome-download version 0.2.7, installed with an up-to-date pip.

Please advise. Thank you!

jrjhealey commented 5 years ago

You're missing the required positional argument, group (the last item in the usage message), which tells the tool which taxonomic group to download from.

Just append bacteria to the end of your existing command.
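To show why a missing positional argument produces a usage error like the one above, here is a minimal sketch using Python's argparse. It is a toy parser, not ncbi-genome-download's actual code: the program name toy-genome-download and the dest names are made up for illustration, and only the general shape (optional flags plus one required positional called group) follows the usage message. The exact wording of the error depends on the Python version.

```python
import argparse


def build_parser():
    # Toy parser: two optional flags and one required positional, "group".
    # This mirrors only the shape of the usage message, not the real tool.
    parser = argparse.ArgumentParser(prog="toy-genome-download")
    parser.add_argument("-t", dest="taxid", help="comma-separated taxIDs")
    parser.add_argument("-l", dest="assembly_level", default="all")
    parser.add_argument("group", help="taxonomic group, e.g. bacteria")
    return parser


def try_parse(argv):
    """Return the parsed namespace, or None if argparse rejects argv."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        # argparse prints the usage text and exits when a required
        # positional argument is missing.
        return None


# Without the positional argument, parsing fails with a usage error.
missing = try_parse(["-t", "1773,83332", "-l", "complete"])

# Appending the group at the end makes the same options parse cleanly.
fixed = try_parse(["-t", "1773,83332", "-l", "complete", "bacteria"])
```

The options themselves were never the problem in this sketch; the parser simply refuses to continue until the positional is supplied.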

jananiravi commented 5 years ago

That's great! Thank you, @jrjhealey. I completely missed that argument in my list! As a follow-up: if I select, say, M. tuberculosis (1773), would it download all the complete genomes of the strains/species under this parent taxID?

jrjhealey commented 5 years ago

@jananiravi I believe that should work with the options you specified in your first post, yes. You may need to use -T (species taxID) instead of -t (taxID), though; the distinction between the two still causes some confusion, even for me. For instance, see the README usage for downloading all E. coli genomes:

ncbi-genome-download --species-taxid 562 bacteria

If that fails to return all of the genomes though, you can use the script I added in /contrib/gimme_taxa.py to find all of the TaxIDs for each individual species/strain within M. tuberculosis, and then give that file of IDs to ngd instead.

jananiravi commented 5 years ago

I'm getting different numbers of genomes when I check with -n using -t versus --genus, even though the names and taxIDs I'm using map to each other. I wonder why that's happening. I'll use your script to see which taxIDs differ between these two queries.
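One way to pin down such a discrepancy is to collect the assembly identifiers reported by each run and compare them as sets. The sketch below assumes the identifiers have already been extracted into two lists; the values are placeholders, not real accessions, and compare_runs is a hypothetical helper, not part of ncbi-genome-download.

```python
def compare_runs(taxid_run, genus_run):
    """Split two collections of assembly identifiers into shared and run-specific sets."""
    taxid_set, genus_set = set(taxid_run), set(genus_run)
    return {
        "both": sorted(taxid_set & genus_set),
        "taxid_only": sorted(taxid_set - genus_set),
        "genus_only": sorted(genus_set - taxid_set),
    }


# Placeholder identifiers standing in for the output of the two queries.
taxid_run = ["ASM_A", "ASM_B", "ASM_C"]
genus_run = ["ASM_B", "ASM_C", "ASM_D", "ASM_E"]

report = compare_runs(taxid_run, genus_run)
for label, ids in report.items():
    print(f"{label}: {len(ids)} {ids}")
```

The entries that appear in only one run are the ones worth inspecting by hand.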

jananiravi commented 5 years ago

Also, @jrjhealey is there a way to specify only 'completed genomes' as one of the options within gimme_taxa.py? I am checking to see if that's the cause for any discrepancy. I am using the same 'complete' option for both -t and --genus, though.

jrjhealey commented 5 years ago

I’m not at a terminal to test super thoroughly at the moment, so I’m not 100% sure what the distinction between -t and --genus will result in. The genus option is a basic string match though IIRC, so it might be that there are accessions assigned to that taxid that do not have your particular genus string in their metadata names. This is a guess though, @kblin would probably have to weigh in.
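To make that guess concrete, here is a toy illustration of how a name-based match and a taxID-based match can select different records. It is not taken from ncbi-genome-download's source: the records, organism names, and taxIDs are invented placeholders, and the prefix match is only an assumption about what a "basic string match" might look like.

```python
# Invented records for illustration only; none of these are real assemblies.
records = [
    {"id": "ASM_A", "organism_name": "Examplebacter alpha strain 1", "taxid": 101},
    {"id": "ASM_B", "organism_name": "Examplebacter sp. X", "taxid": 999},
    {"id": "ASM_C", "organism_name": "[Otherus] alpha", "taxid": 101},
]


def match_by_genus(records, genus):
    # Assumed behaviour: keep records whose organism name starts with the genus string.
    return {r["id"] for r in records if r["organism_name"].startswith(genus)}


def match_by_taxid(records, taxids):
    # Keep records whose taxID is in the requested set, whatever the name says.
    return {r["id"] for r in records if r["taxid"] in taxids}


by_genus = match_by_genus(records, "Examplebacter")
by_taxid = match_by_taxid(records, {101})

print("genus only:", sorted(by_genus - by_taxid))
print("taxid only:", sorted(by_taxid - by_genus))
```

In this toy data, one record carries the taxID under a different name and another carries the genus name under a different taxID, so the two queries return different counts.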

As for gimme_taxa, I’m afraid it doesn’t offer any such filtering. The idea behind the script originally was to get an exhaustive list of tax ids to pass in to ngd which could then use all the normal filters to just ignore taxids that didn’t meet the criteria. Your complete filter should return the same end result though given an equivalent set of input taxids (I would expect).
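The idea of "exhaustive taxID list first, normal filters second" can be sketched as follows. This is a simplified model rather than the tool's real logic: the records and the select helper are hypothetical, and the assembly-level strings are illustrative values only.

```python
# Hypothetical records; identifiers and taxIDs are placeholders.
records = [
    {"id": "ASM_A", "taxid": 101, "assembly_level": "Complete Genome"},
    {"id": "ASM_B", "taxid": 102, "assembly_level": "Scaffold"},
    {"id": "ASM_C", "taxid": 103, "assembly_level": "Complete Genome"},
    {"id": "ASM_D", "taxid": 999, "assembly_level": "Complete Genome"},
]


def select(records, taxids, assembly_level=None):
    """Keep records whose taxID is listed and, if given, whose level matches."""
    chosen = []
    for rec in records:
        if rec["taxid"] not in taxids:
            continue  # not in the user-supplied taxID list
        if assembly_level is not None and rec["assembly_level"] != assembly_level:
            continue  # in the list, but dropped by the assembly-level filter
        chosen.append(rec["id"])
    return chosen


# An exhaustive list of taxIDs, as a script like gimme_taxa.py might produce.
exhaustive_taxids = {101, 102, 103}

unfiltered = select(records, exhaustive_taxids)
complete_only = select(records, exhaustive_taxids, "Complete Genome")
```

In this model the taxID list only narrows the candidates; it does not override the assembly-level filter, so a listed taxID with no complete assembly contributes nothing.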

jananiravi commented 5 years ago

Ah OK, so feeding the taxIDs output by gimme_taxa.py, along with the complete filter, to ncbi-genome-download should yield the same set of genomes! I will check that out. For some reason, I assumed that a user-specified taxID list would supersede any complete filter specification. Thanks for the clarification! 👍