ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

GenBank or RefSeq #342

Closed: mkdevesh closed this issue 5 months ago

mkdevesh commented 5 months ago

Hi, while downloading with the datasets CLI, I found that FASTA sequences are downloaded for both the GenBank and RefSeq versions of each assembly (accessions starting with GCA and GCF, respectively). The annotation might differ between the two, but the FASTA sequences are the same. So it would be useful to select either of them to avoid duplicated fasta sequences for those who only require FASTA from one source. Here is the command I used:

datasets download genome taxon 470 --assembly-level complete --annotated --filename Coli2_dataset.zip

Thanks

ericcox1 commented 5 months ago

Hi @mkdevesh,

Thanks for opening this issue.

it would be useful to select either of them to avoid duplicated fasta sequences

You can do this now using the --assembly-source flag as follows:

To select only RefSeq genomes:

datasets download genome taxon 470 --assembly-level complete --annotated --assembly-source refseq

To select only GenBank genomes:

datasets download genome taxon 470 --assembly-level complete --annotated --assembly-source genbank

For more information, see the command-line tool reference page for datasets download genome taxon.
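If an archive was already downloaded without --assembly-source, the two sources can still be separated afterwards by accession prefix (GCF for RefSeq, GCA for GenBank, as noted in the question). The following is only a minimal sketch: the function name filter_by_source is hypothetical and not part of the datasets CLI, and it simply filters a list of accessions read from standard input.

```shell
# Hypothetical helper, not part of the datasets CLI.
# Reads assembly accessions on stdin (one per line) and prints only those
# from the requested source, judged purely by accession prefix:
#   refseq  -> GCF_
#   genbank -> GCA_
filter_by_source() {
  case "$1" in
    refseq)  prefix="GCF_" ;;
    genbank) prefix="GCA_" ;;
    *) echo "usage: filter_by_source refseq|genbank" >&2; return 2 ;;
  esac
  # grep exits 1 when nothing matches; treat an empty result as success.
  grep "^${prefix}" || true
}
```

For example, assuming the archive stores each genome under ncbi_dataset/data/<accession>/ (an assumption about the zip layout, worth checking against your own download), the RefSeq accessions could be listed with something like: unzip -Z1 Coli2_dataset.zip | awk -F/ '$1=="ncbi_dataset" && $2=="data" && NF>3 {print $3}' | sort -u | filter_by_source refseq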

Best, Eric

Eric Cox, PhD [Contractor] (he/him/his)
NCBI Datasets
Sequence Enhancements, Tools and Delivery (SeqPlus)
NIH/NLM/NCBI
eric.cox@nih.gov