Closed mkdevesh closed 5 months ago
Hi @mkdevesh,
Thanks for opening this issue.
"it would be useful to select either of them to avoid duplicated fasta sequences"
You can do this now using the --assembly-source flag, as follows:
To select only RefSeq genomes:
datasets download genome taxon 470 --assembly-level complete --annotated --assembly-source refseq
To select only GenBank genomes:
datasets download genome taxon 470 --assembly-level complete --annotated --assembly-source genbank
For more information, see the command-line tool reference page for datasets download genome taxon.
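The same flag can be combined with the --filename option from the original command. The sketch below is illustrative only: build_download_cmd is a hypothetical helper (not part of the datasets CLI) that assembles the command for one source and prints it for review rather than running it. It assumes the two source values shown above, refseq and genbank.

```shell
#!/bin/sh
# build_download_cmd is a hypothetical helper, not part of the datasets CLI.
# It prints the download command for a single assembly source so it can be
# reviewed before being run by hand.
build_download_cmd() {
  source_db=$1   # "refseq" (GCF accessions) or "genbank" (GCA accessions)
  outfile=$2     # name of the zip archive to write

  # Reject anything other than the two source values used in this thread.
  case "$source_db" in
    refseq|genbank) ;;
    *) echo "source must be refseq or genbank" >&2; return 2 ;;
  esac

  printf 'datasets download genome taxon 470 --assembly-level complete --annotated --assembly-source %s --filename %s\n' \
    "$source_db" "$outfile"
}

# Print the RefSeq-only variant of the command from the original post.
build_download_cmd refseq Coli2_dataset.zip
```

Printing the command instead of executing it keeps the sketch safe to try without the CLI installed; copy the printed line into a terminal to start the actual download.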
Best, Eric
Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets Sequence Enhancements, Tools and Delivery (SeqPlus) NIH/NLM/NCBI eric.cox@nih.gov
Hi,
While downloading with the datasets CLI, I found that fasta sequences are downloaded for both GenBank and RefSeq accessions (starting with GCA and GCF, respectively). While the annotation might be different, the fasta sequences are the same. So it would be useful to select either of them to avoid duplicated fasta sequences for those who only require fasta from one source. Here is the command I used:
datasets download genome taxon 470 --assembly-level complete --annotated --filename Coli2_dataset.zip
Thanks