ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

dataset cannot find assembly file #87

Closed Jigyasa3 closed 2 years ago

Jigyasa3 commented 2 years ago

Hey all!

Thank you for this great resource! I am interested in downloading assemblies for a large list of BioProject IDs using datasets. When I run it on a single BioProject ID, datasets cannot find the assembly, even though assemblies are associated with that BioProject on NCBI.

Commands used:

/flash/BourguignonU/Tool/datasets download genome accession PRJNA508395 --filename PRJNA508395.zip

/flash/BourguignonU/Tool/datasets download genome accession PRJEB22522 --filename PRJEB22522.zip

When I try the example command from the tutorial, /flash/BourguignonU/Tool/datasets download genome accession PRJEB35331 --filename test.zip, I get the zip file, which shows that the command itself works.

The two BioProjects have assemblies associated with them.

Jigyasa3 commented 2 years ago

When I run datasets on an individual assembly accession from one of the BioProjects mentioned above, it throws an error.

Command used:

/flash/BourguignonU/Tool/datasets download genome accession GCA_011075415.1 --filename GCA_011075415.1.zip --exclude-gff3 --exclude-protein --exclude-rna

Error:

Some of the assemblies provided ('GCA_011075415.1') are valid NCBI Assembly Accessions,
but are not in scope for NCBI Datasets.

Error: Input accessions not specified

Does datasets work only for certain BioProjects?

ericcox1 commented 2 years ago

Hi Jigyasa,

Thanks for your feedback.

Both of the BioProjects (PRJNA508395 and PRJEB22522) and the Assembly record (GCA_011075415.1) that you are looking for describe metagenomic assemblies and we currently exclude metagenomes from NCBI Datasets. But you’re not the first person to ask for this data! Based on user feedback such as yours, we are planning to add these genomes to NCBI Datasets, tentatively in the next few months.

You can find more information about which genomes are included and which genomes are excluded from NCBI Datasets here: NCBI Datasets Available genomes

Best, Eric

Eric Cox, PhD [Contractor] (he/him/his)
NCBI Datasets
Sequence Enhancements, Tools and Delivery (SeqPlus)
NIH/NLM/NCBI
eric.cox@nih.gov

Jigyasa3 commented 2 years ago

Hey @ericcox1

Thank you so much for replying and pointing me to the right documentation on what datasets covers. Looking forward to using this resource on metagenomes!

On a similar note, is there an existing tool for batch-downloading assemblies from multiple BioProjects? I have a list of >100 BioProjects that I want to examine, and I really don't want to download the data manually.

Looking forward to your advice!

ericcox1 commented 2 years ago

You're welcome.

To download data for assemblies from multiple BioProjects, you can provide a list of BioProject accessions as follows:

datasets download genome accession --inputfile bioproject_list.txt
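For anyone following along, here is a minimal sketch of how the input file for that command might be prepared. It assumes the file holds one BioProject accession per line, uses the two accessions from this thread as placeholders, and combines --inputfile with the --filename flag used earlier in the thread. The names list.tmp and bioprojects.zip are made up for illustration, and the download step is skipped when the datasets binary is not on PATH.

```shell
#!/bin/sh
# Write one BioProject accession per line.
# These two accessions are the ones mentioned in this thread; replace them with your own list.
cat > bioproject_list.txt <<'EOF'
PRJNA508395
PRJEB22522
EOF

# Drop blank lines and duplicate accessions so each BioProject is requested once.
# list.tmp is a hypothetical scratch file name.
grep -v '^[[:space:]]*$' bioproject_list.txt | sort -u > list.tmp
mv list.tmp bioproject_list.txt

# Run the download only if the datasets CLI is available.
# bioprojects.zip is a hypothetical output name.
if command -v datasets >/dev/null 2>&1; then
    datasets download genome accession --inputfile bioproject_list.txt --filename bioprojects.zip
else
    echo "datasets not found on PATH; skipping download" >&2
fi
```

With a list of more than 100 BioProjects, a single zip file can get large. If one archive per BioProject is preferable, the single-accession form shown earlier in the thread could instead be run once per line of the file.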

ericcox1 commented 2 years ago

Hi Jigyasa,

I wanted to let you know that metagenomes are now available in Datasets. Querying by the BioProject (PRJNA508395, PRJEB22522) and Assembly (GCA_011075415.1) accessions you mentioned above will now return the complete set of genome data.

You can also see these metagenomes on our Genomes page, for example, for PRJEB22522.

I’m going to close this issue for now. Please let me know if you have any other feedback.

Best, Eric