Cannot retrieve genome data

gladyspoon320 commented 11 months ago

Hello,

I am attempting to use the command-line tool datasets.exe on Windows 64-bit to download raw sequencing data from the bioprojects PRJNA648656 and PRJNA648656, both of which return 'error: no assemblies found that match selection'. These aren't recently published projects. Could I get some help with how this can be solved?

Many thanks, Gladys

syntheticgio commented 11 months ago

Hi Gladys,

Sorry to hear you're having problems. Both of those bioprojects are the same unless I'm just not seeing a subtle difference. But anyway, for bioproject PRJNA648656 there isn't an associated genome/assembly (so the error message is indeed accurate).

From your question, however, I believe you may be trying to download the experimental raw sequencing data from this project (there are 303 experiments). Currently, to do this you would have to go through the SRA Toolkit (https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit), which provides various executables including one for Windows. It is possible that access to SRA data will be included with the datasets command line tool in the future, but currently the two need to be accessed in different ways.

Fortunately it is pretty easy. First you need to download an executable (from the above link to SRA Toolkit). From there it will depend on what you want exactly, but assuming you want all of the SRX (Experiments) for this bioproject, you can use the prefetch tool. All of the tools, including prefetch, will be in a bin directory in your extracted SRA Toolkit.

In your case you'll want to run .\prefetch.exe PRJNA648656 for one of the bioprojects, assuming you are in the same directory as the executable (otherwise you can put it in the PATH and use it anywhere). This will download the bioproject's data into a folder.
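
If you end up scripting this step, a minimal Python sketch of the prefetch call might look like the following. The helper names, the downloads folder, and the SRR000001 run accession are placeholders made up for illustration; -O is prefetch's output-directory option. I have only assumed here that you pass one accession per call, so check which accession types your prefetch version accepts. The sketch only builds and prints the command, so nothing is downloaded.

```python
import shutil


def build_prefetch_command(accession, out_dir, exe="prefetch"):
    """Build the argument list for one prefetch call (nothing is executed here)."""
    # -O sets the directory that prefetch downloads into.
    return [exe, accession, "-O", out_dir]


def is_on_path(exe="prefetch"):
    """Return True if the executable can be found through PATH."""
    return shutil.which(exe) is not None


# Hypothetical accession and folder, for illustration only.
command = build_prefetch_command("SRR000001", "downloads")
print(" ".join(command))
```

On Windows you would pass exe="prefetch.exe" (or the full path into the bin directory), and hand the list to subprocess.run(command, check=True) once is_on_path reports that the tool can be found.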

After this you'll probably want to expand it from the compressed format into reads of some sort. Depending on what you want to do, the fasterq-dump tool might work for your case. You use it in the same way as prefetch (after you've run prefetch!). In this example, run fasterq-dump PRJNA648656 in the same directory where you downloaded the files with prefetch (there will be a folder with the bioproject ID).
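
With 303 experiments you may prefer to convert the downloads in a loop. Here is a rough Python sketch under one assumption I have not verified for this project: that prefetch leaves one folder per run accession (names like SRR..., ERR..., DRR...) in the download directory. The function names and folder names are hypothetical; -O is fasterq-dump's output-directory option. The code only builds the commands and does not run them.

```python
import re
from pathlib import Path

# Run accessions look like SRR123, ERR123 or DRR123.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def find_run_dirs(download_dir):
    """Return the run-accession folder names directly under download_dir, sorted."""
    return sorted(
        entry.name
        for entry in Path(download_dir).iterdir()
        if entry.is_dir() and RUN_ACCESSION.match(entry.name)
    )


def build_fasterq_dump_commands(download_dir, fastq_dir, exe="fasterq-dump"):
    """Build one fasterq-dump argument list per prefetched run (nothing is executed)."""
    # -O sets the directory the FASTQ files are written to.
    return [[exe, run, "-O", fastq_dir] for run in find_run_dirs(download_dir)]
```

Each command is meant to be run from inside the download directory, for example with subprocess.run(cmd, cwd=download_dir, check=True), so that the run name resolves to the prefetched folder. When doing that, give fastq_dir as an absolute path, since a relative one would be interpreted inside the download directory.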

From there you can explore the directory and should find the FASTQ files for the bioproject's experiments. It is possible that your specific requirements mean you'll want to use a different tool than fasterq-dump (I believe in almost all cases you will use prefetch to get started). There is documentation at the SRA Toolkit download link to help you determine that.

Hopefully this helps.

gladyspoon320 commented 10 months ago

Hi John,

Thank you so much for the clarification! I looked up the project IDs from two different publications but maybe they belong to the same project indeed. I'll try it again via SRA Toolkit.

Many many thanks, Gladys

syntheticgio commented 10 months ago

Hopefully that works for you! Datasets is working to bring all of the NCBI data under a single umbrella, but it is a journey of a thousand miles :) Feel free to reach out if you run into more problems - I'm not an expert with SRA Toolkit but happy to try to help if I can.

ncbi / datasets

Cannot retrieve genome data #287