ncbi / sra-tools


fasterq-dump only returns one file for a paired-end sample #399

Closed Brunox13 closed 3 years ago

Brunox13 commented 4 years ago

I am trying to download files from project GEO series GSE132044. From what I can tell, most (if not all) of the associated runs are paired-end with multiple fastq files deposited, but I only get a single fastq file every single time.

For example, an associated run SRR9169172 - when I use

fasterq-dump SRR9169172

I only get SRR9169172.fastq, while I expect 3 files. I have tried different runs from this project (most of these runs are single cell RNA-seq), and running the command with different --split parameter values, but always with the same result.

I was wondering if this could have been a similar issue as described here, where something similar happened because only 1 (bam) file associated with a 10X run had originally been deposited by the authors, but this does not seem to be the case because the "Original Format" of the files deposited is 3 fastq files, not a single bam.

So what am I missing here? Is this some sort of an error?

Edit: I also tried downloading the Original Format files using prefetch --type fastq SRR9169172 but that also resulted in a single file.

Edit 2: I updated my sra-tools 2.10.0 -> 2.10.8 and prefetch --type fastq SRR9169172 now retrieves the 3 original fastq files, as expected! fasterq-dump still results in a single fastq file, however. Both downloads run WAY faster than with the previous sra-tools version.

wraetz commented 4 years ago

SRR9169172 has 3 fragments per spot. They are labeled like this: technical - biological - technical. You can see this yourself if you run: 'vdb-dump SRR9169172 -R1 -C READ_TYPE'. By default, fasterq-dump ignores the technical reads. You can force the technical reads to be written out with 'fasterq-dump SRR9169172 --include-technical'.
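The default filtering wraetz describes can be sketched as a conceptual model. This is illustrative Python only - fasterq-dump itself is written in C, and the function name here is invented:

```python
# Conceptual model of fasterq-dump's default read filtering
# (illustrative only; 'reads_to_emit' is an invented name).

BIOLOGICAL = "SRA_READ_TYPE_BIOLOGICAL"
TECHNICAL = "SRA_READ_TYPE_TECHNICAL"

def reads_to_emit(read_types, include_technical=False):
    """Return the indices of the per-spot fragments a dump would write.

    read_types: fragment types as reported by
    'vdb-dump <accession> -R1 -C READ_TYPE'.
    """
    return [i for i, t in enumerate(read_types)
            if include_technical or t == BIOLOGICAL]

# SRR9169172 has 3 fragments per spot: technical - biological - technical.
srr9169172 = [TECHNICAL, BIOLOGICAL, TECHNICAL]

print(reads_to_emit(srr9169172))                          # [1] -> one file
print(reads_to_emit(srr9169172, include_technical=True))  # [0, 1, 2] -> three files
```

With `--split-files` (or `-S`), each emitted index then goes to its own output file, which is why `--include-technical -S` yields three files for this run.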

Brunox13 commented 4 years ago

Oh, I see - thank you! Running fasterq-dump SRR9169172 --include-technical -S indeed yielded the three files, as desired.

The two "technical" reads are the sample and cell barcodes. I guess I did not expect these to be considered "technical" (and thus excluded by default), given that they are required for any sort of meaningful interpretation of the scRNA-seq data.

I'm wondering if there would be a way to change this default at least for single cell data (or for all cases), or to make it more apparent in the documentation that for single cell data, users will most likely want to use this option? This seems to be a change in defaults from fastq-dump and a lot of people are likely to be confused by this!

For reference - this is the relevant portion from the HowTo: fasterq dump wiki page:

Because we have changed the defaults to be different and more meaningful than fastq-dump, here is a list of equivalent command-lines, but fasterq-dump will be faster.

fastq-dump SRRXXXXXX --split-3 --skip-technical
fasterq-dump SRRXXXXXX

fastq-dump SRRXXXXXX --split-spot --skip-technical
fasterq-dump SRRXXXXXX --split-spot

fastq-dump SRRXXXXXX --split-files --skip-technical
fasterq-dump SRRXXXXXX --split-files

fastq-dump SRRXXXXXX
fasterq-dump SRRXXXXXX --concatenate-reads --include-technical
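The equivalence list above can be captured as a small lookup table. This is a hypothetical helper for illustration, not part of sra-tools:

```python
# Hypothetical translation table from fastq-dump flag sets to the
# equivalent fasterq-dump flags, per the wiki excerpt above.

EQUIVALENTS = {
    frozenset({"--split-3", "--skip-technical"}): [],
    frozenset({"--split-spot", "--skip-technical"}): ["--split-spot"],
    frozenset({"--split-files", "--skip-technical"}): ["--split-files"],
    frozenset(): ["--concatenate-reads", "--include-technical"],
}

def fasterq_equivalent(accession, fastq_dump_flags):
    """Build the fasterq-dump command line equivalent to a fastq-dump call."""
    extra = EQUIVALENTS[frozenset(fastq_dump_flags)]
    return ["fasterq-dump", accession] + extra

print(fasterq_equivalent("SRRXXXXXX", ["--split-3", "--skip-technical"]))
# ['fasterq-dump', 'SRRXXXXXX']
```

Note that the mapping for a bare `fastq-dump SRRXXXXXX` needs two extra flags, which is exactly the surprise discussed in this issue: the new default skips technical reads.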
wraetz commented 4 years ago

The label 'technical' for the 1st and 3rd fragment originates from the submitter of the run; it was specified in the metadata of the submission. But you are right that this special case should be mentioned in the documentation. Changing defaults is a difficult subject - you cannot make everybody happy. We will probably just add it to the documentation.

klymenko commented 3 years ago

@Brunox13, do you still need help?

Brunox13 commented 3 years ago

@klymenko No, I am all set - thank you! I've left this issue open until the change is made in the documentation but otherwise, feel free to close.

Biglinboy commented 3 years ago

@Brunox13 @klymenko @wraetz Hi, I downloaded files from project GEO series GSE136230, and when running the command fasterq-dump --include-technical -S SRR8856836.1, the following error occurred:

2021-03-17T09:53:08 fasterq-dump.2.9.6 err: cmn_iter.c cmn_iter_open_db().VDBManagerOpenDBRead( 'SRR8856836.1' ) -> RC(rcVFS,rcMgr,rcOpening,rcDirectory,rcNotFound)
2021-03-17T09:53:08 fasterq-dump.2.9.6 err: sorter.c run_producer_pool(): row_count == 0!
2021-03-17T09:53:08 fasterq-dump.2.9.6 err: sorter.c execute_lookup_production() -> RC(rcVDB,rcNoTarg,rcConstructing,rcParam,rcInvalid)
2021-03-17T09:53:08 fasterq-dump.2.9.6 err: fasterq-dump.c produce_lookup_files() -> RC(rcVDB,rcNoTarg,rcConstructing,rcParam,rcInvalid)

Before that, using the command fastq-dump --split-3 --gzip SRR8856836.1, I could only get a single fastq file, while I expect 3 files since the run is single-cell RNA-seq. Can I get your help?

durbrow commented 3 years ago

SRR8856836 has 152723266 spots, all with one read of length 98. This is not a technical code problem. I will notify the SRA curation team of this.

cattapre commented 3 years ago

I am trying to download the raw data from the study GSE159107, which are single-cell RNA-seq data files and thus should consist of 3 files each (R1, R2, and I1). However, I tried every single possibility listed above with both fastq-dump and fasterq-dump, and all of the attempts gave me one single fastq file. I am using sratoolkit version 2.11.0. Could you please help? What should I do? Thanks.

klymenko commented 3 years ago

@cattapre, please email this question to sra-tools@ncbi.nlm.nih.gov

FEI38750 commented 3 years ago

I am trying to download the raw data from the study GSE159107, which are single-cell RNA-seq data files and thus should consist of 3 files each (R1, R2, and I1). However, I tried every single possibility listed above with both fastq-dump and fasterq-dump, and all of the attempts gave me one single fastq file. I am using sratoolkit version 2.11.0. Could you please help? What should I do? Thanks.

I have the same problem with 10x single-cell RNA-seq data downloaded from SRA. Please let me know if you have received any reply.

klymenko commented 3 years ago

@FEI38750, email this question to sra-tools@ncbi.nlm.nih.gov.

cattapre commented 3 years ago

I am trying to download the raw data from the study GSE159107, which are single-cell RNA-seq data files and thus should consist of 3 files each (R1, R2, and I1). However, I tried every single possibility listed above with both fastq-dump and fasterq-dump, and all of the attempts gave me one single fastq file. I am using sratoolkit version 2.11.0. Could you please help? What should I do? Thanks.

I have the same problem with 10x single-cell RNA-seq data downloaded from SRA. Please let me know if you have received any reply.

I received a reply from sra-tools@ncbi: many scRNA-seq studies actually deposit BAM files, not the very raw data. That is the case in this study.

klymenko commented 3 years ago

Do you still need help?

cattapre commented 3 years ago

Do you still need help?

No, thanks! The problem was solved. The issue was with the files deposited, not fasterq-dump.

dgoekbuget commented 1 year ago

Hi, I failed to download SRR14853531 into the expected 3 spot files (R1, R2, I1) using fasterq-dump. However, the following works:

~/sratoolkit.3.0.1/bin/fastq-dump --split-files --gzip SRR14853531

Before that I tried the following, always resulting in a single fastq file:

fasterq-dump --split-files SRR14853531
fasterq-dump --split-3 SRR14853531
fasterq-dump SRR14853531

What am I doing wrong?

durbrow commented 1 year ago

SRR14853531 has two technical reads and one biological read. If you believe this is an error, please email sra@ncbi.nlm.nih.gov and let the curators know.

By default, fasterq-dump ignores technical reads. This is the opposite of fastq-dump. You will need to tell fasterq-dump to output the technical reads.

cparsania commented 1 year ago

Hi,

I cannot separate R1 and R2 with the command below, although the data is paired-end.

fasterq-dump -e 24 -p SRR13207026.sra

wraetz commented 1 year ago

Hi, unfortunately SRR13207026 is not paired-end. Just run "vdb-dump SRR13207026.sra -R1". You will see there are 2 reads per spot - but only one of them is biological; the other one is technical and of length zero.

READ_LEN: 150, 0
READ_START: 0, 150
READ_TYPE: SRA_READ_TYPE_BIOLOGICAL, SRA_READ_TYPE_TECHNICAL
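The check wraetz describes - counting the reads that are both biological and non-empty in the vdb-dump output - can be sketched like this (a hypothetical helper, not part of sra-tools):

```python
# Sketch: decide from the READ_LEN/READ_TYPE columns (as printed by
# 'vdb-dump <accession> -R1') whether a spot yields two usable mates.
# 'usable_biological_reads' is an invented name for illustration.

def usable_biological_reads(read_lens, read_types):
    """Count reads that are biological AND have non-zero length."""
    return sum(1 for n, t in zip(read_lens, read_types)
               if t == "SRA_READ_TYPE_BIOLOGICAL" and n > 0)

# SRR13207026: READ_LEN: 150, 0 / READ_TYPE: BIOLOGICAL, TECHNICAL
count = usable_biological_reads(
    [150, 0],
    ["SRA_READ_TYPE_BIOLOGICAL", "SRA_READ_TYPE_TECHNICAL"])
print(count)  # 1 -> only one usable read, so not paired-end
```

A run only splits into R1/R2 when this count is 2, regardless of what the experiment's declared layout says.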

By the way: anything beyond "-e 8" does not improve speed. We have tried - on 96-core machines. It just exhausts I/O bandwidth.

cparsania commented 1 year ago

Thanks @wraetz. All good for now.

dmalzl commented 1 year ago

Hi,

I have the same problem as @cattapre, which is that the raw data was deposited as BAM instead of FASTQ. I am using sra-toolkit version 2.11.0. Currently I am fetching the data using prefetch and then converting with fasterq-dump, but due to the deposited data format it seems unable to split the reads. The command I am using for fasterq-dump is

  fasterq-dump \
      --split-files --include-technical \
      --threads 6 \
      --outfile SRX11966331_SRR15669711 \
      SRR15669711

Is there a way to make this work for BAM deposited raw data?

wraetz commented 1 year ago

There is no way to make this work, because the submitter of the data did not submit 2 reads per spot, only one. It does not happen very often, but sometimes the run-browser says that the data has 2 reads per spot while the actual data does not. This run is not one of them - the run-browser clearly says that the run has only 1 read per spot. [https://www.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR15669711&display=metadata]

dmalzl commented 1 year ago

Okay, that's a pity. But it clearly says paired layout, so there should be two reads. Is there a preferred way to download such data so that one ends up with two read files? Thanks in advance.

wraetz commented 1 year ago

Sorry, it does not say paired layout anywhere on the run-browser. Where did you see that mentioned?

wraetz commented 1 year ago

Ah, I see - it says that for the experiment. Whoever entered that did so wrongly, or maybe some of the runs in the experiment are of paired layout, but not all. Anyway - if it is misleading, please contact the data curators at NCBI.

dmalzl commented 1 year ago

Okay thanks. I guess I will then just fetch the original BAMs from the ENA FTP.

dmalzl commented 12 months ago

Just for future visitors of this issue with the same problem. The answer of the SRA is as follows:

For 10X data we provision it "provisionally loaded" (eg, not processed for sratoolkit).

For the data processing for the 10x bams, are you using bamtofastq from 10x?

wget https://sra-pub-src-1.s3.amazonaws.com/SRR22746972/HIS_DIS_01M_LS_Der.bam.1
--2023-10-18 11:48:10--  https://sra-pub-src-1.s3.amazonaws.com/SRR22746972/HIS_DIS_01M_LS_Der.bam.1
Resolving sra-pub-src-1.s3.amazonaws.com (sra-pub-src-1.s3.amazonaws.com)... 54.231.226.33, 54.231.130.225, 52.217.42.148, ...
Connecting to sra-pub-src-1.s3.amazonaws.com (sra-pub-src-1.s3.amazonaws.com)|54.231.226.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12331634392 (11G) [binary/octet-stream]
Saving to: 'HIS_DIS_01M_LS_Der.bam.1'
100%[=============================================================================================================================>] 12,331,634,392 21.9MB/s   in 10m 51s
2023-10-18 11:59:02 (18.1 MB/s) - 'HIS_DIS_01M_LS_Der.bam.1' saved [12331634392/12331634392]

bamtofastq_linux HIS_DIS_01M_LS_Der.bam.1 SRR22746972_origs
bamtofastq v1.4.1
Writing finished.  Observed 233902606 read pairs. Wrote 233902606 read pairs

HIS_DIS_01M_LS_Der_GRCh38_0_1_AAAK7VYM5]$ ls
bamtofastq_S1_L001_R1_001.fastq.gz  bamtofastq_S1_L001_R1_004.fastq.gz  bamtofastq_S1_L001_R2_002.fastq.gz  bamtofastq_S1_L001_R2_005.fastq.gz
bamtofastq_S1_L001_R1_002.fastq.gz  bamtofastq_S1_L001_R1_005.fastq.gz  bamtofastq_S1_L001_R2_003.fastq.gz
bamtofastq_S1_L001_R1_003.fastq.gz  bamtofastq_S1_L001_R2_001.fastq.gz  bamtofastq_S1_L001_R2_004.fastq.gz

The gist being that 10x-processed BAMs can be converted back to FASTQ using their bamtofastq tool. So you can just download the submitted BAM and then use bamtofastq to convert the data back to its original state.

Brunox13 commented 12 months ago

The gist being that 10x-processed BAMs can be converted back to FASTQ using their bamtofastq tool. So you can just download the submitted BAM and then use bamtofastq to convert the data back to its original state.

Correct - I provided more details on how to do that in the following StackExchange reply: https://bioinformatics.stackexchange.com/a/15523/4446

K-Djordjevic commented 2 months ago

Hi,

I also encountered some issues - or rather, difficulties - when downloading data with the SRA toolkit, especially when downloading data in batch mode (it took me some time to figure this out properly). However, on this particular issue, there is one question from my side:

Besides that, I am still a bit shocked at how poor the documentation can be when data are made publicly available, and at how hard it is to navigate this field of bioinformatics, especially for the most critical step of pre-processing. It's a shame and in fact a huge waste of time and money if you always feel insecure about whether you have interpreted everything correctly for your own analysis. Sorry for the long comment, but I felt I had to mention this last point. Thanks for your help and support.

dmalzl commented 2 months ago

@kristinadjordjevic723

Looking at the sample descriptions again, one seems to be plain paired-end RNA-seq (the one with two reads) and the other looks like single-cell RNA-seq, where the third read is probably a barcode sequence (given its length and my experience with such data, but you may want to consult the study's methods section to be sure).

Regarding your frustration vent:

This is a common problem in any data-driven field (there is never enough metadata ;)), and in sequence bioinformatics especially, because the repositories grew historically from being created for microarray data (at least GEO, with SRA as a sort of offspring, I guess) into sequencing archives, and the MINSEQE rules are unfortunately just the bare minimum.

I often approach this by going through all the sources I have for a sample - usually the linked GEO record and the publication's methods section - and trying to puzzle it together that way. In any case, preprocessing should always be done by you. At least in my experience it is often not well documented, so I usually start from the raw data and do everything myself. It is a bit more work, but it saves you a lot of guesswork and additionally makes you more familiar with the data, also in terms of quality.

K-Djordjevic commented 2 months ago

@dmalzl Many thanks for your response! Yes, I agree with you, and starting from raw data is something I have done from the very beginning when entering this field. My concern is not so much the repositories - they grew historically and it's great we have them - but rather the publication side: how little attention and time is sometimes given to the methods section, yet such papers can still be published in high-impact journals. That is certainly a bigger topic, though, and not one for here. Many thanks again for the prompt answer; the technical replicates and reads are clearer to me now.

dmalzl commented 2 months ago

I see, and yes, this is very annoying, but unfortunately it is part of our profession. I agree that it is almost comical how the computational side in particular often feels neglected. Have a great rest of the week though!

stineaj commented 2 months ago

@dmalzl Thank you for your response and assistance with the data question. Indeed, both the volume and the specificity of metadata are a longtime concern. It is difficult to keep the metadata fields relevant without creating a controlled vocabulary that is too large for the typical submitter and quickly full of antiquated methods. SRA does not tend to add controlled vocabulary very frequently, so the submitters' free-text descriptions become important for understanding the data. However, most SRA submissions are not manually curated, and submitters vary quite a bit in how much they are willing (or sometimes permitted) to enter in the text descriptions of their metadata. Keeping the barrier to submission low while also keeping data quality and usability high can sometimes be opposing requirements.

stineaj commented 2 months ago

@dmalzl and @kristinadjordjevic723, the concerns you have both expressed regarding metadata and display are something SRA would like to talk with you more about, if you are willing. However, the GitHub issues are intended for SRA Toolkit software questions rather than submission questions. If you would be willing to discuss this further, I recommend emailing sra@ncbi.nlm.nih.gov. I will let the helpdesk know there may be emails incoming on this topic.