Closed Brunox13 closed 3 years ago
SRR9169172 has 3 fragments per spot, labeled technical - biological - technical. You can see this yourself by running 'vdb-dump SRR9169172 -R1 -C READ_TYPE'. By default, fasterq-dump ignores technical reads; you can force them to be written out with 'fasterq-dump SRR9169172 --include-technical'.
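For scripted downloads, the READ_TYPE inspection described above can be automated. A minimal sketch, assuming a POSIX shell; the sample line below only mimics the vdb-dump output for SRR9169172, since in a real pipeline you would capture the command's actual output instead:

```shell
#!/bin/sh
# Count technical fragments in a READ_TYPE line so a script can decide
# whether to pass --include-technical to fasterq-dump.
# Illustrative sample; in practice use:
#   read_type=$(vdb-dump SRR9169172 -R1 -C READ_TYPE)
read_type='READ_TYPE: SRA_READ_TYPE_TECHNICAL, SRA_READ_TYPE_BIOLOGICAL, SRA_READ_TYPE_TECHNICAL'

# grep -o prints each match on its own line; wc -l counts them.
n_technical=$(printf '%s\n' "$read_type" | grep -o 'SRA_READ_TYPE_TECHNICAL' | wc -l)
echo "technical fragments per spot: $n_technical"
```

If the count is greater than zero, add --include-technical (and typically -S / --split-files) to the fasterq-dump call.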
Oh, I see - thank you! Running 'fasterq-dump SRR9169172 --include-technical -S' indeed yielded the three files, as desired.
The two "technical" reads are the sample and cell barcodes. I guess I did not expect these to be considered "technical" (and thus excluded by default), given that they are required for any sort of meaningful interpretation of the scRNA-seq data.
I'm wondering if there is a way to change this default, at least for single-cell data (or for all cases), or to make it more apparent in the documentation that for single-cell data, users will most likely want this option. This is a change in defaults from fastq-dump, and a lot of people are likely to be confused by it!
For reference - this is the relevant portion from the HowTo: fasterq dump wiki page:
Because the defaults were changed to be different from (and more meaningful than) fastq-dump's, here is a list of equivalent command lines; fasterq-dump will be faster.
fastq-dump SRRXXXXXX --split-3 --skip-technical      =  fasterq-dump SRRXXXXXX
fastq-dump SRRXXXXXX --split-spot --skip-technical   =  fasterq-dump SRRXXXXXX --split-spot
fastq-dump SRRXXXXXX --split-files --skip-technical  =  fasterq-dump SRRXXXXXX --split-files
fastq-dump SRRXXXXXX                                 =  fasterq-dump SRRXXXXXX --concatenate-reads --include-technical
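When migrating old pipelines in bulk, the equivalence list above can be encoded as a lookup. This is a toy helper for illustration only, not part of the toolkit:

```shell
#!/bin/sh
# Toy lookup: map a fastq-dump flag combination to the fasterq-dump
# equivalent, following the wiki's equivalence table quoted above.
translate() {
    case "$1" in
        '--split-3 --skip-technical')     echo '(no flags; the default)' ;;
        '--split-spot --skip-technical')  echo '--split-spot' ;;
        '--split-files --skip-technical') echo '--split-files' ;;
        '')                               echo '--concatenate-reads --include-technical' ;;
        *)                                echo 'no documented equivalent' ;;
    esac
}

translate '--split-files --skip-technical'
```

Anything outside the four documented pairs falls through to "no documented equivalent", which is a useful signal that the old command needs manual review.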
The label 'technical' for the 1st and 3rd fragment originates from the submitter of the run. It was specified in the metadata of the submission. But you are right that this special case should be mentioned in the documentation. Changing defaults is a difficult subject: you cannot make everybody happy. We will probably just add it to the documentation.
@Brunox13, do you still need help?
@klymenko No, I am all set - thank you! I've left this issue open until the change is made in the documentation but otherwise, feel free to close.
@Brunox13 @klymenko @wraetz Hi, I downloaded files from GEO series GSE136230, and when running the command 'fasterq-dump --include-technical -S SRR8856836.1', the following error occurred.
2021-03-17T09:53:08 fasterq-dump.2.9.6 err: cmn_iter.c cmn_iter_open_db().VDBManagerOpenDBRead( 'SRR8856836.1' ) -> RC(rcVFS,rcMgr,rcOpening,rcDirectory,rcNotFound)
2021-03-17T09:53:08 fasterq-dump.2.9.6 err: sorter.c run_producer_pool(): row_count == 0!
2021-03-17T09:53:08 fasterq-dump.2.9.6 err: sorter.c execute_lookup_production() -> RC(rcVDB,rcNoTarg,rcConstructing,rcParam,rcInvalid)
2021-03-17T09:53:08 fasterq-dump.2.9.6 err: fasterq-dump.c produce_lookup_files() -> RC(rcVDB,rcNoTarg,rcConstructing,rcParam,rcInvalid)
Previously, I could only get a single fastq file using the command 'fastq-dump --split-3 --gzip SRR8856836.1', while I expect 3 files because the run is single-cell RNA-seq. Can I get your help?
SRR8856836 has 152723266 spots, all with one read of length 98. This is not a technical code problem. I will notify the SRA curation team of this.
I am trying to download the raw data from the study GSE159107, which are single-cell RNA-seq data files and thus should consist of 3 files each (R1, R2 and I1). However, I tried every single possibility listed above with both fastq-dump and fasterq-dump, and all of the attempts gave me one single fastq file. I am using sratoolkit version 2.11.0. Could you please help? What should I do? Thanks.
@cattapre, please email this question to sra-tools@ncbi.nlm.nih.gov
I have the same problem with downloaded 10x single-cell RNA-seq data from SRA. Please let me know if you have received any reply.
@FEI38750, email this question to sra-tools@ncbi.nlm.nih.gov.
I received a message from sra-tools@ncbi: many scRNA-seq studies actually deposit BAM files, not the truly raw data. This is the case for this study.
Do you still need help?
No, thanks! The problem was solved. The issue was with the files deposited, not fasterq-dump.
Hi, I failed to download SRR14853531 into the expected 3 spot files (R1, R2, I1) using fasterq-dump. However, the following works:
~/sratoolkit.3.0.1/bin/fastq-dump --split-files --gzip SRR14853531
Before that, I had tried the following commands, always resulting in a single fastq file:
fasterq-dump --split-files SRR14853531
fasterq-dump --split-3 SRR14853531
fasterq-dump SRR14853531
What am I doing wrong?
SRR14853531 has two technical reads and one biological read. If you believe this is an error, please email sra@ncbi.nlm.nih.gov and let the curators know.
By default, fasterq-dump ignores technical reads. This is the opposite of fastq-dump. You will need to tell fasterq-dump to output the technical reads.
Hi,
I cannot separate R1 and R2 with the command below, although the data is paired-end.
fasterq-dump -e 24 -p SRR13207026.sra
Hi, unfortunately SRR13207026 is not paired end. Just run "vdb-dump SRR13207026.sra -R1". You will see there are 2 reads per spot - but only one of them is biological, the other one is technical and of length zero.
READ_LEN: 150, 0
READ_START: 0, 150
READ_TYPE: SRA_READ_TYPE_BIOLOGICAL, SRA_READ_TYPE_TECHNICAL
By the way: anything beyond "-e 8" does not improve speed. We have tried - on 96-core machines. It just exhausts I/O bandwidth.
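The READ_TYPE check above lends itself to scripting. A minimal sketch; the here-string mirrors the SRR13207026 output quoted above, and in practice you would capture the real command's output instead:

```shell
#!/bin/sh
# Decide from vdb-dump-style output whether a run is truly paired-end,
# i.e. has at least two BIOLOGICAL reads per spot.
# Illustrative sample; in practice use:
#   vdb_out=$(vdb-dump SRR13207026.sra -R1)
vdb_out='READ_LEN: 150, 0
READ_START: 0, 150
READ_TYPE: SRA_READ_TYPE_BIOLOGICAL, SRA_READ_TYPE_TECHNICAL'

# Keep only the READ_TYPE line, then count BIOLOGICAL occurrences.
n_bio=$(printf '%s\n' "$vdb_out" | grep '^READ_TYPE' | grep -o 'SRA_READ_TYPE_BIOLOGICAL' | wc -l)
if [ "$n_bio" -ge 2 ]; then
    echo 'paired-end: fasterq-dump should write _1 and _2 files'
else
    echo "not paired-end: only $n_bio biological read per spot"
fi
```

For SRR13207026 this reports a single biological read, which matches why fasterq-dump produces one file.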
Thanks @wraetz. All good for now.
Hi,
I have the same problem as @cattapre, namely that the raw data was deposited as BAM instead of FASTQ. I am using sra-toolkit version 2.11.0. Currently I am fetching the data with prefetch and then converting with fasterq-dump, but due to the deposited data format it does not seem to be able to split the reads. The command I am using for fasterq-dump is
fasterq-dump \
--split-files --include-technical \
--threads 6 \
--outfile SRX11966331_SRR15669711 \
SRR15669711
Is there a way to make this work for BAM deposited raw data?
There is no way to make this work, because the submitter of the data did not submit 2 reads per spot, only one. It does not happen very often, but sometimes the run-browser says that the data has 2 reads per spot while the actual data does not. This run is not one of them - the run-browser clearly says that the run has only 1 read per spot. [https://www.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR15669711&display=metadata]
Okay that's a pity. But it clearly says paired layout so there must be two reads. Is there a preferred way to download such data such that one ends up with two read files? Thanks in advance
Sorry, it does not say paired layout anywhere on the run-browser. Where did you see that mentioned?
Ah I see - it says that for the experiment. Whoever entered that did so wrongly, or maybe some of the runs in the experiment are of paired layout, but not all... Anyway - if it is misleading please contact the data-curators at NCBI.
Okay thanks. I guess I will then just fetch the original BAMs from the ENA FTP.
Just for future visitors of this issue with the same problem. The answer of the SRA is as follows:
For 10X data we provision it "provisionally loaded" (eg, not processed for sratoolkit).
For the data processing for the 10x bams, are you using bamtofastq from 10x?
wget https://sra-pub-src-1.s3.amazonaws.com/SRR22746972/HIS_DIS_01M_LS_Der.bam.1
--2023-10-18 11:48:10-- https://sra-pub-src-1.s3.amazonaws.com/SRR22746972/HIS_DIS_01M_LS_Der.bam.1
Resolving sra-pub-src-1.s3.amazonaws.com (sra-pub-src-1.s3.amazonaws.com)... 54.231.226.33, 54.231.130.225, 52.217.42.148, ...
Connecting to sra-pub-src-1.s3.amazonaws.com (sra-pub-src-1.s3.amazonaws.com)|54.231.226.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12331634392 (11G) [binary/octet-stream]
Saving to: 'HIS_DIS_01M_LS_Der.bam.1'
100%[=============================================================================================================================>] 12,331,634,392 21.9MB/s in 10m 51s
2023-10-18 11:59:02 (18.1 MB/s) - 'HIS_DIS_01M_LS_Der.bam.1' saved [12331634392/12331634392]
bamtofastq_linux HIS_DIS_01M_LS_Der.bam.1 SRR22746972_origs
bamtofastq v1.4.1
Writing finished. Observed 233902606 read pairs. Wrote 233902606 read pairs
HIS_DIS_01M_LS_Der_GRCh38_0_1_AAAK7VYM5]$ ls
bamtofastq_S1_L001_R1_001.fastq.gz bamtofastq_S1_L001_R1_004.fastq.gz bamtofastq_S1_L001_R2_002.fastq.gz bamtofastq_S1_L001_R2_005.fastq.gz
bamtofastq_S1_L001_R1_002.fastq.gz bamtofastq_S1_L001_R1_005.fastq.gz bamtofastq_S1_L001_R2_003.fastq.gz
bamtofastq_S1_L001_R1_003.fastq.gz bamtofastq_S1_L001_R2_001.fastq.gz bamtofastq_S1_L001_R2_004.fastq.gz
The gist being that 10x-processed BAMs can be converted back to FASTQ using their bamtofastq tool. So you can just download the submitted BAM and then use bamtofastq to convert the data back to its original state.
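One practical follow-up: as the listing above shows, bamtofastq writes each read in numbered chunks, and downstream tools usually want one file per read, so the chunks can simply be concatenated in order. A sketch with dummy uncompressed files standing in for the real .fastq.gz output (concatenating .gz files directly also yields a valid gzip stream):

```shell
#!/bin/sh
# Merge bamtofastq's numbered chunks into one FASTQ per read.
# Dummy 1-record files stand in for real bamtofastq output here.
outdir=SRR22746972_origs
mkdir -p "$outdir"
for r in R1 R2; do
    for i in 001 002 003; do
        # One FASTQ record (4 lines) per dummy chunk.
        printf '@read_%s_%s\nACGT\n+\nIIII\n' "$r" "$i" \
            > "$outdir/bamtofastq_S1_L001_${r}_${i}.fastq"
    done
    # Shell glob expansion sorts the chunks, so order is preserved.
    cat "$outdir"/bamtofastq_S1_L001_"${r}"_*.fastq > "$outdir/merged_${r}.fastq"
done
wc -l "$outdir/merged_R1.fastq"
```

With three 4-line chunks per read, each merged file comes out to 12 lines; on real data, pass the merged (or original chunked) files straight to Cell Ranger or similar.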
Correct - I provided more details on how to do that in the following StackExchange reply: https://bioinformatics.stackexchange.com/a/15523/4446
Hi,
I also encountered some issues, or better to say difficulties, when downloading data with the SRA toolkit, especially when downloading data in batch mode (it took me some time to figure this out properly). However, on this particular issue, there is one point from my side:
Besides, I am still a bit shocked at how bad the documentation is when data are made publicly available, and at how hard it is to navigate this field of bioinformatics, especially for the most critical step of pre-processing. It's a shame and in fact a huge waste of time and money if you always feel insecure about whether you have interpreted everything correctly for your own analysis. Sorry for the long comment, but I felt like I had to mention this last point. Thanks for your help and support.
@kristinadjordjevic723
Looking at the sample descriptions again, one seems to be plain paired-end RNA-seq (the one with two reads), and the other looks like single-cell RNA-seq, where the third read is probably a barcode sequence (given its length and my experience with such data, but you may want to consult the study's methods to be sure).
Regarding your frustration vent:
This is a common problem in any data-driven field (there is never enough metadata ;)), and in sequence bioinformatics especially, because the repositories grew historically from being set up for microarray data (at least GEO, with SRA as a sort of offspring, I guess) to sequencing now, and the MINSEQE rules are unfortunately just the bare minimum. I often approach this by going through all the sources I have for a sample - typically the linked GEO repository and the publication's methods - and trying to puzzle it together that way. In any case, preprocessing should always be done by you; at least in my experience it is often not well documented, so I usually start from raw data and do everything myself. It is a bit more work, but it saves you a lot of guesswork and additionally makes you more familiar with the data, also in terms of quality etc.
@dmalzl Many thanks for your response! Yes, I agree with you, and starting from raw data is something I did from the very beginning when entering this field. My concern is not so much the repositories - they grew historically and it's great we have them - but rather the publication side: how little attention and time is sometimes given to the methods section, yet such work can still be published in high-impact journals. But that is surely too big a topic for here. Many thanks again for the prompt answer; the technical replicates and reads are clearer to me now.
I see and yes this is very annoying but something that unfortunately is part of our profession. But I agree that it is almost comical that especially the computational side often feels neglected. Have a great rest of the week though
@dmalzl Thank you for your response and assistance with the data question. Indeed, both the volume and the specificity of metadata are a longtime concern. It is difficult to keep the metadata fields relevant without creating a controlled vocabulary that is too large for the typical submitter and quickly full of antiquated methods. SRA doesn't tend to add controlled vocabulary very frequently, so the submitters' text descriptions become important for understanding the data. However, most SRA submissions are not manually curated, and submitters vary quite a bit in how much they are willing (or sometimes permitted) to enter in the text descriptions of their metadata. Keeping the barrier to submission low while also keeping the data quality and usability high can sometimes be opposing requirements.
@dmalzl and @kristinadjordjevic723, the concerns you have both expressed regarding metadata and display are something SRA would like to talk with you more about, if you are willing. However, the GitHub issues are intended for SRA Toolkit software rather than submission questions. If you would be willing to talk about this more, I would recommend emailing sra@ncbi.nlm.nih.gov. I will let the helpdesk know there may be emails incoming regarding this topic.
I am trying to download files from project GEO series GSE132044. From what I can tell, most (if not all) of the associated runs are paired-end with multiple fastq files deposited, but I only get a single fastq file every single time.
For example, for the associated run SRR9169172 - when I run fasterq-dump, I only get SRR9169172.fastq, while I expect 3 files. I have tried different runs from this project (most of them are single-cell RNA-seq), and running the command with different --split parameter values, but always with the same result.
I was wondering if this could be a similar issue as described here, where something similar happened because only 1 (bam) file associated with a 10X run had originally been deposited by the authors, but this does not seem to be the case, because the "Original Format" of the files deposited is 3 fastq files, not a single bam.
So what am I missing here? Is this some sort of an error?
Edit: I also tried downloading the Original Format files using 'prefetch --type fastq SRR9169172', but that also resulted in a single file.
Edit 2: I updated my sra-tools from 2.10.0 to 2.10.8, and 'prefetch --type fastq SRR9169172' now retrieves the 3 original fastq files, as expected! fasterq-dump still results in a single fastq file, however. Both downloads run WAY faster than with the previous sra-tools version.