ncbi / sra-tools


Numerous scRNA-seq Datasets Having Potential Corruption Issues Upon Conversion to .SRA Format #889

Open Sandman-1 opened 6 months ago

Sandman-1 commented 6 months ago

Hello. Thank you all for building this publicly available database of genomic and related information. I’m attempting to use a number of scRNA-seq datasets published and uploaded to the SRA by different lab groups for a meta-analysis.

Upon further examination, I am finding that a number of these datasets are facing potential corruption issues. To be specific, I believe that the .SRA files derived from originally uploaded fastqs are flawed, but the originally uploaded fastqs are not. Let me explain.

  1. The datasets I am interested in were created using either 10X Genomics or Singleron Technologies platforms. Upon sequencing of the resulting libraries, paired-end reads are generated and converted from base call files to fastqs labeled as R1 or R2 to denote the strand of cDNA from which their sequences come. Additionally, some of the fastqs uploaded to the SRA are index fastqs labeled as I1 or I2.
  2. To acquire fastqs from the SRA, I have been using several combinations of fasterq-dump with and without prefetch, including: prefetch + fasterq-dump --split-files, prefetch + fasterq-dump --split-files --include-technical, and prefetch + fasterq-dump --split-3.
  3. I find that for any dataset where index fastqs are part of the accession, both the I1 and R1 fastqs are marked as technical reads, so only the --split-files + --include-technical combination allows fasterq-dump to output 3 files. However, these files from 2 separate datasets (GSE132771 and GSE189357) have not run successfully in CellRanger preprocessing. The resulting error message always says that the fastq record for one of the files is too long, pointing at a specific line in that specific file.
  4. This prompted me to reach out to 10X Genomics for clarification. They responded that any fastq set resulting in this error message is likely corrupted and incomplete, and recommended that I check whether the number of lines is the same across all files for each sample using the command zcat | wc -l. I have been doing this for a number of samples across both aforementioned accession numbers, and I have yet to come across a sample where the number of lines is the same even across the R1 and R2 files alone, much less all three of the I1, R1, and R2 files. This indicates that a different number of reads is stored in each fastq file of a set.
  5. Since there is no way that I know of to remove singletons using fast(er)q-dump from files with index and/or technical reads in paired-end datasets, I thought this problem revolved around the presence of unpaired reads in the originally submitted data and was thus unique to such datasets. I then shifted over to three other datasets where only R1 and R2 files were present (GSE203360, GSE135893, and GSE136831). I used prefetch + fasterq-dump with the --split-3 parameter on samples from these accessions and still found differing numbers of lines. I was quite confused because, as I said, I thought that the only problem was the presence of unpaired reads within the fastqs themselves. Looking back on it, this is probably where I should have realized that something else was wrong, but regardless I tried the following.
  6. A standalone tool for finding read mates in fastqs exists, called fastq-pair. I tried using the program on a number of samples from all 5 of these datasets and found that, first, the program runs extremely slowly given the large size of this data, and second, it still didn’t work entirely as expected. For the first human sample from GSE132771, for example, fastq-pair flagged over half of both the _2 (R1) and _3 (R2) fastqs created by fasterq-dump as consisting of singletons. It also didn’t produce paired files of equal length: the line counts still differed by around 300.
  7. At this point, I transferred the original files from this first human sample within GSE132771 to my Google Cloud account to see if they were the source of the problem, and upon downloading them I found that the line numbers for all three files (I1, R1, and R2) were exactly the same: 195479260. A picture of this is attached.

IMG_2987
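The line-count comparison from steps 4 and 7 can be made slightly stricter by counting complete fastq records and checking each record's shape at the same time. The following is only a minimal sketch, assuming plain 4-line records (no wrapped sequences); the function names `open_fastq`, `count_fastq_records`, and `compare_read_counts` are hypothetical helpers, not part of the SRA Toolkit or CellRanger.

```python
import gzip
from pathlib import Path


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_fastq_records(path):
    """Count 4-line records in a FASTQ file, raising ValueError on a malformed one.

    Assumes each record is exactly four lines: @header, sequence, +, quality.
    """
    records = 0
    with open_fastq(path) as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # clean end of file
            seq = handle.readline().rstrip("\r\n")
            plus = handle.readline()
            qual = handle.readline().rstrip("\r\n")
            line_no = records * 4 + 1  # 1-based line of this record's header
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"{path}: malformed record at line {line_no}")
            if len(seq) != len(qual):
                raise ValueError(
                    f"{path}: sequence/quality length mismatch at line {line_no}"
                )
            records += 1
    return records


def compare_read_counts(paths):
    """Return ({path: record_count}, all_equal) for a set of FASTQ files."""
    counts = {str(p): count_fastq_records(p) for p in paths}
    return counts, len(set(counts.values())) == 1
```

Calling `compare_read_counts` on the I1, R1, and R2 files of one sample returns the per-file record counts and a flag saying whether they all agree. A `ValueError` points at the first record that is truncated or whose quality string does not match its sequence length, which may help narrow down where a file goes wrong.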

After all this, I have concluded that there is likely a problem with the existing .SRA files for the vast majority, if not all, of the samples from several scRNA-seq datasets in the SRA. I would be happy to provide more information about this matter. I used version 3.0.5 of the SRA Toolkit for Linux and ran it on the Windows Subsystem for Linux on my Windows computer. I sent an email about one of the datasets (GSE189357) to sra@ncbi.nlm.nih.gov last Tuesday, to which I have not yet received a response. However, now that I am realizing the potential scale of this file issue within the SRA, I thought I would make a GitHub post about it in case other users have been experiencing similar problems. I would greatly appreciate assistance and feedback from anyone on this matter.

Skanda Hebbale
Medical School Candidate
Computational Biologist in the Lab of Dr. Luke Norton at UTHealth San Antonio

Sandman-1 commented 6 months ago

Hello, I hope everyone had a festive holiday yesterday. Just wondering if anyone was able to look into this issue.

I found yet another dataset that might be facing this issue, GSE154826. Out of the 96 scRNA-seq samples, the first run already generates fastqs of unequal length when using prefetch + fasterq-dump. I am now also using the latest version of the SRA Toolkit (v3.0.10).

IMG_3005
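When two fastqs from the same run differ in length, one inexpensive diagnostic is to walk both files in lockstep and report the first record where the read names stop matching. This is a rough sketch and not a replacement for fastq-pair: it assumes plain 4-line records, assumes both files are in the same read order, and treats the first whitespace-delimited token of each header (minus any trailing /1 or /2) as the read name. The names `read_ids` and `first_mismatch` are hypothetical.

```python
import gzip
from itertools import zip_longest


def read_ids(path):
    """Yield the read identifier of each 4-line FASTQ record in `path`."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 0:  # header line of each record
                parts = line[1:].split()
                name = parts[0] if parts else ""
                # Drop a trailing /1 or /2 mate suffix if present.
                if name.endswith(("/1", "/2")):
                    name = name[:-2]
                yield name


def first_mismatch(path_a, path_b):
    """Return (record_index, id_a, id_b) at the first divergence, else None.

    record_index is 1-based. An ID of None means that file ended first.
    """
    pairs = zip_longest(read_ids(path_a), read_ids(path_b))
    for index, (id_a, id_b) in enumerate(pairs, start=1):
        if id_a != id_b:
            return index, id_a, id_b
    return None
```

A result of `None` means the two files list the same read names in the same order. Otherwise, the returned record index shows roughly where the files fall out of sync, which can help distinguish a single missing mate from a file that is truncated or interleaved differently.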

It would be greatly appreciated if someone could resolve this matter of potential corruption upon conversion from the originally submitted files to the database format. GSE154826 is a dataset where recovering the originally submitted files would require downloading 7 TB of fastqs from a cloud service provider. This would cost an insane amount of money that I’m not prepared to pay when the data is supposed to be publicly available.