Open Sandman-1 opened 6 months ago
Hello, I hope everyone had a festive holiday yesterday. Just wondering if anyone was able to look into this issue.
I found yet another dataset that may be affected by this issue: GSE154826. Of its 96 scRNA-seq samples, the first run already yields FASTQ files of unequal length after prefetch + fasterq-dump. I am now using the latest version of the SRA Toolkit as well (v3.0.10).
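For anyone who wants to check their own files for the same symptom, here is a minimal Python sketch. It assumes that "unequal length" means a different number of reads per file, and that the FASTQ files use the common four-lines-per-record layout. The function names are my own and are not part of the SRA Toolkit.

```python
import gzip
from pathlib import Path


def count_fastq_records(path):
    """Count records in one FASTQ file (plain or .gz).

    Assumes the common layout of exactly 4 lines per record;
    multi-line FASTQ records are not handled by this sketch.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # A truncated or malformed file ends up here.
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def compare_fastq_counts(paths):
    """Return (counts, all_equal) for the FASTQ files of a single run."""
    counts = {str(p): count_fastq_records(p) for p in paths}
    all_equal = len(set(counts.values())) <= 1
    return counts, all_equal
```

Calling `compare_fastq_counts` on the files that fasterq-dump wrote for one run (for example, hypothetical names such as `RUN_1.fastq` and `RUN_2.fastq`) returns the per-file read counts and a flag that is `False` when they differ.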
It would be greatly appreciated if someone could look into this potential corruption during conversion from the originally submitted files to the database format. For GSE154826, recovering the originally submitted files would require downloading 7 TB of FASTQ files from a cloud service provider. That would cost an insane amount of money, which I’m not prepared to pay when the data is supposed to be publicly available.
Hello. Thank you all for building this publicly available database of genomic and related information. I’m attempting to use a number of scRNA-seq datasets, published and uploaded to the SRA by different lab groups, for a meta-analysis.
Upon further examination, I am finding that a number of these datasets are facing potential corruption issues. To be specific, I believe that the .SRA files derived from originally uploaded fastqs are flawed, but the originally uploaded fastqs are not. Let me explain.
After all this, I have concluded that there is likely a problem with the existing .SRA files for the vast majority, if not all, of the samples from several scRNA-seq datasets in the SRA. I would be happy to provide more information about this matter. I used version 3.0.5 of the SRA Toolkit for Linux, running it under the Linux subsystem on my Windows computer. I sent an email about one of the datasets (GSE189357) to sra@ncbi.nlm.nih.gov last Tuesday and have not yet received a response. However, now that I realize the potential scale of this file issue within the SRA, I thought I would make a GitHub post about it in case other users have been experiencing similar problems. I would greatly appreciate assistance and feedback from anyone on this matter.
Skanda Hebbale
Medical School Candidate
Computational Biologist in the Lab of Dr. Luke Norton at UTHealth San Antonio