Not able to extract per cell data from smart-seq2 data (continuing issue from #266)

tkarginov commented 1 year ago

Hi!

I was granted access to the phs001680.v1.p1 data from dbGaP (original paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6641984/). I downloaded all the scRNA-seq (Smart-seq2) files with the prefetch command and have been trying to convert the SRA files to FASTQ files (using fastq-dump), but I have not been able to get per-cell output from this step. This is the same issue that someone reported before (#266), which appears to be unresolved.

As suggested in that thread, I tried running fastq-dump with --split-files on one of the SRR files, and still ended up with 2 FASTQ files:

$ fastq-dump -G --split-files SRR9611283

I also tried splitting the sam-dump'ed BAM with samtools split:

$ samtools split SRR9611283.bam

Neither of these options worked: they still generate 1-2 FASTQ/BAM files instead of the expected one file per cell. I know there must be multiple cells in each sequencing run because the sample ID matrix for the TPMs deposited in GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE120575) has ~400-600 cells per site, totaling about 16,000 cells.
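For context on the samtools attempt: samtools split partitions a BAM by read group (@RG header lines), so it can only produce per-cell files if each cell was given its own read group. A minimal sketch for checking that, assuming samtools is installed and the BAM was produced by sam-dump as above; the helper name count_read_groups is hypothetical, not part of samtools or sra-tools:

```shell
# Hypothetical helper: count @RG (read group) lines in SAM header text on stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function usable under "set -e".
count_read_groups() {
    grep -c '^@RG' || true
}

# Intended usage (not run here; requires samtools and the BAM file):
#   samtools view -H SRR9611283.bam | count_read_groups
```

A count of 0 or 1 would mean the file carries no per-cell read groups, in which case samtools split has nothing to split on.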

According to the study, the Smart-seq2 libraries were pooled and sequenced together, as expected. When I contacted the study's author, I was told that FASTQ files had apparently been deposited for individual cells. However, each "tumor site" has only 2 FASTQ files, and I still can't find any identifiers or reference for these sets.

Ideally, I would like to split the SRA files to "one file per cell".

I have also contacted SRA and dbGaP curators, one of whom suggested that the individual BAM files are saved and can be accessed via AWS. This did not work either: when I went to download the SRA run into my AWS bucket, there was no option to split it by original BAM file. I simply get the original SRA file, which is the same file I had already tried to work with on my local cluster after prefetch.

I think I am overlooking something here. Any help would be greatly appreciated.

Thanks very much! Tima

durbrow commented 1 year ago

"Combined libraries from 384 cells were then sequenced on a NextSeq 500 sequencer (Illumina), using paired-end 38-base reads"

This is how the data is stored in SRA. There may be some other metadata that was stored to identify the 384 cells, but I can't see the data to tell you.

tkarginov commented 1 year ago

I've dug rather extensively into the protocols for this dataset, including finding a sample data sheet of custom index barcodes in a prior paper, but that doesn't help with the current paper. I scoured the entirety of the provided metadata, as pointed out, and tried commands including fastq-dump --split-files --origfmt --defline-seq '@rd.$si:$sg:$sn' to get the original headers, as well as fasterq-dump --split-files --include-technical to see whether a technical read was included. There is still no information on indices or on how to split these cells. Any other thoughts? Thanks, Tima

tkarginov commented 1 year ago

For anyone accessing this dataset in the future, we've resolved the issue. The files were incorrectly deposited to SRA as pooled BAM files and lost all barcoding information. The dataset has now been split into individual BAM files, with each BAM file corresponding to one cell. The authors did not deposit FASTQ files, so only the BAM files are available at this time, but FASTQ files can be obtained from them with the samtools bam2fq command. Cheers, Tima
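The BAM-to-FASTQ step mentioned above can be sketched roughly as follows. This assumes a recent samtools (where bam2fq is an alias of samtools fastq and collate accepts -O without a temp prefix), paired-end reads as described in the paper, and per-cell BAMs in the current directory. The helper names fastq_prefix and bam_to_fastq and the output naming scheme are hypothetical:

```shell
# Hypothetical helper: derive an output prefix from a per-cell BAM path,
# e.g. /data/cell_001.bam -> cell_001
fastq_prefix() {
    basename "$1" .bam
}

# Convert one paired-end BAM into <prefix>_R1.fastq and <prefix>_R2.fastq.
# collate groups mates together so that samtools fastq can pair them;
# singletons and unflagged reads are discarded via /dev/null.
bam_to_fastq() {
    b2f_prefix=$(fastq_prefix "$1")
    samtools collate -u -O "$1" |
        samtools fastq -1 "${b2f_prefix}_R1.fastq" -2 "${b2f_prefix}_R2.fastq" \
            -0 /dev/null -s /dev/null -n -
}

# Intended usage (not run here; requires samtools and the BAM files):
#   for bam in *.bam; do bam_to_fastq "$bam"; done
```

If the deposited BAMs are aligned, the recovered FASTQ files contain only the reads the submitter kept, so they may differ from the original sequencer output.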

ncbi / sra-tools

Not able to extract per cell data from smart-seq2 data (continuing issue from #266) #741