Smart-Seq2 data

snaqvi1990 commented 7 years ago

Hi,

I am currently dealing with a version of Smart-Seq2 data which is 150x150bp and has, on read 2, positions 1-8 as the cellular barcode and positions 9-16 as the UMI. For pseudoalignment with kallisto, I'm wondering if you have any thoughts/experience on aligning with just the first or both reads. On one hand, I'd imagine that having both reads would increase mappability, but then on the other hand they would ultimately be collapsed to the same UMI.

Thanks, Sahin
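For reference, the read layout described above (cell barcode at positions 1-8 and UMI at positions 9-16 of read 2) could be handled with something like the sketch below. It is only an illustration, not the tool's own implementation: the function names are made up for this example, and the `:CELL_...:UMI_...` header suffix is an assumed naming convention that should be checked against whatever the downstream counting step expects.

```python
from itertools import islice


def split_barcode_umi(read2_seq, bc_len=8, umi_len=8):
    """Return (cell_barcode, umi) taken from the start of a read 2 sequence."""
    if len(read2_seq) < bc_len + umi_len:
        raise ValueError("read 2 is shorter than barcode + UMI")
    return read2_seq[:bc_len], read2_seq[bc_len:bc_len + umi_len]


def fastq_records(lines):
    """Yield (header, sequence, quality) tuples from an iterable of FASTQ lines."""
    it = iter(lines)
    while True:
        rec = list(islice(it, 4))
        if len(rec) < 4:
            return
        header, seq, _, qual = (x.rstrip("\n") for x in rec)
        yield header, seq, qual


def tag_read1(read1_lines, read2_lines, bc_len=8, umi_len=8):
    """Yield read 1 records whose header carries the barcode and UMI from read 2.

    Assumes both inputs list the same reads in the same order.
    The header suffix format is an assumption made for this sketch.
    """
    pairs = zip(fastq_records(read1_lines), fastq_records(read2_lines))
    for (h1, s1, q1), (_, s2, _) in pairs:
        bc, umi = split_barcode_umi(s2, bc_len, umi_len)
        read_id = h1.split()[0]  # drop the Illumina comment after the space
        yield f"{read_id}:CELL_{bc}:UMI_{umi}", s1, q1
```

With real data, `read1_lines` and `read2_lines` would be open file handles (for example from `gzip.open(path, "rt")`), and each yielded record would be written back out as four FASTQ lines.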

vals commented 7 years ago

Hi Sahin,

With Smart-seq2, each transcript will generate many cDNA fragments. At what stage are you adding the UMIs to fragments? I have never heard of data like this before, and you will definitely need to do some custom processing.

Depending on when and where UMIs are added, you can do different things. If you describe exactly the steps of how the Smart-seq2 libraries were made, I can give some guidance.

In general, Kallisto works much better with paired reads.

snaqvi1990 commented 7 years ago

Hi Valentine,

Thanks for the reply. I'm actually reanalyzing data from a published paper (http://www.cell.com/cell-stem-cell/abstract/S1934-5909(17)30174-1). They claim to use a "modified Smart-seq2 protocol"; pasting from the methods below:

Briefly, after MACS and FACS purification, a single FGC or gonadal somatic cell was placed into the lysis buffer by mouth pipette. The reverse transcription reaction was performed with 25 nt oligo(dT) primer anchored with an 8 nt cell-specific barcode (Table S2) and 8 nt unique molecular identifiers (UMIs) (Hashimshony et al., 2012; Islam et al., 2012, 2014; Klein et al., 2015). After the first-strand synthesis, the second-strand cDNAs were synthesized, and the cDNAs were amplified by 17 cycles of PCR. The amplified cDNAs of the single cells were then pooled together for the following steps. Biotinylated pre-indexed primers were used to further amplify the PCR product by an additional 4 cycles of PCR to introduce biotin tags to the 3' ends of the amplified cDNAs. Approximately 300 ng cDNA was sheared to approximately 300 bp by Covaris S2, and the 3' terminal of the cDNA was captured by Dynabeads MyOne Streptavidin C1 beads (Thermo Fisher). The RNA-seq library was constructed using a Kapa Hyper Prep Kit (Kapa Biosystems) and subjected to 150 bp paired-end sequencing on an Illumina HiSeq 4000 platform (sequenced by Novogene).

roryk commented 7 years ago

One way you could handle this would be to use the UMI to call consensus reads with https://github.com/fulcrumgenomics/fgbio and then run whatever standard RNA-seq quantification pipeline you want downstream.

vals commented 7 years ago

Hi Sahin,

Cool! I didn't know this protocol. Since they are capturing the 3' ends of cDNA with beads, there will only be one fragment per transcript.

Some other protocols have a little information in the other read as well, but I elected to ignore this for the fastqtransform because it was usually just ~10 bases or so. But here it makes sense to use that information.

It would make sense to just create two transformed fastq's (for forward and reverse), then map the pair with RapMap. We should add a filter in tagcount that ignores alignment records from the reverse read when counting.

If it's a one-off thing, I would take the pseudosam from RapMap and filter out all the reverse read alignment records before counting with tagcount. I think you can do this with Samtools, but you can probably also do this with grep.
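As a rough sketch of the one-off filtering described above, the snippet below drops alignment records for one mate from a SAM stream before counting. It assumes the "reverse" read is the one flagged as second in pair (SAM flag bit 0x80), which depends on the order the two fastq files were given to the mapper; the function name is made up for this example.

```python
SECOND_IN_PAIR = 0x80  # SAM flag bit: last segment in the template


def drop_second_in_pair(sam_lines):
    """Yield SAM header lines and records that are not flagged second in pair.

    Assumption: the read to ignore was passed to the mapper as mate 2.
    If it was passed as mate 1, test bit 0x40 instead.
    """
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & SECOND_IN_PAIR:
            continue
        yield line
```

Under the same assumption, `samtools view -h -F 0x80` on the pseudosam should give an equivalent result. A plain grep is harder to get right, because the mate information lives in the bitwise flag column rather than in the read name.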

snaqvi1990 commented 7 years ago

Hi Valentine,

Sounds good, I will try filtering the sam for now.

I have not come across RapMap before (have stuck to Kallisto). Glancing at the RapMap page, it seems like they should be more or less equivalent in this setting, although I'm not sure about how quasi-mapping would/wouldn't make a difference here. Is there a particular reason you would recommend RapMap over Kallisto?

Thanks a lot for your help.

vals commented 7 years ago

Hi Sahin,

Kallisto should be fine, I'm just more familiar with the output of RapMap. When I tested things a couple of years ago RapMap was faster and used less RAM, but I haven't compared recent versions. (Back then Kallisto didn't have the --pseudo command for example.)

roryk commented 5 years ago

Seems like this is closeable.

vals / umis

Smart-Seq2 data #40