sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0

Trimmed FASTQ as input to zUMIs #286

Closed: hsriniva11 closed this issue 3 years ago

hsriniva11 commented 3 years ago

I'm currently running Smart-Seq3 samples. Is there any way to input trimmed FASTQ files (after trimming the mosaic end sequence)?

cziegenhain commented 3 years ago

I typically trim Tn5 adapters during STAR mapping by passing '--clip3pAdapterSeq CTGTCTCTTATACACATCT' in the STAR parameters section of the yaml file.

hsriniva11 commented 3 years ago

That is in STAR. I want to trim the 5' end so that zUMIs can find the 8 bp UMI sequence properly even when ATTGCGCAATG is not at the beginning of the read.

cziegenhain commented 3 years ago

Sorry, this sounds a bit cryptic to me. You will need to describe properly what you mean and what you are planning to do. In the case of Smart-seq3, the UMI will always be preceded by the pattern recognition sequence of the TSO, and zUMIs will take care of it appropriately. You can also set the number of mismatches you are willing to tolerate (see the changelog of v2.9.5: https://github.com/sdparekh/zUMIs#changelog). Best, Christoph
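To make the idea concrete, here is a minimal Python sketch of pattern-anchored UMI extraction with a mismatch tolerance. It is not the zUMIs implementation. The function names and the default tolerance of one mismatch are assumptions chosen for illustration; only the pattern ATTGCGCAATG and the 8 bp UMI length come from this thread.

```python
PATTERN = "ATTGCGCAATG"  # pattern sequence quoted in this thread
UMI_LEN = 8              # 8 bp UMI, as stated in this thread


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def extract_umi(read: str, max_mismatches: int = 1):
    """Return the UMI if the read starts with PATTERN, else None.

    Hypothetical helper: max_mismatches is an illustrative default,
    not a zUMIs setting.
    """
    if len(read) < len(PATTERN) + UMI_LEN:
        return None  # too short to hold the pattern plus a UMI
    if hamming(read[:len(PATTERN)], PATTERN) > max_mismatches:
        return None  # read does not start with the pattern
    return read[len(PATTERN):len(PATTERN) + UMI_LEN]
```

A read that instead starts with AGATGTGTATAAGAGACAG fails the pattern check under this sketch and yields no UMI.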

hsriniva11 commented 3 years ago

The issue isn't the trimming here. I'm not able to input FASTQ files with reads of uneven length to run Smart-Seq3 with zUMIs. Is there any way I can do that?

cziegenhain commented 3 years ago

I'd be able to help better if you were a bit more clear here. If only the cDNA portion is of variable length, that is supported in zUMIs: just set the cDNA range to the full read length (e.g. cDNA(23-150)). However, you need to make sure that all reads have at least 24 bases after your trimming, since the length of the cDNA portion cannot become 0.
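One way to enforce the minimum length before running zUMIs is to drop reads that became too short after trimming. The following is a minimal sketch, assuming plain four-line FASTQ records. The function names are hypothetical, and the threshold of 24 is taken from the comment above.

```python
from itertools import islice

MIN_LEN = 24  # minimum read length after trimming, per the comment above


def fastq_records(lines):
    """Yield (header, sequence, plus, quality) tuples from FASTQ lines."""
    it = iter(lines)
    while True:
        rec = list(islice(it, 4))
        if len(rec) < 4:
            return  # end of input or incomplete trailing record
        yield tuple(line.rstrip("\n") for line in rec)


def keep_long_enough(lines, min_len=MIN_LEN):
    """Return only the records whose sequence has at least min_len bases."""
    return [rec for rec in fastq_records(lines) if len(rec[1]) >= min_len]
```

For paired-end data, the mate of every dropped read would also need to be removed so that the two files stay in sync; this sketch does not handle that.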

hsriniva11 commented 3 years ago

Here's a screenshot of the library structure

[Image: Screen Shot 2021-09-28 at 3.22.09 PM]

These are the results from an adapter-trimming program for my FASTQ files:

---------------------
First read: Adapter 1
---------------------

Sequence            Type       Length Trimmed (x)
------------------- ---------- ------ -----------
AGATGTGTATAAGAGACAG regular 5'     19      91,616

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1

As you can see, ~90k reads start with AGATGTGTATAAGAGACAG rather than ATTGCGCAATG or cDNA sequence. To handle this, I use a trimmer. Does this help?

cziegenhain commented 3 years ago

As you can imagine, I'm slightly familiar with the library structure for Smart-seq3 ;) Is this 90k reads out of many millions of reads?

Since this part of the sequence should be part of the Illumina sequencing primer for read 1, reads are not expected to start with it. Seeing it there would point to an issue in the library or sequencing (e.g. concatemers). Thus I wouldn't recommend including such reads in the UMI counting, as they might be artefacts.
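To put the ~90k figure in context, the share of affected reads can be measured directly. Here is a minimal Python sketch with a hypothetical function name. It takes an iterable of read sequences and reports how many begin with the adapter sequence from the trimming report above.

```python
ADAPTER = "AGATGTGTATAAGAGACAG"  # sequence from the trimming report above


def adapter_start_fraction(seqs, adapter=ADAPTER):
    """Return (hits, total, fraction) of sequences starting with the adapter.

    Uses exact prefix matching only; the trimming report above allowed
    one error for 10-19 bp matches, which this sketch does not replicate.
    """
    hits = 0
    total = 0
    for seq in seqs:
        total += 1
        if seq.startswith(adapter):
            hits += 1
    fraction = hits / total if total else 0.0
    return hits, total, fraction
```

A very small fraction would be consistent with these reads being rare artefacts rather than a systematic feature of the library.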

hsriniva11 commented 3 years ago

Thanks for your timely help!