Counts from TEtranscript STAR differ from previous quantification

SBata commented 2 years ago

hi there

I just ran TEtranscripts using a simple 3 control vs. 3 treated RNASeq samples. I previously used those samples for other studies and quantified the reads using STAR. I used the parameters for STAR suggested in the tutorial, and the script is:

-- STAR for TEtranscript--
STAR --genomeDir STAR_Genome_2.7.9a \
--readFilesCommand gunzip -c \
--runThreadN 8 \
--sjdbGTFfile Homo_sapiens.GRCh38.91_v2chr.gtf \
--readFilesIn MyVarious.fastq.gz \
--outFileNamePrefix myOut \
--quantMode GeneCounts \
--winAnchorMultimapNmax 100 \
--outFilterMultimapNmax 100 \
--outSAMtype BAM Unsorted

-- TEtranscript --
TEtranscripts \
--format BAM \
--mode multi \
--treatment myPathToAlignedFiles_treatment.out.bam \
--control myPathToAlignedFiles_control_out.bam \
--GTF Homo_sapiens.GRCh38.91_v2chr.gtf \
--TE GRCh38_GENCODE_rmsk_TE.gtf \
--project test_human

-- Original STAR --
STAR --genomeDir pathToMySTARgenome \
--readFilesCommand gunzip -c \
--runThreadN 8 \
--sjdbGTFfile Homo_sapiens.GRCh38.91_v2chr.gtf \
--readFilesIn MyVarious.fastq.gz \
--outFileNamePrefix MyOutName \
--quantMode GeneCounts \
--chimSegmentMin 20 \
--outFilterScoreMinOverLread 0.3 \
--outFilterMatchNminOverLread 0.3

however, when I finished running TEtranscript , I could barely match the gene-level results from TEtranscript (in the xxx..gene_TE_analysis.txt file, and the filtered one), with the actual results I had before (using the Original STAR script above. like, it didn't even detect differential expression in my knockout gene (which I know is there, we test it in any way possible every time we run an experiment).

now, is this because of the parameters needed to run TEtranscript?

olivertam commented 2 years ago

Hi,

Thank you for your interest in the software. Here are some comments/suggestions:

We typically do not use --quantMode when we run our STAR, but I don't see that as a problem, so you can keep it in (it would just take longer, and somewhat redundant as TEtranscripts will try to independently quantify afterwards).
Check whether your library is stranded (you can do that by comparing columns 2, 3, 4 of your STAR geneCounts output for a highly expressed gene like Gapdh or beta actin, which correspond to unstranded, forward-stranded and reverse-stranded counts respectively). If you using verion 2.0.5 or later, the default mode for strandedness is no (unstranded). Thus, if your library is stranded, it is recommended to set --stranded [forward/reverse] accordingly.
Our recommended STAR parameters are mainly to account for multimappers. You can add back the --outFilterScoreMinOverLread and --outFilterMatchNminOverLread parameters to your new STAR run if you like, as our recommendations are intended to supplement your current STAR parameters.
You could also increase --winAnchorMultimapNmax to 150. I don't think this is the cause of the discrepancies, but we have used this on our own data without issue.
If none of the above seems to make a difference, would you mind sharing a couple of lines from both output files? Specifically, from the xxxx.cntTable, xxxx.gene_TE_analysis.txt (TEtranscripts), and the STAR geneCounts (both runs) where your gene of interest (you can mask the name) is expected to be differentially expressed.

Thanks.

SBata commented 2 years ago

Hi Oliver, thank you for the quick reply. I think you hit the nail in the head with the stranded option. the library is stranded and I'd need to use the 4th column from STAR. I went back checking the STAR output PerGeneout.tabper each sample, when I compare that to TEtranscript, what I get is the following:

--- TE transcript for my gene ---
GeneName    KD1 KD2 KD3 WT1 WT2 WT3
ENSG00000xxx    8   8   6   11  2   14

--- STAR output (based on the PerGene.out.tab)
GeneName    Unstr   S1  S2
ENSG00000xxx    1464    8   1456     >KD1
ENSG00000xxx    1513    8   1505     >KD2
ENSG00000xxx    1449    6   1443     >KD3
ENSG00000xxx    4435    11  4424     >WT1
ENSG00000xxx    3788    2   3786     >WT2
ENSG00000xxx    4555    14  4542     >WT3

I can tell you that the WT cells are glowing with this transcript so a count of 2 or 10 wouldn't even make sense biologically.

so this tells me that the in TEtranscript I should use the --stranded forward option to capture the second strand cDNA library, or the 4th column in STAR...makes sense?

olivertam commented 2 years ago

Hi,

S2 in the STAR output means that read 2 (in a paired end) is aligned in the direction of the transcript, which means that the library is reversed, as read 1 (single/paired end) would be in the opposite direction of the transcript. Thus, you would need the --stranded reverse option. Hope this post clears it up.

Thanks.

mhammell-laboratory / TEtranscripts

Counts from TEtranscript STAR differ from previous quantification #101