trinityrnaseq / trinityrnaseq

Trinity RNA-Seq de novo transcriptome assembly
BSD 3-Clause "New" or "Revised" License

ExN50 stats 'worse' in strand-specific analysis #1270

Closed marievalerie closed 1 year ago

marievalerie commented 1 year ago

Dear Brian,

I accidentally analysed (i.e., assembled and mapped) stranded RNA-seq data (dUTP protocol) as unstranded. After noticing, I reanalysed the data set in strand-specific mode. When comparing the ExN50 stats of both assemblies, I wondered why the appropriate parameters resulted in 'worse' ExN50 statistics: the maximum ExN50 value is smaller, and the graph also looks more like that of a data set that was not sequenced deeply enough. I inspected other data sets and found that this is consistent across several of my Trinity assemblies (invertebrates, mainly insects). Further, I saw that the shape of the curve depends slightly on the tool used for abundance estimation.

Normally, I do not worry too much about the E90N50 value unless it is extremely low or the curve is very strange. But now I am not sure whether what I see here really reflects my data/assembly quality, or whether the curve is shaped by analysis parameters and tools.

The question now is: am I overinterpreting the ExN50 value? I only stumbled across this because I wanted to switch from RSEM to salmon for this data set. If I had only had abundance estimates from RSEM, I would probably have proceeded with the stranded assemblies based on the ExN50 stats. Or do these plots indicate that I should explore this further, and maybe analyse the data as unstranded and/or rely on RSEM rather than on pseudoalignment tools?

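[image: ExN50_trinity_assemblies] https://user-images.githubusercontent.com/79691910/224574984-f8ef21ef-5951-4ca6-a99d-041e5f0e808b.png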

Thanks and all the best,

Marie

brianjohnhaas commented 1 year ago

Hi Marie,

I think the issue with the unstranded quant with stranded reconstructions is that strand-specificity is usually not 100%: more typically ~1% of reads are not converted, and those non-converted reads end up assembling separately as 'fake antisense' contigs for the highly expressed transcripts.

I wouldn't be overly concerned about the salmon vs. RSEM differences. I've moved to using salmon myself because it's so much faster and leaves a small disk footprint afterwards, but RSEM has long been considered one of the best, and if you're happy with it, feel free to keep using it.

Your plots look good! Your ExN50 peaks at ~1500 are certainly more useful than the values based on 100% of the data (~500 bases).

best,

~b
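
For readers unfamiliar with the statistic discussed above, here is a minimal Python sketch of how an ExN50 value can be computed: transcripts are ranked by expression, the top set accounting for x% of total expression is kept, and a conventional N50 is taken over that set. This is a simplified illustration, not Trinity's own ExN50 script; the function names `n50` and `exn50` and the `toy` data are hypothetical and invented purely for demonstration.

```python
from typing import List, Tuple


def n50(lengths: List[int]) -> int:
    """Return the N50: the length L such that contigs of length >= L
    contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def exn50(transcripts: List[Tuple[int, float]], x: float) -> int:
    """N50 over the most highly expressed transcripts that together
    account for x percent of the total expression.

    transcripts: (length, expression) pairs, expression e.g. in TPM.
    """
    ranked = sorted(transcripts, key=lambda t: t[1], reverse=True)
    total_expr = sum(expr for _, expr in ranked)
    threshold = total_expr * x / 100.0

    selected = []
    cumulative = 0.0
    for length, expr in ranked:
        if cumulative >= threshold:
            break
        selected.append(length)
        cumulative += expr
    return n50(selected)


# Invented toy data: three long, highly expressed transcripts plus
# twenty short, weakly expressed contigs.
toy = [(2000, 400.0), (1500, 300.0), (1800, 200.0)] + [(300, 5.0)] * 20

print(exn50(toy, 90))   # 1800: only the three long transcripts are counted
print(exn50(toy, 100))  # 300: the short, low-expression contigs dominate
```

In this toy case the many short, weakly expressed contigs pull the value at 100% far below the value at 90%, which is why the peak of the ExN50 curve is usually read rather than the value at 100% of the data.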

marievalerie commented 1 year ago

Thanks for the very helpful and fast reply!