Hi there,
I'm seeing behaviour that could be similar to your issue in data I received from several collaborators. The logs show, for example:

```
Please check MultiQC report: 17/47 samples failed strandedness check
```
When I look in detail at the `infer_experiment` files, I get:
```
This is PairEnd Data
Fraction of reads failed to determine: 0.2379
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0050
Fraction of reads explained by "1+-,1-+,2++,2--": 0.7571

This is PairEnd Data
Fraction of reads failed to determine: 0.2594
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0084
Fraction of reads explained by "1+-,1-+,2++,2--": 0.7323

This is PairEnd Data
Fraction of reads failed to determine: 0.2518
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0050
Fraction of reads explained by "1+-,1-+,2++,2--": 0.7432

This is PairEnd Data
Fraction of reads failed to determine: 0.2775
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0047
Fraction of reads explained by "1+-,1-+,2++,2--": 0.7178

This is PairEnd Data
Fraction of reads failed to determine: 0.2672
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0040
Fraction of reads explained by "1+-,1-+,2++,2--": 0.7288
```
Results like these are not covered by the examples in the rseqc docs on infer_experiment.py. I was wondering:

- Should I consider this data `unstranded`, like in example 1 of the rseqc docs on infer_experiment.py?
- My guess would be `fr-firststrand`, but maybe this is incorrect because of the high amount of reads with undetermined strandedness.
- Where could the undetermined strandedness come from, and can this be prevented in the future?

Regards
Mattias
Hi, thank you for posting these questions! I would love to hear from either of you if you have any further thoughts on the issue! I too am getting similar warnings across the board. In all cases, Salmon marks the samples as "reverse", and infer_experiment.py identifies something like 28% reverse and 75% undetermined. Do you think I should be concerned, or proceed ahead trusting the Salmon identification? Thank you!
This comes up because the auto strand setting comes from Salmon, based on its pseudo-alignment against transcript sequences, while the final strandedness check is based on genomic alignments and RSeQC's assessment.
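Roughly speaking, that check turns RSeQC's fractions into a call along these lines (a minimal sketch; the cutoffs are illustrative assumptions, not the pipeline's exact values):

```python
# Minimal sketch of how fractions from infer_experiment.py map to a
# strandedness call. The 0.8 / 0.1 cutoffs are illustrative assumptions,
# not the exact values used by the pipeline.
def classify_strandedness(sense: float, antisense: float,
                          cutoff: float = 0.8) -> str:
    if sense >= cutoff:
        return "forward"        # e.g. fr-secondstrand
    if antisense >= cutoff:
        return "reverse"        # e.g. fr-firststrand
    if abs(sense - antisense) < 0.1:
        return "unstranded"
    return "undetermined"

# First sample above: ~24% of reads are undetermined, so antisense (0.7571)
# never reaches the cutoff and the call comes out ambiguous.
print(classify_strandedness(sense=0.0050, antisense=0.7571))  # -> "undetermined"
```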
The main source of the discrepancy is the reads of undetermined strand in RSeQC, which play a part in the assessment the pipeline makes based on those statistics, and (possibly) shouldn't. I've opened the above PR to discuss and/or address this.
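To illustrate why the undetermined reads matter: renormalising the first sample above over determined reads only turns an ambiguous result into a clear one (a sketch of the idea, not necessarily the exact logic of the PR):

```python
# Renormalising over determined reads only: the undetermined fraction then
# cannot drag a clearly stranded sample towards "unstranded". Numbers are
# taken from the first sample posted above.
sense, antisense, undetermined = 0.0050, 0.7571, 0.2379

determined = sense + antisense
antisense_norm = antisense / determined

print(f"reverse fraction, raw:          {antisense:.3f}")       # 0.757, ambiguous
print(f"reverse fraction, renormalised: {antisense_norm:.3f}")  # 0.993, clearly reverse
```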
Tackled in https://github.com/nf-core/rnaseq/pull/1306
Wanted to revive this thread for a moment if possible. I recently ran the rnaseq pipeline on some publicly available data, and based on the change in #1306 it seems to me that my data likely has a large amount of genomic DNA contamination. I expect antisense strandedness but am seeing 15-24% sense, 65-70% antisense, and 15-20% undetermined. Based on that distribution, am I right to expect that the samples are of poor quality?
If that is the new understanding from the fix, is there any way to continue working with this data, or is it simply too low quality? Also, since the inferred strandedness is unstranded (likely due to the combined high sense and undetermined percentages), would the alignment be off, meaning I need to re-run the pipeline with strandedness changed from auto to unstranded?
Thanks for any help on this!
So, the way I see it, there are two reasons for results like this (strandedness intermediate between unstranded and 100% stranded):

1. Genomic DNA contamination adding unstranded signal on top of an otherwise stranded library.
2. A problem with the strand selection chemistry in library preparation, so strand information was only partially preserved.
Scenario 1 would be characterised by large proportions of intergenic reads, in addition to the strandedness result, which you shouldn't see in Scenario 2. For example, the test data on the pipeline shows < 10% intergenic reads.
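As a rough way to put numbers on this (a back-of-the-envelope sketch, not a rigorous estimator): if a fraction c of the determined reads behaves as unstranded and the rest is perfectly antisense, the sense share of determined reads is c/2, so the 15-24% sense you reported implies a substantial unstranded admixture.

```python
# Back-of-the-envelope mixture model: suppose a fraction c of the determined
# reads is effectively unstranded (split 50/50 between strands) and the rest
# is perfectly reverse-stranded. Then the sense share of determined reads is
# c / 2, so c can be estimated as 2 * (sense share).
def estimate_unstranded_fraction(sense: float, antisense: float) -> float:
    sense_share = sense / (sense + antisense)  # renormalise over determined reads
    return 2 * sense_share

# Using the 15-24% sense / 65-70% antisense range reported above:
for s, a in [(0.15, 0.70), (0.24, 0.65)]:
    c = estimate_unstranded_fraction(s, a)
    print(f"sense={s:.2f}, antisense={a:.2f} -> unstranded admixture ~{c:.0%}")
# -> roughly 35-54% of determined reads behaving as unstranded
```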
Should you run stranded or unstranded?
Scenario 2 should definitely be run unstranded, to capture the genuine transcriptomic signal scattered over the two strands, but it's harder to say for scenario 1. The contamination could be a big problem for the analysis, and you may not be able to use the data. But I'd be tempted to run it both ways and see what it does to your final findings once you've done the differential expression etc.
Should you use the data at all?
Up to you; it depends how crucial the results are. If it's just a bit of exploratory hypothesis generation, it's probably OK. But you wouldn't want to base a whole research programme on a lot of datasets with genomic DNA contamination (if you decide that's what it is).
Ahh I see. Thanks for the clear answer!
Interestingly, I assumed we would see something more similar to scenario 1, but there are fewer than 10% intergenic reads, closer to 7% on average (10 samples total), which more likely points to a reaction issue.
I'm going to run unstranded on this test data set to better determine whether this is a genomic DNA issue or a reaction failure issue.
Makes sense on using the data at all, seems for my case it will be fine if it's a little contaminated, but I'll need to be careful about extrapolating too much from this.
Description of feature
Hi,
Thanks for all the work on this pipeline.
I have had to analyse several public datasets, and I plan to analyse more. Since the strandedness of such datasets is usually not provided, I use the `strandedness: auto` option in the pipeline to guess it.

Quite often (apologies for not having statistics about that; I could try to get some if needed) I get `WARNING: Fail Strand Check` messages, and I find that Salmon had set the strandedness to "reverse" when infer_experiment.py found it to be "unstranded". When this happens, I set the strandedness to "unstranded" and rerun the pipeline.
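Concretely, that manual workaround amounts to editing the strandedness column of the samplesheet and relaunching; a minimal sketch (file names are placeholders):

```python
# Hypothetical helper (file names are placeholders): rewrite the
# "strandedness" column of an nf-core/rnaseq samplesheet from "auto"
# to "unstranded" before relaunching the pipeline.
import csv

with open("samplesheet.csv", newline="") as fin:
    rows = list(csv.DictReader(fin))

for row in rows:
    row["strandedness"] = "unstranded"

with open("samplesheet.unstranded.csv", "w", newline="") as fout:
    writer = csv.DictWriter(fout, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```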
Would it make sense for the pipeline to just reset the strandedness and rerun automatically?
Thanks again