lbroseus / SIRFindeR

Stable Estimation of Intron Retention levels from RNA-seq data
4 stars 0 forks source link

ResultsByIntron.txt - inconsistent intron reporting #8

Open cbohrson opened 3 years ago

cbohrson commented 3 years ago

I have run SIRFindeR on a subset of RNAseq available from GTEx. From the SIRFindeR vignette, I got the impression that info on the same introns would be available in "ResultsByIntron.txt". As you note in the vignette this would facilitate comparisons across samples.

Consistent with this, there are some rows in "ResultsByIntron.txt" that have all 0s for 'SpliceLeft', 'SpliceRight', 'SpliceMax', 'intronicCount', and 'SIRratio'.

However, after looking at the results more closely I have found that there are small inconsistencies - that is, some files entirely omit introns reported in other files. Comparing 2 samples, the difference is small (0-20) relative to large number of introns reported on (hundreds of thousands). However, across more samples, the union of all reported introns can differ from any one sample by a larger number. For example, looking at 100 GTEx samples, 50-100 introns were unreported in some compared with the union set.

Can you explain what causes an intron to be unreported in "ResultsByIntron.txt"?

All of my samples are run in the same way (same alignment pipeline and same call to computeSIRratio except for bamFile argument).

lbroseus commented 3 years ago

Hi Craig,

Yes, this is an expected behaviour.

According to the SIRFinder method, intronic intervals are defined using two pieces of information: we start from an annotation to infer "theoretical" intron genomic coordinates (common to all samples from a same species) and then make use of (sample-specific) RNA-seq data to refine those bounds (eg: when a novel exon is detected inside a theoretical intron, this intron will be segmented into the two intronic parts from either side of the exon). So that some introns to be analysed can be specific to a sample, depending on the "novel" splicing events that are detected from the corresponding RNA-seq data set.

To allow subsequent comparison between samples, we suggested to restrict only to those introns which are common to all samples included in your analysis. This is this "intersection" which is reported in file "ResultsByIntron.txt". For example, if a novel exon is detected in a theoretical intron but only in a subset of your samples, it will not be reported in "ResultsByIntron.txt", as its definition is not consistent between samples (the intron will be analysed as is in all samples where the splicing event was not detected, and split into two intronic parts in samples where the splicing event could be detected).

I hope this answers your question,

Lucile

cbohrson commented 3 years ago

I follow your answer mostly. Just wanted to clarify one thing. You said:

To allow subsequent comparison between samples, we suggested to restrict only to those introns which are common to all samples included in your analysis. This is this "intersection" which is reported in file "ResultsByIntron.txt".

Just to be 100% clear, all of the inconsistencies I mention are across "ResultsByIntron.txt" files. I.e., the introns in "ResultsByIntron.txt" are not intersections and appear to include sample-specific introns.

So really another way of phrasing my question is: should "ResultsByIntron.txt" files generated from SIRFindeR runs that use the same arguments except bamFile have exactly the same set of introns reported in "ResultsByIntron.txt"?

EDIT: To illustrate the issue, this is the # of lines across a few different samples for ResultsByIntron.txt and SIRratio.txt. My issue is that based on my prior understanding and your explanation I would expect this # to be exactly the same for all ResultsByIntron.txt files from the same organism, and possibly different for SIRratio.txt (due to sample-specific introns). But as you can see below, for multiple human samples that use the same GTF, the number of introns reported by ResultsByIntron.txt varies, and appears to always be consistent with SIRratio.txt.

269801 SRR1076369/SIRFindeR/ResultsByIntron.txt 269801 SRR1076369/SIRFindeR/SIRratio.txt

269804 SRR1076393/SIRFindeR/ResultsByIntron.txt 269804 SRR1076393/SIRFindeR/SIRratio.txt

269805 SRR1076417/SIRFindeR/ResultsByIntron.txt 269805 SRR1076417/SIRFindeR/SIRratio.txt

269805 SRR1076441/SIRFindeR/ResultsByIntron.txt 269805 SRR1076441/SIRFindeR/SIRratio.txt

269806 SRR1076465/SIRFindeR/ResultsByIntron.txt 269806 SRR1076465/SIRFindeR/SIRratio.txt

269795 SRR1076490/SIRFindeR/ResultsByIntron.txt 269795 SRR1076490/SIRFindeR/SIRratio.txt

cbohrson commented 3 years ago

@lbroseus gentle bump, if you have a minute I'd really appreciate your feedback on this

lbroseus commented 3 years ago

Hi Craig,

Sorry, I was mistaken. Yes, for a given sample, the file "ResultsByIntron.txt" will indeed likely contain sample-specific introns, which should exactly match those in "SIRratio.txt".

So as to compare samples, you might want to use the function buildExpSIRmatrix(), which merges only results on introns common to all samples (based on their "ResultsByIntron.txt" file).

Lucile