interpretation of pangolin results

MiqG commented 2 years ago

Hi!

Thanks for making such a great tool!

After reading the paper and starting to use Pangolin, I have some questions about how to interpret the results.

Following your usage example, I ran this version of your BRCA example (note that I added a line without a mutation):

gene,CHROM,POS,REF,ALT
BRCA1,17,41276135,T,T
BRCA1,17,41276135,T,G
BRCA1,17,41276135,T,C

This gives the following output:

gene,CHROM,POS,REF,ALT,Pangolin
BRCA1,17,41276135,T,T,ENSG00000012048.23_1|-50:0.0|-50:0.0|Warnings:
BRCA1,17,41276135,T,G,ENSG00000012048.23_1|-3:0.15|-16:-0.04|Warnings:
BRCA1,17,41276135,T,C,ENSG00000012048.23_1|-1:0.86|-3:-0.83|Warnings:

From your README (gene|pos:largest_increase|pos:largest_decrease|), I understand that pos gives the positions of the largest increase and the largest decrease in splice score, relative to the site where the mutation was introduced.
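To make the field layout concrete, here is a minimal Python sketch that splits one value of the Pangolin column into its parts. It is not part of Pangolin; the function name `parse_pangolin_field` is hypothetical, and it assumes the field holds a single gene followed by `pos:score` pairs and a trailing `Warnings:` entry, as in the output above.

```python
def parse_pangolin_field(field):
    """Split 'gene|pos:score|pos:score|Warnings:...' into (gene, sites, warnings).

    Assumes one gene per field and the layout shown in the output above.
    """
    gene, *rest = field.split("|")
    sites = []
    warnings = ""
    for item in rest:
        if item.startswith("Warnings:"):
            warnings = item[len("Warnings:"):]
        elif item:
            pos, score = item.split(":")
            sites.append((int(pos), float(score)))
    return gene, sites, warnings


gene, sites, warnings = parse_pangolin_field(
    "ENSG00000012048.23_1|-1:0.86|-3:-0.83|Warnings:"
)
(gain_pos, gain), (loss_pos, loss) = sites
print(gene, gain_pos, gain, loss_pos, loss)
```

With the default output, the first pair is the largest increase and the second is the largest decrease, both given as offsets from the variant position.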

Then, in the first case (no mutation), should pos be 0? (of course, this is a special case, I just want to be sure that I understand the output)
How should one interpret the scores for a mutation when the largest increase and the largest decrease have similarly high magnitudes?

For example, the 2nd mutation BRCA1,17,41276135,T,G,ENSG00000012048.23_1|-3:0.15|-16:-0.04|Warnings: generates a splice score gain at position -3 (w.r.t. the mutation site 41276135) that is larger than the splice score loss at position -16. So, in this case, I would expect an increase in PSI, right?

However, for the 3rd case, BRCA1,17,41276135,T,C,ENSG00000012048.23_1|-1:0.86|-3:-0.83|Warnings:, the mutation changes the splice score similarly in both directions and at nearby sites. Should I then interpret this as the mutation changing the PSI because splice sites of similar strength appear, one compensating for the other?
Finally, in your paper, you show two examples predicting the effect of mutations on splicing for the Julien et al. and the Baeza-Centurion et al. datasets. In the methods, you explain that, for each variant, you scored the 5' and 3' splice sites. How did you obtain the score gains/losses in both cases separately? Did you do it by increasing the distance parameter enough to span both sites?

Thank you very much in advance for your help! Cheers,

Miquel

Ni-Ar commented 2 years ago

I have the exact same question regarding interpretation of the output pos:scores nomenclature.

tkzeng commented 2 years ago

Pos gets set to the first minimum or maximum position in cases of ties; in the first example, it looks like everything is predicted to have no change, which is why the first considered position (-50) is returned.
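The tie-breaking behaviour can be illustrated with a small Python sketch. This is not Pangolin's own code; `first_extremes` is a hypothetical helper, and the window of -50 to +50 is assumed only because -50 is the first position reported in the example. Python's built-in `max` and `min` return the first element encountered when several are tied, which reproduces the rule described above.

```python
def first_extremes(offsets, deltas):
    """Return ((pos, largest_increase), (pos, largest_decrease)).

    On ties, max() and min() keep the first index they encounter,
    so the earliest offset in the window is reported.
    """
    indices = range(len(deltas))
    i_gain = max(indices, key=lambda i: deltas[i])
    i_loss = min(indices, key=lambda i: deltas[i])
    return (offsets[i_gain], deltas[i_gain]), (offsets[i_loss], deltas[i_loss])


# Assumed scoring window around the variant: offsets -50 .. +50.
offsets = list(range(-50, 51))

# REF == ALT: no predicted change anywhere, so every position ties at 0.0.
no_change = [0.0] * len(offsets)
print(first_extremes(offsets, no_change))  # ((-50, 0.0), (-50, 0.0))
```

So a reported pos of -50 with a score of 0.0 does not single out that position; it only reflects that nothing changed and the first position in the window was returned.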
Generally, I don't think it is straightforward to interpret cases where a mutation leads to multiple changes. If you see concordant splicing changes at both sites of an exon, you might be able to interpret it as a change in PSI (it depends on how complex the splicing pattern for the gene can be). If they are opposite changes at the ends of an exon, then it is harder to interpret. In the 3rd case though, since the splice sites are close and the changes are both large, I think your interpretation is correct.
For these datasets, I ran Pangolin on sequences centered on the 5' sites and also on sequences centered on the 3' sites. Then I specifically looked at the change in splice score at those sites. You probably want to set --score_cutoff to report all splice changes past a threshold and examine the positions corresponding to the 5' and 3' sites (you also will need to set the distance parameter to be large enough).
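As a rough sketch of the last step, the Python snippet below looks up the predicted change at specific splice-site coordinates from one output field. It rests on several assumptions that should be checked against the README: that with --score_cutoff the field is still a list of `pos:score` pairs after the gene ID, that each pos is an offset added to the variant's genomic coordinate, and that the site coordinates passed in are known from the annotation. The helper name `score_at_sites` and the coordinates in the example are illustrative only.

```python
def score_at_sites(variant_pos, field, site_positions):
    """Map each genomic site to its reported score change.

    Sites not listed in the field map to None (for example, a change
    below the cutoff or a site outside the distance window).
    Assumes genomic coordinate = variant_pos + reported offset.
    """
    changes = {}
    for item in field.split("|")[1:]:
        if item and not item.startswith("Warnings:"):
            offset, score = item.split(":")
            changes[variant_pos + int(offset)] = float(score)
    return {site: changes.get(site) for site in site_positions}


# Illustrative coordinates derived from the offsets in the third example,
# plus one position that is not reported; none are annotated splice sites.
field = "ENSG00000012048.23_1|-1:0.86|-3:-0.83|Warnings:"
result = score_at_sites(41276135, field, [41276134, 41276132, 41276000])
print(result)
```

A site that comes back as None was not reported for that variant, which is different from a predicted change of exactly zero.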

MiqG commented 2 years ago

Alright! Thanks, @tkzeng! I will keep experimenting.

tkzeng / Pangolin

interpretation of pangolin results #4