.pos File For Splicing Molecular Phenotypes

maegsul commented 4 years ago

Hi,

First of all, thank you for developing TWAS/FUSION. I am also using it to integrate GWAS and sQTL catalogues (including the CMC splicing weights shared here: http://gusevlab.org/projects/fusion/#reference-functional-data ; along with splicing weights I've calculated from other datasets, either publicly available or generated in-house).

My question is about the .pos file. As far as I understand, for expression weights the P0 and P1 columns indicate the TSS and TES of the gene. Similarly, there is always a single gene name in the second column, since expression phenotypes are usually straightforward.

However, splicing molecular phenotypes are a bit different: we calculate splicing weights for a junction X in a gene Y (which, let's say, has 5 other junctions in the same splicing cluster or in different splicing clusters). Should P0 and P1 now be based on the coordinates of junction X? Or should P0 and P1 again be the TSS and TES of gene Y?

I noticed that the .pos file of the CMC weights still has the TSS and TES. Is this as it should be?

I think that for splicing weights, P0 and P1 in the .pos file should be the coordinates of the exact splice junction for which we calculated the RDat file, but can you please comment on whether this is correct? (And if so, should the CMC .pos file be modified? Or does this not have any effect on the results?)

Similarly, for complex splicing events where Leafcutter (or any other splicing analysis tool) annotates 2 genes for the respective cluster, is it safe to annotate the gene names in the .pos file as "HERPUD2,AC007551.3"? Also, some clusters have an "NA" gene annotation.

Thanks in advance!

sashagusev commented 4 years ago

Hi, the pos file positions are there almost entirely for reference and/or visualizing the outputs. All of the algorithms are run on the specific SNPs that are in the corresponding RDat file and so essentially ignore the listed positions.

In general we have kept positions the same between gene and splicing files so that the same locus is shown when visualizing/comparing associations, as we are typically interested in whether the splicing effect is correlated with or independent of the overall gene effect.

The same goes for gene names, which are only used for visualizing the results and can be in any format.
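To make the convention above concrete, here is a minimal Python sketch of building a .pos listing for junction-level weights that reuses the parent gene's coordinates for P0 and P1. It assumes the listing uses the header columns WGT, ID, CHR, P0 and P1; the gene name, junction IDs, coordinates, and file paths are all hypothetical placeholders, and the `gene:junction` ID format is just one possible choice, since the names are only used for display.

```python
import csv
import io

POS_COLUMNS = ["WGT", "ID", "CHR", "P0", "P1"]  # assumed .pos header


def build_pos_rows(junction_weights, gene_coords):
    """Build one .pos row per junction, using the parent gene's span.

    junction_weights: list of (weight_file, junction_id, gene) tuples.
    gene_coords: dict mapping gene -> (chromosome, start, end).
    """
    rows = []
    for weight_file, junction_id, gene in junction_weights:
        chrom, p0, p1 = gene_coords[gene]
        rows.append({
            "WGT": weight_file,
            # Free-form label; only used when reporting or plotting results.
            "ID": f"{gene}:{junction_id}",
            "CHR": chrom,
            "P0": p0,
            "P1": p1,
        })
    return rows


def write_pos(rows, handle):
    """Write the rows as a tab-delimited table with a header line."""
    writer = csv.DictWriter(handle, fieldnames=POS_COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical inputs: two junctions in the same placeholder gene.
gene_coords = {"GENE_Y": ("7", 1000, 5000)}
junction_weights = [
    ("weights/GENE_Y.junc1.wgt.RDat", "junc1", "GENE_Y"),
    ("weights/GENE_Y.junc2.wgt.RDat", "junc2", "GENE_Y"),
]

rows = build_pos_rows(junction_weights, gene_coords)
buffer = io.StringIO()
write_pos(rows, buffer)
pos_text = buffer.getvalue()
```

Because every junction of a gene gets the same P0 and P1, junction-level and gene-level associations line up at the same locus when plotted side by side.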

maegsul commented 4 years ago

Thank you @sashagusev for clarifying it for me.

To add to this: in the top figure here, https://www.nature.com/articles/s41588-018-0092-1/figures/2 , the dark blue colour stands for CMC splicing TWAS associations. If we look at the LRRN3 hit on chr7, there seem to be 3 significant associations with similar Z scores. Does that mean there were actually 3 different splice junctions in LRRN3 that came out as significant TWAS hits? In other words, can we say that there is no overall (splicing) TWAS score for the LRRN3 gene, as these 3 junctions are treated as different phenotypes?

sashagusev commented 4 years ago

As you note, in cases such as these where there are multiple intron-excisions associated at comparable levels, the likely mechanism is one of:

1. One junction QTL driving the TWAS association and negatively correlated with the usage of other junctions (which tag the TWAS association).
2. Multiple independent junction QTLs leading to multiple independent TWAS associations.
3. Total expression driving the association and being tagged by one or multiple junction QTLs.

It is not always possible to disentangle the mechanism because of the different levels of noise on each of these molecular features. But I would recommend running the FUSION conditional analysis steps to evaluate whether these are correlated or independent events. Also, if the true scenario is (3) then you should expect to see a TWAS association for the overall gene model that is more significant than any of the junction models (with the caveat that it is possible overall expression was somehow noisier and the prediction model was missed).
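The last check, whether the overall gene model is more significant than any junction model, can be sketched roughly in Python as below. This is only a quick screen and not a substitute for the FUSION conditional analysis. It assumes a tab-delimited results table with ID and TWAS.Z columns; the IDs and Z scores in the toy table are made up for illustration.

```python
import csv
import io


def load_twas_z(handle):
    """Map each ID to its TWAS.Z value, skipping rows without a score."""
    z_scores = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["TWAS.Z"] not in ("NA", ""):
            z_scores[row["ID"]] = float(row["TWAS.Z"])
    return z_scores


def gene_model_dominates(z_scores, gene_id, junction_ids):
    """True if the gene model has a larger |Z| than every scored junction."""
    gene_abs_z = abs(z_scores[gene_id])
    return all(gene_abs_z > abs(z_scores[j])
               for j in junction_ids if j in z_scores)


# Toy results table with made-up values (not real TWAS output).
toy_table = (
    "ID\tTWAS.Z\n"
    "GENE_Y\t-6.1\n"
    "GENE_Y:junc1\t4.8\n"
    "GENE_Y:junc2\t-4.5\n"
    "GENE_Y:junc3\tNA\n"
)

z_scores = load_twas_z(io.StringIO(toy_table))
junctions = ["GENE_Y:junc1", "GENE_Y:junc2", "GENE_Y:junc3"]
dominates = gene_model_dominates(z_scores, "GENE_Y", junctions)
```

A result of True here is consistent with the third scenario but does not establish it. As noted above, a noisy expression model can hide a real gene-level effect, so a False result does not rule it out either.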

gusevlab / fusion_twas

.pos File For Splicing Molecular Phenotypes #17