Motivation
Alternative splice site selection is inherently competitive, and the probability that a given splice site is used also depends on the strength of neighboring sites. Here, we present a new model, the competitive splice site model (COSSMO), which explicitly accounts for these competitive effects and predicts the percent selected index (PSI) distribution over any number of putative splice sites. We model an alternative splicing event as the choice of a 3′ acceptor site conditional on a fixed upstream 5′ donor site, or the choice of a 5′ donor site conditional on a fixed 3′ acceptor site. We build four different architectures that use convolutional layers, communication layers, long short-term memory and residual networks, respectively, to learn relevant motifs from sequence alone. We also construct a new dataset from genome annotations and RNA-Seq read data that we use to train our model.
Results:
COSSMO is able to predict the most frequently used splice site with an accuracy of 70% on unseen test data, and achieves an R² of 0.6 in modeling the PSI distribution. We visualize the motifs that COSSMO learns from sequence and show that COSSMO recognizes the consensus splice site sequences and many known splicing factors with high specificity.
Availability and implementation:
Model predictions, our training dataset, and code are available from http://cossmo.genes.toronto.edu.
The authors build a model of alternative splicing trained on exon-exon junction-spanning reads from the GTEx data.
They consider "splicing events", where each event consists of a single constitutive site that can pair with any one of a set of alternative sites. They then predict the 'percent-spliced-in' (PSI) for each alternative site, which is simply the fraction of times that the constitutive site pairs with that alternative site.
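That definition of PSI can be sketched directly from junction-spanning read counts (a minimal illustration; the function and variable names here are mine, not the authors'):

```python
def psi_from_junction_reads(read_counts):
    """Given read counts for each (constitutive, alternative) junction of one
    splicing event, return the fraction of reads pairing the constitutive
    site with each alternative site (PSI)."""
    total = sum(read_counts.values())
    if total == 0:
        return {site: 0.0 for site in read_counts}
    return {site: n / total for site, n in read_counts.items()}

# e.g. one donor site pairing with three candidate acceptors
counts = {"acceptor_A": 80, "acceptor_B": 15, "acceptor_C": 5}
psi = psi_from_junction_reads(counts)
# psi["acceptor_A"] -> 0.8, and the PSI values sum to 1
```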
They treat all splicing events as either alternative-donor or alternative-acceptor events, and develop a separate model for each type.
For each alternative site, inputs include the sequence around the constitutive site, the alternative site, the mRNA sequence resulting from splicing the constitutive and alternative site, and a normalized measure of intron length.
For a given splicing event, they score each alternative site with a CNN that they call a "scoring network". Next, they feed the scores for all of an event's alternative sites into a second network, which produces normalized PSI values (i.e. non-negative values that sum to 1) for each alternative site.
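The two-stage structure (per-site scores, then normalization across the event's sites) can be sketched as follows. This is a schematic only: a plain softmax stands in for the learned output network, and the scores would come from the trained scoring network.

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability before exponentiating
    z = np.exp(scores - np.max(scores))
    return z / z.sum()

def predict_psi(site_scores):
    """Map unnormalized per-site scores (from a scoring network) to a
    PSI distribution over one event's alternative sites."""
    return softmax(np.asarray(site_scores, dtype=float))

psi = predict_psi([2.0, 0.5, -1.0])
# psi has one entry per alternative site and sums to 1,
# with higher-scoring sites receiving larger PSI
```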
For training data, the authors include splicing events from GENCODE v19, as well as "de novo" splice sites that are supported by at least 2 tissues from 2 individuals in GTEx. Further, they select random negative sites (i.e. with PSI=0), as well as "decoy sites" (i.e. MaxEntScan score >3, but no support in GTEx or GENCODE).
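One reading of the "2 tissues from 2 individuals" support criterion can be sketched as below. Note this is my interpretation of an ambiguous criterion, not the authors' code; the function name and input representation are hypothetical.

```python
def supported_de_novo(site_observations, min_tissues=2, min_individuals=2):
    """site_observations: list of (individual, tissue) pairs in which a
    candidate de novo splice site was observed in GTEx junction reads.
    Keep the site only if it is seen in enough distinct tissues AND
    enough distinct individuals."""
    individuals = {ind for ind, _ in site_observations}
    tissues = {tis for _, tis in site_observations}
    return len(tissues) >= min_tissues and len(individuals) >= min_individuals

# seen in two tissues across two individuals: kept
keep = supported_de_novo([("GTEX-1", "brain"), ("GTEX-2", "liver")])
# seen in two tissues but only one individual: dropped under this reading
drop = supported_de_novo([("GTEX-1", "brain"), ("GTEX-1", "liver")])
```

Note that duplicated GTEx brain samples (see the Dataset comments below) would inflate the distinct-tissue count under this reading unless collapsed first.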
Network structure.
They consider two scoring networks (a CNN and a residual network) and two output networks (a communication network and an LSTM). They also try a baseline that replaces the output network with a simple softmax.
For the communication network, they take the mean of the CNN outputs over all other alternative sites (i.e. excluding the current one) and add it to the current site's filters.
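That leave-one-out averaging step can be sketched in a few lines (my own minimal re-implementation of the idea as described, not the authors' code):

```python
import numpy as np

def communication_layer(features):
    """features: array of shape (n_sites, n_filters) holding per-site CNN
    outputs for one splicing event. For each site, add the mean of the
    *other* sites' outputs, letting downstream layers model competition."""
    n = features.shape[0]
    if n == 1:
        return features.copy()  # no other sites to communicate with
    total = features.sum(axis=0, keepdims=True)
    others_mean = (total - features) / (n - 1)  # leave-one-out mean per site
    return features + others_mean

f = np.array([[1.0], [3.0]])   # two sites, one filter each
out = communication_layer(f)   # site 0 gets +3, site 1 gets +1
```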
It would be nice to see a comparison of all combinations of scoring network and output network.
There is no mention of a ResNet-26 in the paper they cite. As such, a table of the model architectures and hyperparameters would be nice.
A lot of the logic guiding the model selection remains to be demonstrated and formally quantified:
- decoy sites help the network learn more subtle signals in the input sequence;
- negative sites help the network learn the core splicing motifs;
- pooling prevents learning of core motifs that are sensitive to shifts.
Most of their networks are not available online.
It is unclear if performance is altered by considering a larger/smaller sequence window around each splice site.
Dataset.
The authors use the "positional bootstrap" to quantify uncertainty in their PSI values, but don't really explain how/if this information is used during data selection, training, or evaluation.
The PSI values are averaged across all tissues in GTEx. This might mask some heteroscedasticity in PSI values.
They do not include the code used to generate their data. This is problematic. For instance, it is unknown whether they filtered out mitochondrial reads, which may derive from self-splicing introns. Recall that there are duplicated brain samples in GTEx; if they did not account for this, their filtering criterion (2 tissues, 2 individuals) might not work as expected.
Evaluation.
The top model is the CNN scoring network with a bidirectional LSTM output network.
The authors include ~60-80 decoy+negative sites per event, so R² against a uniform baseline will be somewhat inflated. Given that the authors are trying to learn the strength of different splice sites, it would have been nice to see R² calculated without the decoy and negative sites. Kendall's tau-c or another tie-adjusted rank correlation might serve the same purpose (i.e. by down-weighting the tied zeros) while being easier to implement and apply to their existing outputs.
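The inflation concern is easy to demonstrate numerically: padding both the ground truth and an imperfect prediction with shared zeros drives R² upward without adding any predictive skill. A toy illustration (numbers are mine, and R² is computed against the mean baseline):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination against the mean baseline."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

true_psi = np.array([0.6, 0.3, 0.1])   # real alternative sites
pred_psi = np.array([0.4, 0.4, 0.2])   # an imperfect prediction
r_real = r_squared(true_psi, pred_psi)

# pad with 60 decoy/negative sites that both truth and model place at 0
zeros = np.zeros(60)
r_padded = r_squared(np.concatenate([true_psi, zeros]),
                     np.concatenate([pred_psi, zeros]))
# r_padded exceeds r_real even though no skill on real sites was added
```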
It is unknown if the donor and acceptor models generally agree on PSI values for cassette exons.
The notion that their model is "not fooled by cryptic splice sites" seems odd, since they did not specifically test cryptic splice sites. To show that the model properly handles cryptic splice sites, they should have checked whether an activated cryptic splice site receives a much larger predicted PSI than it does in the reference sequence.
Model interpretation.
The authors extract motifs directly from their CNN filters, and show that many of them correspond to motifs relevant to splicing (e.g. splice sites, U2AF2 binding).
Several splicing motifs were not learned by their model (e.g. binding sites for RBFOX, NOVA, MBNL). Given the number of brain samples included in GTEx, the absence of NOVA is very surprising to me.
The authors only extracted motifs for the top performing model. It would have been nice to know if the motifs from the other models were largely similar or not.
The authors note that the order of splice sites explains the gap in performance between the softmax model and the others (i.e. the models that consider interactions between alternative sites). However, they did not explore the effects of ordering on their predictions.
It would have been nice to see a more in-depth evaluation of features learned by the model. For instance, they could have used in silico mutagenesis to see if the top-ranked motifs varied with ordering of alternative sites or the subset of sites considered.
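A minimal in silico mutagenesis loop of the kind suggested here is straightforward to write. This is a generic sketch, not tied to COSSMO's actual API: `score_fn` stands in for a trained model's scoring function, and the toy scorer below (counting GT dinucleotides as a crude donor-site proxy) exists only to make the example runnable.

```python
import numpy as np

BASES = "ACGT"

def in_silico_mutagenesis(seq, score_fn):
    """For each position in seq, substitute each base and record the change
    in model score relative to the reference; large |delta| marks positions
    the model relies on. score_fn maps a sequence string to a scalar."""
    ref = score_fn(seq)
    deltas = np.zeros((len(seq), len(BASES)))
    for i in range(len(seq)):
        for j, base in enumerate(BASES):
            if seq[i] == base:
                continue  # reference base: delta stays 0
            mutant = seq[:i] + base + seq[i + 1:]
            deltas[i, j] = score_fn(mutant) - ref
    return deltas

# toy scorer: number of 'GT' dinucleotides, a stand-in for a donor model
toy_score = lambda s: s.count("GT")
deltas = in_silico_mutagenesis("AAGTAA", toy_score)
# disrupting the G of the single GT (position 2) yields a negative delta
```

Re-running such a scan while permuting the order of, or subsetting, the alternative sites fed to the output network would directly test whether the learned motifs are stable.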
https://doi.org/10.1093/bioinformatics/bty244