Motivation
Alternative splice site selection is inherently competitive, and the probability that a given splice site is used also depends on the strength of neighboring sites. Here, we present a new model, the competitive splice site model (COSSMO), which explicitly accounts for these competitive effects and predicts the percent selected index (PSI) distribution over any number of putative splice sites. We model an alternative splicing event as the choice of a 3′ acceptor site conditional on a fixed upstream 5′ donor site, or the choice of a 5′ donor site conditional on a fixed 3′ acceptor site. We build four different architectures that use convolutional layers, communication layers, long short-term memory and residual networks, respectively, to learn relevant motifs from sequence alone. We also construct a new dataset from genome annotations and RNA-Seq read data that we use to train our model.
Results:
COSSMO is able to predict the most frequently used splice site with an accuracy of 70% on unseen test data, and achieves an R² of 0.6 in modeling the PSI distribution. We visualize the motifs that COSSMO learns from sequence and show that COSSMO recognizes the consensus splice site sequences and many known splicing factors with high specificity.
Availability and implementation:
Model predictions, our training dataset, and code are available from http://cossmo.genes.toronto.edu.
The authors build a model of alternative splicing trained on exon-exon junction-spanning reads from the GTEx data.
They consider "splicing events", where each event consists of a single constitutive site that can pair with any one of a set of alternative sites. They then predict the 'percent-spliced-in' (PSI) for each alternative site, which is simply the fraction of times that the constitutive site pairs with that alternative site.
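That definition of PSI can be sketched directly from junction-spanning read counts (a minimal illustration; the function and variable names here are mine, not the authors'):

```python
def psi_from_junction_reads(read_counts):
    """Given read counts for each (constitutive, alternative) junction of one
    splicing event, return the fraction of reads pairing the constitutive
    site with each alternative site (PSI)."""
    total = sum(read_counts.values())
    if total == 0:
        return {site: 0.0 for site in read_counts}
    return {site: n / total for site, n in read_counts.items()}

# e.g. one donor site pairing with three candidate acceptors
counts = {"acceptor_A": 80, "acceptor_B": 15, "acceptor_C": 5}
psi = psi_from_junction_reads(counts)
# psi["acceptor_A"] -> 0.8, and the PSI values sum to 1
```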
They treat all splicing events as either alternative-donor or alternative-acceptor events, and develop a separate model for each type.
For each alternative site, inputs include the sequence around the constitutive site, the alternative site, the mRNA sequence resulting from splicing the constitutive and alternative site, and a normalized measure of intron length.
For a given splicing event, they score each alternative site with a CNN that they call a "scoring network". Next, they feed the scores for all of an event's alternative sites into a second network, which produces normalized PSI values (i.e. non-negative values that sum to 1) for each alternative site.
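The two-stage structure (per-site scores, then normalization across the event's sites) can be sketched as follows. This is a schematic only: a plain softmax stands in for the learned output network, and the scores would come from the trained scoring network.

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability before exponentiating
    z = np.exp(scores - np.max(scores))
    return z / z.sum()

def predict_psi(site_scores):
    """Map unnormalized per-site scores (from a scoring network) to a
    PSI distribution over one event's alternative sites."""
    return softmax(np.asarray(site_scores, dtype=float))

psi = predict_psi([2.0, 0.5, -1.0])
# psi has one entry per alternative site and sums to 1,
# with higher-scoring sites receiving larger PSI
```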
For training data, the authors include splicing events from GENCODE v19, as well as "de novo" splice sites that are supported by at least 2 tissues from 2 individuals in GTEx. Further, they select random negative sites (i.e. with PSI=0), as well as "decoy sites" (i.e. MaxEntScan score >3, but no support in GTEx or GENCODE).
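One reading of the "2 tissues from 2 individuals" support criterion can be sketched as below. Note this is my interpretation of an ambiguous criterion, not the authors' code; the function name and input representation are hypothetical.

```python
def supported_de_novo(site_observations, min_tissues=2, min_individuals=2):
    """site_observations: list of (individual, tissue) pairs in which a
    candidate de novo splice site was observed in GTEx junction reads.
    Keep the site only if it is seen in enough distinct tissues AND
    enough distinct individuals."""
    individuals = {ind for ind, _ in site_observations}
    tissues = {tis for _, tis in site_observations}
    return len(tissues) >= min_tissues and len(individuals) >= min_individuals

# seen in two tissues across two individuals: kept
keep = supported_de_novo([("GTEX-1", "brain"), ("GTEX-2", "liver")])
# seen in two tissues but only one individual: dropped under this reading
drop = supported_de_novo([("GTEX-1", "brain"), ("GTEX-1", "liver")])
```

Note that duplicated GTEx brain samples (see the Dataset comments below) would inflate the distinct-tissue count under this reading unless collapsed first.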
Network structure.
They consider two scoring networks (a CNN and a residual network) and two output networks (a communication network and an LSTM). They also try a baseline that replaces the output network with a simple softmax.
For the communication network, they take the mean of the CNN outputs over all other alternative sites (i.e. excluding the current one) and add it to the current site's filters.
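That leave-one-out averaging step can be sketched in a few lines (my own minimal re-implementation of the idea as described, not the authors' code):

```python
import numpy as np

def communication_layer(features):
    """features: array of shape (n_sites, n_filters) holding per-site CNN
    outputs for one splicing event. For each site, add the mean of the
    *other* sites' outputs, letting downstream layers model competition."""
    n = features.shape[0]
    if n == 1:
        return features.copy()  # no other sites to communicate with
    total = features.sum(axis=0, keepdims=True)
    others_mean = (total - features) / (n - 1)  # leave-one-out mean per site
    return features + others_mean

f = np.array([[1.0], [3.0]])   # two sites, one filter each
out = communication_layer(f)   # site 0 gets +3, site 1 gets +1
```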
It would be nice to see a comparison of all combinations of scoring network and output network.
There is no mention of a ResNet-26 in the paper they cite. As such, a table of the model architectures and hyperparameters would be nice.
A lot of the logic guiding the model selection remains to be demonstrated and formally quantified:
- decoy sites help the network learn more subtle signals in the input sequence;
- negative sites help the network learn the core splicing motifs;
- pooling prevents learning of core motifs that are sensitive to shifts.
Most of their networks are not available online.
It is unclear if performance is altered by considering a larger/smaller sequence window around each splice site.
Dataset.
The authors use the "positional bootstrap" to quantify uncertainty in their PSI values, but don't really explain how/if this information is used during data selection, training, or evaluation.
The PSI values are averaged across all tissues in GTEx. This might mask some heteroscedasticity in PSI values.
They do not include the code used to generate their data. This is problematic. For instance, it is unknown whether they filtered out mitochondrial reads, which may derive from self-splicing introns. Recall that there are duplicated brain samples in GTEx; if they did not account for this, their filtering criterion (2 tissues, 2 individuals) might not work as expected.
Evaluation.
The top model is the CNN scoring network with a bidirectional LSTM output network.
The authors include ~60-80 decoy+negative sites per event, so R² against a uniform baseline will be somewhat inflated. Given that the authors are trying to learn the strength of different splice sites, it would have been nice to see R² calculated without the decoy and negative sites. Kendall's tau-c or another tie-adjusted rank correlation might serve the same purpose (i.e. by down-weighting the tied zeros) while being easier to implement and apply to their existing outputs.
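The inflation concern is easy to demonstrate numerically: padding both the ground truth and an imperfect prediction with shared zeros drives R² upward without adding any predictive skill. A toy illustration (numbers are mine, and R² is computed against the mean baseline):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination against the mean baseline."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

true_psi = np.array([0.6, 0.3, 0.1])   # real alternative sites
pred_psi = np.array([0.4, 0.4, 0.2])   # an imperfect prediction
r_real = r_squared(true_psi, pred_psi)

# pad with 60 decoy/negative sites that both truth and model place at 0
zeros = np.zeros(60)
r_padded = r_squared(np.concatenate([true_psi, zeros]),
                     np.concatenate([pred_psi, zeros]))
# r_padded exceeds r_real even though no skill on real sites was added
```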
It is unknown if the donor and acceptor models generally agree on PSI values for cassette exons.
The notion that their model is "not fooled by cryptic splice sites" seems odd, since they did not specifically test cryptic splice sites. To show that the model properly handles cryptic splice sites, they should have checked whether an activated cryptic splice site receives a much larger predicted PSI than it does in the reference sequence.
Model interpretation.
The authors extract motifs directly from their CNN filters, and show that many of them correspond to motifs relevant to splicing (e.g. splice sites, U2AF2 binding).
Several splicing motifs were not learned by their model (e.g. binding sites for RBFOX, NOVA, MBNL). Given the number of brain samples included in GTEx, the absence of NOVA is very surprising to me.
The authors only extracted motifs for the top performing model. It would have been nice to know if the motifs from the other models were largely similar or not.
The authors note that the order of splice sites explains the gap in performance between the softmax model and the others (i.e. the models that consider interactions between alternative sites). However, they did not explore the effects of ordering on their predictions.
It would have been nice to see a more in-depth evaluation of features learned by the model. For instance, they could have used in silico mutagenesis to see if the top-ranked motifs varied with ordering of alternative sites or the subset of sites considered.
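A minimal in silico mutagenesis loop of the kind suggested here is straightforward to write. This is a generic sketch, not tied to COSSMO's actual API: `score_fn` stands in for a trained model's scoring function, and the toy scorer below (counting GT dinucleotides as a crude donor-site proxy) exists only to make the example runnable.

```python
import numpy as np

BASES = "ACGT"

def in_silico_mutagenesis(seq, score_fn):
    """For each position in seq, substitute each base and record the change
    in model score relative to the reference; large |delta| marks positions
    the model relies on. score_fn maps a sequence string to a scalar."""
    ref = score_fn(seq)
    deltas = np.zeros((len(seq), len(BASES)))
    for i in range(len(seq)):
        for j, base in enumerate(BASES):
            if seq[i] == base:
                continue  # reference base: delta stays 0
            mutant = seq[:i] + base + seq[i + 1:]
            deltas[i, j] = score_fn(mutant) - ref
    return deltas

# toy scorer: number of 'GT' dinucleotides, a stand-in for a donor model
toy_score = lambda s: s.count("GT")
deltas = in_silico_mutagenesis("AAGTAA", toy_score)
# disrupting the G of the single GT (position 2) yields a negative delta
```

Re-running such a scan while permuting the order of, or subsetting, the alternative sites fed to the output network would directly test whether the learned motifs are stable.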
https://doi.org/10.1093/bioinformatics/bty244