Open gwaybio opened 4 years ago
One section of a larger paper involves training a CNN on an "allelic series" of CRISPRi expression "titrations". That sentence is painful to read... in other words, in the assay, the authors systematically tinker with sgRNA sequences to tune the impact of CRISPR knockdown on gene expression. This lets the authors directly read out the ground-truth impact of modulating gene expression levels along a continuum between basal and knockout.
The CNN is trained on sgRNA sequences paired with their corresponding "relative activity". The relative activity is a single number representing a growth phenotype (essentially cell count). The authors train an ensemble of CNNs and evaluate it on a held-out test set. They also validate the model by showing that it can predict GFP expression, used as the "relative activity", in a CRISPRi allelic series targeting GFP.
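To make the input format concrete, here is a minimal sketch of how an sgRNA sequence could be one-hot encoded and paired with its relative activity. The helper name `one_hot`, the `ACGT` channel order, and the example sequence and activity value are all assumptions for illustration; the paper's actual encoding may differ.

```python
# Hypothetical helper: one-hot encode an sgRNA sequence for a 1D CNN.
# The channel order "ACGT" is an assumption, not taken from the paper.
BASES = "ACGT"

def one_hot(seq):
    """Return one length-4 row per nucleotide, with a single 1 per row."""
    rows = []
    for base in seq.upper():
        if base not in BASES:
            raise ValueError(f"unexpected base {base!r}")
        rows.append([1 if base == b else 0 for b in BASES])
    return rows

# Made-up training example: (sgRNA sequence, relative activity).
example = ("GACGTTAGCCATGGATCCAA", 0.42)
x = one_hot(example[0])  # model input, shape (20, 4)
y = example[1]           # regression target
```

Each training example is then an `(x, y)` pair, with `y` being the single growth-phenotype number described above.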
The architecture is two convolutional layers, followed by a max pooling layer, then a fully connected layer that predicts activity. The authors train 20 different models, and inference on new data takes the mean prediction of the 20 models.
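The ensemble-averaging step can be sketched as follows. This is not the authors' code: the "models" here are toy callables standing in for the 20 trained CNNs, and `make_toy_model` and `ensemble_predict` are hypothetical names. Only the averaging logic is the point.

```python
from statistics import mean

def ensemble_predict(models, x):
    """Average the predictions of all ensemble members for one input."""
    return mean(model(x) for model in models)

def make_toy_model(offset):
    # Placeholder for a trained CNN: returns GC fraction plus an offset.
    # A real member would run the conv -> conv -> max pool -> dense stack.
    def model(seq):
        gc = sum(base in "GC" for base in seq) / len(seq)
        return gc + offset
    return model

# 20 stand-in models, mirroring the ensemble size in the paper.
models = [make_toy_model(0.01 * i) for i in range(20)]
pred = ensemble_predict(models, "GACGTTAGCC")
```

Averaging over independently trained members is a common way to reduce the variance of any single network's prediction.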
The CNN ensemble outperforms a logistic regression model (r^2 = 0.65 vs. r^2 = 0.52, respectively).
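For reference, an r^2 comparison like this can be computed on held-out data as sketched below. I am assuming r^2 means the squared Pearson correlation between measured and predicted activity; the paper may define it differently (e.g. as the coefficient of determination). The function name and the numbers are made up.

```python
def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation between measured and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    cov = sum((a - mean_true) * (b - mean_pred) for a, b in zip(y_true, y_pred))
    var_true = sum((a - mean_true) ** 2 for a in y_true)
    var_pred = sum((b - mean_pred) ** 2 for b in y_pred)
    return cov * cov / (var_true * var_pred)

# Made-up held-out activities and predictions, for illustration only.
measured = [1.0, 2.0, 3.0]
predicted = [1.0, 3.0, 2.0]
score = pearson_r2(measured, predicted)
```

The same function would be applied to both models' predictions on the same held-out set to get comparable scores.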
The authors show that mismatch position (along the sgRNA construct) and mismatch type (e.g. A -> T) were the most informative features. GC content was also important, and an intermediate location between the end and the PAM also seemed to be informative.
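As a rough illustration of what these features look like, the sketch below extracts mismatch position, mismatch type, and GC content by comparing a variant sgRNA against its perfectly matched parent. The function names, the 0-based indexing from the 5' end, and the assumption that the PAM sits just past the 3' end of the sequence are mine, and the sequences are made up.

```python
def mismatch_features(perfect, variant):
    """List each mismatch with its position, PAM distance, and type."""
    if len(perfect) != len(variant):
        raise ValueError("sequences must have the same length")
    feats = []
    for i, (p, v) in enumerate(zip(perfect, variant)):
        if p != v:
            feats.append({
                "position": i,                          # 0-based from the 5' end
                "pam_distance": len(perfect) - 1 - i,   # assumes PAM after the 3' end
                "type": f"{p}->{v}",
            })
    return feats

def gc_content(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)

perfect = "GACGTTAGCCATGGATCCAA"  # made-up perfectly matched sgRNA
variant = "GACGTTAGCCTTGGATCCAA"  # same sgRNA with one A -> T mismatch
feats = mismatch_features(perfect, variant)
gc = gc_content(variant)
```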
The authors also used their trained model to identify the sgRNA constructs most likely to produce activity within a given range. This helped with designing a more compact sgRNA library 🤯
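The selection step could look something like the sketch below: score candidate sgRNAs with the trained model and keep only those whose predicted activity falls in a target window. The function name, the candidate names, the predicted values, and the 0.3 to 0.5 window are all hypothetical; the authors' actual library design procedure is more involved.

```python
def select_in_range(candidates, predict, low, high):
    """Keep candidates whose predicted activity lies in [low, high]."""
    return [seq for seq in candidates if low <= predict(seq) <= high]

# Made-up predictions standing in for the ensemble's output.
predicted = {
    "variant_a": 0.05,
    "variant_b": 0.35,
    "variant_c": 0.48,
    "variant_d": 0.90,
}

# Keep only variants predicted to have intermediate activity.
chosen = select_in_range(predicted, predicted.get, 0.3, 0.5)
```

Here `predicted.get` stands in for the model's predict function; in practice it would be the ensemble mean over the trained CNNs.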
This paper is a good example of a trend in which deep learning is becoming more integrated into efforts that are primarily about assay development and molecular biology.
https://doi.org/10.1038/s41587-019-0387-5