Closed by agitter 5 years ago
See also https://doi.org/10.1101/093534
In http://dx.doi.org/10.1101/079087, we presented adaptive models for calling somatic mutations in high-throughput sequencing data. These models were developed by training deep neural networks with semi-simulated data. In this continuation, I evaluate how such models can predict known somatic mutations in a real dataset. To address this question, I tested the approach using samples from the International Cancer Genome Consortium (ICGC) and the previously published ground-truth mutations (GoldSet). This evaluation revealed that training with semi-simulation produces models that exhibit strong performance on real datasets. I found a linear relationship between performance on a semi-simulated validation set and performance on the independent gold-set ground truth (r^2 = 0.952, P < 2e-16). I also found that semi-simulation can be used to pre-train models before continuing training with true labels, and that this pre-training substantially improves performance on the real dataset compared to training with the real dataset alone. The best model pre-trained with semi-simulation achieved an AUC of 0.969 [0.957-0.982] (95% confidence interval), compared to 0.911 [0.890-0.932] when training with real labels only. These data demonstrate that semi-simulation can be a very effective approach for training probabilistic filtering and ranking models.
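The AUC figures above are reported with 95% confidence intervals. As a rough illustration of how such an interval can be obtained (this is a minimal stdlib-only sketch using a rank-based AUC and a percentile bootstrap, not the authors' actual evaluation code), one might write:

```python
import random

def auc(labels, scores):
    """Rank-based AUC: probability that a random positive outranks
    a random negative, counting ties as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ls = [labels[i] for i in idx]
        ss = [scores[i] for i in idx]
        if 0 < sum(ls) < n:  # resample must contain both classes
            stats.append(auc(ls, ss))
    stats.sort()
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi

if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(auc(labels, scores), bootstrap_ci(labels, scores, n_boot=200))
```

On real mutation calls the labels would come from the gold set and the scores from the trained model; the bootstrap simply resamples call sites with replacement.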
We covered one of these manuscripts but not both. Per the author, the second (the continuation) includes validation on real data, which is relevant to this part of our Discussion section:
Similarly, a somatic mutation caller has been trained by injecting mutations into real sequencing datasets [346]. This method detected mutations in other semi-synthetic datasets but was not validated on real data.
http://doi.org/10.1101/079087 (http://biorxiv.org/content/early/2016/10/04/079087)