This PR adds microsatellite instability (MSI) status prediction across cancer types, as another prediction problem for the LASSO parameter range experiments I've been running. TCGA has MSI classifications for COAD, READ, STAD, and UCEC, so these experiments use those four cancer types.
In general, lower LASSO penalties (less regularization) tend to work better for this problem, both for performance on the training cancer types and for generalization to unseen cancer types.
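To make the "training vs. unseen cancer types" setup concrete, here is a rough sketch of how the experiment grid could be enumerated. It assumes one cancer type is held out at a time, which may not match the scripts in this PR exactly; the function name `holdout_configs` and the penalty values are made up for illustration.

```python
from itertools import product

# Cancer types with TCGA MSI classifications, as listed above.
CANCER_TYPES = ["COAD", "READ", "STAD", "UCEC"]

# Hypothetical grid of LASSO penalty strengths (assumption, not the
# values used in this PR). Smaller values mean less regularization.
LASSO_PENALTIES = [0.001, 0.01, 0.1, 1.0]


def holdout_configs(cancer_types, penalties):
    """Yield (train_types, held_out_type, penalty) for every combination.

    Each config trains on all cancer types except one and evaluates on
    the held-out type, for a single penalty strength.
    """
    for held_out, penalty in product(cancer_types, penalties):
        train = [c for c in cancer_types if c != held_out]
        yield train, held_out, penalty


configs = list(holdout_configs(CANCER_TYPES, LASSO_PENALTIES))
for train, held_out, penalty in configs[:4]:
    print(f"train on {train}, test on {held_out}, penalty={penalty}")
```

Each config would then be passed to whatever model-fitting code is in use; comparing held-out performance across penalties is what the generalization claim above refers to.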
The script at 10_msi_prediction/download_msi.ipynb was already reviewed as part of the https://github.com/greenelab/mpmp repo, so you don't need to look too closely at it. The other scripts are new, but based heavily on existing scripts used for the mutation prediction experiments.