Following on #64, this PR continues to build the infrastructure to run and analyze similar experiments with model size/LASSO penalty variation across multiple cancer genes. In this case, I ran the same experiments for TP53, EGFR, ATRX and CDKN2A (all common cancer drivers with diverse cancer types included in the training set).
The notebook at 02_cancer_type_classification/lasso_range_gene.ipynb breaks down performance across cancer types for a single gene. Here's an example for CDKN2A:
This shows that for most cancer types, the lower LASSO parameters (less regularization/more features) perform at least as well as the higher LASSO parameters (more regularization/fewer features), even when the test cancer type is held out of the training set.
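The held-out-cancer-type evaluation described above amounts to a leave-one-group-out split. Here's a minimal sketch of that splitting logic; the cancer-type labels are made up for illustration and the function name is not from the notebooks:

```python
import numpy as np

# Toy stand-in for the real labels: cancer_types[i] is the cancer type of sample i.
cancer_types = np.array(["LUAD", "BRCA", "LUAD", "GBM", "BRCA", "GBM", "LUAD", "BRCA"])

def leave_one_cancer_type_out(cancer_types):
    """Yield (held_out, train_idx, test_idx), holding out one cancer type at a time."""
    for held_out in np.unique(cancer_types):
        test_mask = cancer_types == held_out
        yield held_out, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Each cancer type appears exactly once as the test set; the model is trained
# on all remaining cancer types, so generalization to unseen types is measured.
splits = {h: (tr, te) for h, tr, te in leave_one_cancer_type_out(cancer_types)}
```

This is the same idea as scikit-learn's `LeaveOneGroupOut`, written out explicitly to show that the test cancer type never appears in the training indices.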
By contrast, the notebook at 02_cancer_type_classification/lasso_range_all.ipynb breaks down performance across all the genes in the dataset:
This shows that across these 4 genes, there tends to be a positive correlation between the number of features in the model and generalization performance (i.e. in general, more features in the model => better generalization). The next step is to scale this up to the set of several hundred driver genes from my last paper, and find a clean way to analyze/visualize it (i.e. not a box plot for each of >100 genes...)
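The regularization/feature-count tradeoff driving these results can be sketched with a plain LASSO regression fit by coordinate descent. This is a simplified stand-in, not the notebooks' actual classifier or data; the penalty values and synthetic dataset are made up to show that sweeping the penalty directly controls how many features survive:

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n  # per-feature curvature
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with feature j's current contribution removed
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / n
            # soft-thresholding: a larger lam zeroes out more coefficients
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return w

rng = np.random.default_rng(0)
n, p = 200, 30
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:5] = [2.0, -1.5, 1.0, 0.8, -0.5]  # only 5 informative features
y = X @ true_w + 0.1 * rng.normal(size=n)

# Sweep the penalty, recording how many features survive at each value.
nonzero_counts = {
    lam: int((np.abs(lasso_coordinate_descent(X, y, lam)) > 1e-8).sum())
    for lam in (0.01, 0.1, 0.5)
}
```

Smaller penalties retain more nonzero coefficients, which is the "less regularization/more features" end of the range where generalization looks best in the experiments above.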