greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

Univariate feature selection for stratified CV #47

Closed jjc2718 closed 2 years ago

jjc2718 commented 2 years ago

We're testing the idea that features that "generalize better" across cancer types in the training set, for some definition of "generalize better", will let us build models that are more robust to shifts in cancer type, tissue, etc.

This experiment is something of a sanity check, since we're only doing stratified cross-validation (i.e. every cancer type in the test set is also in the training set). We mainly want to confirm that the new feature selection methods don't completely tank performance in this case before moving on to generalization to specific cancer types that may or may not be present in the training data.
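To make "stratified" concrete, here is a minimal dependency-free sketch of splitting samples into folds so that each cancer type is spread across all folds. The function name `stratified_folds`, the toy cancer type labels, and the fold count are illustrative assumptions, not the code used in this PR; in practice something like scikit-learn's `StratifiedKFold` would likely be used instead.

```python
import random
from collections import defaultdict


def stratified_folds(cancer_types, n_folds=4, seed=0):
    """Assign sample indices to folds, spreading each cancer type across folds.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group sample indices by cancer type.
    by_type = defaultdict(list)
    for idx, cancer_type in enumerate(cancer_types):
        by_type[cancer_type].append(idx)

    # Shuffle within each cancer type, then deal indices round-robin into folds.
    folds = [[] for _ in range(n_folds)]
    for cancer_type in sorted(by_type):
        indices = by_type[cancer_type]
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Toy labels (assumed for illustration): 8 BRCA, 8 LUAD, 4 COAD samples.
cancer_types = ["BRCA"] * 8 + ["LUAD"] * 8 + ["COAD"] * 4
folds = stratified_folds(cancer_types, n_folds=4)

for test_idx in folds:
    held_out = set(test_idx)
    test_types = {cancer_types[i] for i in held_out}
    train_types = {
        cancer_types[i] for i in range(len(cancer_types)) if i not in held_out
    }
    # The property described above: every test cancer type also appears in training.
    assert test_types <= train_types
```

This property only holds when each cancer type has at least two samples landing in different folds; a cancer type with a single sample would appear in one test fold and never in the corresponding training set.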

Performance does seem to be fine for the new methods ("pancan_f_test" and "median_f_test" are the two that should be roughly on par with the green box, which is no feature selection here):

[image: plot comparing performance across feature selection methods]
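For readers unfamiliar with the method names, the sketch below shows one plausible reading of them. It is an assumption, not something stated in this thread: "pancan_f_test" is taken to score each feature with an F-statistic over all training samples pooled, and "median_f_test" to compute the F-statistic separately within each cancer type and take the median. All function names and the toy data are hypothetical, and the F-statistic is written out in plain Python to stay dependency-free (scikit-learn's `f_classif` computes the same kind of ANOVA F-value).

```python
from statistics import mean, median


def f_statistic(values, labels):
    """One-way ANOVA F-statistic of a single feature across label groups."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        return 0.0
    grand = mean(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values()) / (k - 1)
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g) / (n - k)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def pancan_f_scores(X, y):
    """ASSUMED reading of 'pancan_f_test': F-statistic on all samples pooled."""
    return [f_statistic([row[j] for row in X], y) for j in range(len(X[0]))]


def median_f_scores(X, y, cancer_types):
    """ASSUMED reading of 'median_f_test': median of per-cancer-type F-statistics."""
    scores = []
    for j in range(len(X[0])):
        per_type = []
        for ct in sorted(set(cancer_types)):
            rows = [i for i, c in enumerate(cancer_types) if c == ct]
            per_type.append(f_statistic([X[i][j] for i in rows], [y[i] for i in rows]))
        scores.append(median(per_type))
    return scores


def top_k(scores, k):
    """Indices of the k highest-scoring features."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Toy data (made up): two cancer types "A" and "B", binary mutation label y.
# Feature 0 tracks the label in both types; feature 1 tracks cancer type only;
# feature 2 carries a weak but consistent label signal within each type.
cancer_types = ["A"] * 6 + ["B"] * 6
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
X = [
    [2.1, 1.0, 0.5], [1.9, 1.2, -0.5], [2.0, 0.8, 0.5], [2.2, 1.0, -0.1],
    [0.1, 1.1, 0.4], [-0.1, 0.9, -0.6],
    [0.0, 0.0, 0.5], [0.2, 0.2, -0.5], [-0.2, -0.2, 0.3], [0.1, 0.0, -0.7],
    [2.0, 0.1, 0.6], [1.8, -0.1, -0.4],
]

pancan_top = top_k(pancan_f_scores(X, y), 2)
median_top = top_k(median_f_scores(X, y, cancer_types), 2)
print("pooled selection:", pancan_top)
print("median-over-types selection:", median_top)
```

In this toy setup the pooled score rewards feature 1 because cancer type is correlated with the label, while the per-type median ranks it last, which is the kind of difference the "generalize better" idea is after.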
