We're testing the idea that features which "generalize better" across cancer types in the training set, for some definition of "generalize better", will let us build models that are more robust to shifts in cancer type, tissue, and so on.
This experiment is something of a sanity check, since we're just doing stratified cross-validation (i.e. every cancer type in the test set is also present in the training set). We mostly want to confirm that the new feature selection methods don't completely tank performance in this easy setting, before moving on to generalization to specific cancer types that may or may not be present in the training data.
Performance does look fine for the new methods ("pancan_f_test" and "median_f_test" are the two that should be roughly close to the green box, which here is no feature selection):
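For reference, the stratified-CV setup described above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the synthetic data stands in for the real expression matrix and labels, and `SelectKBest(f_classif)` stands in for the custom "pancan_f_test"/"median_f_test" selectors, whose implementations aren't shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the real data (assumption, for illustration only).
X, y = make_classification(
    n_samples=300, n_features=100, n_informative=10, random_state=0
)
# Fake cancer-type labels; stratifying the folds on these guarantees that
# every cancer type appears in both the training and test splits.
rng = np.random.default_rng(0)
cancer_type = rng.integers(0, 3, size=len(y))

# Generic univariate f-test selection as a placeholder for the real
# feature selection methods being compared.
model = make_pipeline(
    SelectKBest(f_classif, k=20),
    LogisticRegression(max_iter=1000),
)

scores = []
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, cancer_type):
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
```

The key point is that the folds are stratified on `cancer_type` rather than on the prediction target, so this setup never tests generalization to an unseen cancer type — hence "sanity check".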