This addresses an issue with certain genes that were present in the results for some seeds/data types, but not others. See #44 for some background and examples.
After exploring this, there ended up being a few separate issues, all of which are fixed in this PR:
Shuffling the labels in the baseline models sometimes produced test sets with no mutated samples at all, especially for highly imbalanced labels. Now I split into train/test sets first, then shuffle the labels within each split independently, maintaining the same label balance as in the un-shuffled models (see the sketch below).
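A minimal sketch of the new approach (function and parameter names here are illustrative, not the actual mpmp code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def make_shuffled_baseline(X, y, seed=42):
    """Split first, then permute labels within each split.

    Because shuffling happens after the train/test split, each split
    keeps its original proportion of mutated samples, so even a highly
    imbalanced gene can't end up with a test set containing no
    positive labels.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    rng = np.random.default_rng(seed)
    # Permute labels independently within each split; the marginal
    # label balance of train and test is unchanged.
    return X_train, X_test, rng.permutation(y_train), rng.permutation(y_test)
```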
In `mpmp/utilities/tcga_utilities.py`, I was using sets rather than lists for `valid_samples`; I switched to passing around lists to make the sample order deterministic (illustrated below).
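A small illustration of why the set caused trouble and what the fix looks like (sample IDs are made up):

```python
expression_samples = ["TCGA-02-0003", "TCGA-02-0001", "TCGA-02-0006"]
mutation_samples = ["TCGA-02-0001", "TCGA-02-0003"]

# Set intersection is convenient, but iteration order over a set of
# strings depends on per-run hash randomization, so downstream code
# can see the samples in a different order on each run:
overlap = set(expression_samples) & set(mutation_samples)

# Sorting the intersection into a list makes the order deterministic:
valid_samples = sorted(overlap)
print(valid_samples)  # ['TCGA-02-0001', 'TCGA-02-0003'] on every run
```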
Also in `mpmp/utilities/tcga_utilities.py`, I was applying our cancer type mutation filter (>15 samples mutated, >5% of samples mutated) before removing hyper-mutated samples. This left some genes with fewer than 15 positive labels in the final dataset, some of which would drop out further down the pipeline. I swapped the order of the hyper-mutated sample filter and the cancer type filter, which makes things much more predictable and consistent (see the sketch below).
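A rough sketch of the corrected filter order, assuming a per-sample labels DataFrame; the names and exact logic here are illustrative, not the mpmp implementation:

```python
import pandas as pd

def filter_cancer_types(labels_df, hypermutated_ids,
                        count_threshold=15, prop_threshold=0.05):
    """labels_df has one row per sample, with a binary 'status' column
    (mutated or not) and a 'cancer_type' column; hypermutated_ids are
    the sample IDs to drop."""
    # 1) Drop hyper-mutated samples FIRST, so the counts below reflect
    #    the samples that actually reach the final dataset.
    labels_df = labels_df.drop(index=hypermutated_ids, errors='ignore')

    # 2) THEN keep cancer types with >15 mutated samples and >5% of
    #    samples mutated. Reversing these steps (the old behavior)
    #    could leave genes below the count threshold after the
    #    hyper-mutated samples were removed.
    counts = labels_df.groupby('cancer_type')['status'].agg(['sum', 'mean'])
    valid_types = counts[(counts['sum'] > count_threshold) &
                         (counts['mean'] > prop_threshold)].index
    return labels_df[labels_df['cancer_type'].isin(valid_types)]
```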
The net result of these changes is that we end up with fewer valid genes overall (85 for gene expression, 84 for methylation, 75 for all data types; see `01_explore_data/missing_genes.ipynb`), but these genes are now consistent across data types and across signal/shuffled models within a given experiment. Overall, our conclusions should be more robust and less susceptible to spurious signal and small sample sizes.
Looking at the genes that get dropped in the new experiments, most of them appear to be low-signal and hard to predict (near the origin in our volcano plots).
I'm currently re-running all of our experiments with these updated gene sets; the results will come in a separate future PR (although I don't expect them to change our conclusions).