Goal of this PR is to set up cross-validation to compare the following cases:
Train a mutation prediction model on 80% of data from a single cancer type, and test on the other 20% of data from that cancer type
Train a mutation prediction model on 80% of data from a single cancer type + all other data from TCGA, test on the other 20% of data from that cancer type
This PR just implements and tests the code to split the gene expression data. Next PR will deal with processing mutation labels and writing a script to run the comparison across the top 50 most common mutations in TCGA.
Question: currently I'm storing file paths in a config file at pancancer_utilities/config.py. The file paths in here work fine when I install the package in development mode (e.g. using pip install -e .), but when I install it the standard way (e.g. using pip install .) the file paths break because only the pancancer_utilities directory is copied and the root directory no longer exists. Do you have suggestions for a better way to handle specifying file paths and making them accessible anywhere in the root directory?
I've been trying to avoid hardcoding the root directory (currently using pathlib.Path(__file__) to get it dynamically instead) but if hardcoding it is the easiest solution I'm fine with doing that.
Thanks for the config file/file paths ideas! I'm going to wait on this for now since I want to get my experiments running, but I created #3 to revisit how I'm doing this in the future.
Goal of this PR is to set up cross-validation to compare the following cases:
This PR just implements and tests the code to split the gene expression data. Next PR will deal with processing mutation labels and writing a script to run the comparison across the top 50 most common mutations in TCGA.
Question: currently I'm storing file paths in a config file at
pancancer_utilities/config.py
. The file paths in here work fine when I install the package in development mode (e.g. usingpip install -e .
), but when I install it the standard way (e.g. usingpip install .
) the file paths break because only thepancancer_utilities
directory is copied and the root directory no longer exists. Do you have suggestions for a better way to handle specifying file paths and making them accessible anywhere in the root directory?I've been trying to avoid hardcoding the root directory (currently using
pathlib.Path(__file__)
to get it dynamically instead) but if hardcoding it is the easiest solution I'm fine with doing that.