greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

Implement and test single-cancer cross-validation #1

Closed jjc2718 closed 3 years ago

jjc2718 commented 3 years ago

The goal of this PR is to set up cross-validation to compare the following cases:

  1. Train a mutation prediction model on 80% of data from a single cancer type, and test on the other 20% of data from that cancer type
  2. Train a mutation prediction model on 80% of data from a single cancer type + all other data from TCGA, test on the other 20% of data from that cancer type
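
The two cases above could be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: the function name `split_single_cancer`, its parameters, and the use of 5 folds to get an 80/20 split are all assumptions made for the example.

```python
import random


def split_single_cancer(sample_ids, cancer_types, target_type,
                        n_folds=5, fold=0, seed=42):
    """Hypothetical helper: build train/test sample lists for one fold.

    Returns (train_single, train_pancancer, test), where:
      - test is ~1/n_folds of the target cancer type's samples
      - train_single is the rest of the target cancer type (case 1)
      - train_pancancer is train_single plus all other samples (case 2)
    """
    if not 0 <= fold < n_folds:
        raise ValueError("fold must be in [0, n_folds)")

    # Separate samples of the target cancer type from everything else.
    target = [s for s, c in zip(sample_ids, cancer_types) if c == target_type]
    other = [s for s, c in zip(sample_ids, cancer_types) if c != target_type]

    # Shuffle with a fixed seed so every fold sees the same ordering.
    rng = random.Random(seed)
    rng.shuffle(target)

    # Every n_folds-th sample (offset by fold) is held out for testing.
    test = target[fold::n_folds]
    held_out = set(test)

    train_single = [s for s in target if s not in held_out]
    train_pancancer = train_single + other
    return train_single, train_pancancer, test
```

Both training sets share the same held-out test samples, so any difference in performance between the two cases comes from the added pan-cancer training data and not from a different test set.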

This PR only implements and tests the code that splits the gene expression data. The next PR will process mutation labels and add a script to run the comparison across the 50 most common mutations in TCGA.

Question: I currently store file paths in a config file at pancancer_utilities/config.py. These paths work fine when I install the package in development mode (e.g. using pip install -e .), but they break when I install it the standard way (e.g. using pip install .), because only the pancancer_utilities directory is copied and the root directory no longer exists relative to the installed package. Do you have suggestions for a better way to specify file paths so that they are accessible from anywhere in the root directory?

I've been trying to avoid hardcoding the root directory (I currently use pathlib.Path(__file__) to get it dynamically instead), but if hardcoding it is the easiest solution, I'm fine with doing that.
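
For illustration, one possible middle ground between hardcoding and relying only on pathlib.Path(__file__) is to let an environment variable override the root, with a dynamic fallback. The sketch below is an assumption and not what the repository does: the variable name `PANCANCER_ROOT` and the helper `get_repo_root` are hypothetical.

```python
import os
from pathlib import Path


def get_repo_root(env_var="PANCANCER_ROOT", fallback=None):
    """Hypothetical helper: locate the repository root directory.

    Uses the environment variable if it is set, otherwise the fallback
    path supplied by the caller.
    """
    root = os.environ.get(env_var)
    if root:
        return Path(root).expanduser().resolve()
    if fallback is not None:
        return Path(fallback).resolve()
    raise RuntimeError(
        "Set {} or pass a fallback path to locate the repository root".format(env_var)
    )


# In config.py, the fallback could be the current dynamic lookup, e.g.:
#   REPO_ROOT = get_repo_root(fallback=Path(__file__).resolve().parent.parent)
#   DATA_DIR = REPO_ROOT / "data"   # "data" is an illustrative directory name
```

With this arrangement a development install keeps working through the fallback, while a standard install can be pointed at the checkout by setting the environment variable once.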

jjc2718 commented 3 years ago

Thanks for the config file/file paths ideas! I'm going to wait on this for now since I want to get my experiments running, but I created #3 to revisit how I'm doing this in the future.