Everything important (e.g., model weights) should be accessible through the shared project directory.
I primarily used my home directory to store the code, since I could sync it with GitHub from there, whereas on the shared project directory we have no internet access to perform “git pull/push” operations.
Thus, for files/folders that were too large to store in HOME, I used symbolic links to folders located in the shared project directory. However, I no longer remember exactly how I laid everything out, and I no longer have access to the VPN to connect and check.
Nonetheless, all the important stuff we would need, like model checkpoints, should be stored in the shared project directory.
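For anyone reconstructing that layout, here is a minimal sketch of the symlink approach and of a way to list which links exist under a directory. The function names and the directory arguments are hypothetical placeholders for illustration; they are not the actual paths or scripts used on the cluster.

```python
import os
from pathlib import Path


def link_large_dir(home_dir: Path, shared_dir: Path, name: str) -> Path:
    """Keep `name` in the shared directory and expose it under HOME via a symlink.

    All three arguments are hypothetical placeholders, not the real cluster paths.
    """
    target = shared_dir / name            # the real (large) data lives here
    target.mkdir(parents=True, exist_ok=True)
    link = home_dir / name                # lightweight pointer next to the code
    if link.is_symlink():
        return link                       # already linked, nothing to do
    if link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink")
    link.symlink_to(target, target_is_directory=True)
    return link


def list_symlinks(root: Path) -> dict[Path, Path]:
    """Map every symlink found under `root` to the path it resolves to."""
    found = {}
    # followlinks=False: report linked directories without descending into them
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for entry in dirnames + filenames:
            path = Path(dirpath) / entry
            if path.is_symlink():
                found[path] = path.resolve()
    return found
```

Running `list_symlinks` on the home checkout should show which folders actually live in the shared project directory, which is the quickest way to recover the layout described above.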
The training splits, as well as all my most recent code, can be found on the GitHub page.
All the issues we encountered with this project are tracked via GitHub. I list some of the more relevant issues below:
Basically, the only ones that matter are results/model_checkpoints and v103. The rest are just some tests I did to resolve/debug issues.
results/model_checkpoints
- These are the models trained on random splits
v103
- pocket-only representation checkpoints
v113
- new training split where we excluded highly targeted (OncoKB) proteins from training.
v115
- since "aflow" (alphaflow edge weights) models had a smaller dataset (due to memory issues when running Alphaflow on AA sequences 1200+) we artificially reduced the sizes of the training sets for the other models so that we could have a fair comparison
v128
- Test to see if new splits were the issue with weirdly low performance with OncoKB split (they were)
When we originally started looking into OncoKB, I selected highly targeted proteins from OncoKB to be excluded from training sets.
Stats on the distribution differences between the manually curated OncoKB dataset split and a random split can be found on the issue page.
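To make the idea of that split concrete, here is a minimal sketch of excluding a chosen set of proteins from the training data. It is only an illustration: the record layout, the `prot_id` key, and the function name are assumptions, and the actual split code and protein list are the ones on the GitHub page.

```python
def exclude_proteins(records, excluded_proteins, key="prot_id"):
    """Split records into (kept, held_out) by protein ID.

    `records` is assumed to be a list of dicts and `key` the field holding the
    protein ID; both are hypothetical and may not match the real data files.
    """
    excluded = set(excluded_proteins)
    kept = [r for r in records if r[key] not in excluded]
    held_out = [r for r in records if r[key] in excluded]
    return kept, held_out


def held_out_fraction(records, excluded_proteins, key="prot_id"):
    """Fraction of all records removed from training by the exclusion list."""
    _, held_out = exclude_proteins(records, excluded_proteins, key=key)
    return len(held_out) / len(records) if records else 0.0
```

Because the exclusion is by protein rather than by individual record, every binding record for an excluded protein leaves the training set together, which is one reason such a split can differ in distribution from a random one.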
This means for the pocket versions of our models we can’t readily use existing scripts to get the pocket sequence graph based on the PDBs provided.
This tracks how the pocket representation of the Davis and Kiba models was built. Pull request 135 resolves this, with the results in the CSV files.