meyer-lab / mechanismEncoder

Developing patient-specific phosphoproteomic models using mechanistic autoencoders

implement pretraining pipeline #16

Closed FFroehlich closed 3 years ago

FFroehlich commented 3 years ago

Adds an implementation of pretraining. With this setup, model training with 10 local starts for the full problem can be done in under 1 h with 4 local cores, without using parallelization in all steps.

Will add more documentation over the coming days.
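The "10 local starts" training scheme referred to above can be sketched as generic multi-start local optimization. This is an illustrative toy (Rosenbrock objective, SciPy's L-BFGS-B), not the repository's actual objective or optimizer setup:

```python
import numpy as np
from scipy.optimize import minimize


def objective(x):
    # Rosenbrock function as a stand-in for the model's objective
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)


def local_start(seed):
    # one local optimization from a random initial point
    rng = np.random.default_rng(seed)
    x0 = rng.uniform(-2.0, 2.0, size=4)
    res = minimize(objective, x0, method="L-BFGS-B")
    return res.fun, res.x


# 10 independent local starts; in practice these could be distributed
# across 4 cores, e.g. with multiprocessing.Pool(4).map(local_start, range(10))
results = [local_start(seed) for seed in range(10)]
best_fun, best_x = min(results, key=lambda r: r[0])
```

Keeping the best of several random starts is what makes the nonconvex fit robust; the per-start runtime is what the core count bounds.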

codecov-io commented 3 years ago

Codecov Report

Merging #16 (8970883) into master (37a2310) will decrease coverage by 12.42%. The diff coverage is 64.63%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master      #16       +/-   ##
===========================================
- Coverage   95.76%   83.33%   -12.43%     
===========================================
  Files           7        9        +2     
  Lines         354      414       +60     
===========================================
+ Hits          339      345        +6     
- Misses         15       69       +54     
Flag      | Coverage Δ
unittests | 83.33% <64.63%> (-12.43%) ↓

Flags with carried forward coverage won't be shown.

Impacted Files                 | Coverage Δ
mEncoder/pretraining.py        | 0.00% <0.00%> (ø)
mEncoder/training.py           | 92.15% <86.66%> (-1.03%) ↓
mEncoder/encoder.py            | 94.28% <91.30%> (+1.42%) ↑
mEncoder/autoencoder.py        | 100.00% <100.00%> (+1.36%) ↑
mEncoder/generate_data.py      | 100.00% <100.00%> (ø)
mEncoder/mechanistic_model.py  | 90.81% <100.00%> (-0.78%) ↓
mEncoder/petab_subproblem.py   | 100.00% <100.00%> (ø)
mEncoder/test/test_encoder.py  | 100.00% <100.00%> (ø)
mEncoder/test/test_model.py    | 100.00% <100.00%> (ø)
... and 2 more
... and 2 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 37a2310...8970883.

FFroehlich commented 3 years ago

This fixes multiple bugs that led to an incorrect formulation of the whole problem. For a medium-size model (19 proteins, 22 phospho-sites), a full training run takes 2 h on a desktop machine with 4 cores, without using parallelization in all steps.

data: synthetic__FLT3_MAPK_AKT_STAT.pdf

fit: FLT3_MAPK_AKT_STATsynthetic2fidesfit.pdf
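The pretraining idea this PR implements can be illustrated with a generic two-stage scheme: first fit a reduced subproblem (a subset of parameters with the rest fixed), then use that solution to initialize the full optimization. The toy linear least-squares objective and the particular parameter split below are assumptions for illustration only, not the repository's actual pipeline:

```python
import numpy as np
from scipy.optimize import minimize

# toy data: 20 observations of a 6-parameter linear model, true theta = 1
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 6))
y = A @ np.ones(6) + 0.01 * rng.normal(size=20)


def full_objective(theta):
    # sum-of-squares misfit of the full 6-parameter model
    return np.sum((A @ theta - y) ** 2)


def sub_objective(theta_sub):
    # stage 1 subproblem: optimize the first 3 parameters, rest fixed at 0
    theta = np.concatenate([theta_sub, np.zeros(3)])
    return full_objective(theta)


# Stage 1: pretrain on the reduced subproblem
pre = minimize(sub_objective, np.zeros(3), method="L-BFGS-B")

# Stage 2: solve the full problem, initialized from the pretrained point
x0 = np.concatenate([pre.x, np.zeros(3)])
fit = minimize(full_objective, x0, method="L-BFGS-B")
```

Starting the full problem from a pretrained point typically cuts the number of expensive full-model iterations, which is where the reported speedup would come from.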

aarmey commented 3 years ago

@FFroehlich I've updated the testing on master to use a self-hosted runner, to get around the env changes with Github Actions. Let me know if you have any difficulties once you merge these changes.

FFroehlich commented 3 years ago

> @FFroehlich I've updated the testing on master to use a self-hosted runner, to get around the env changes with Github Actions. Let me know if you have any difficulties once you merge these changes.

Oh sorry, I had already fixed everything necessary to adapt to the new env API on GHA, so that wouldn't have been necessary. With the self-hosted runner, the individual jobs seem to be queued for quite a while.

aarmey commented 3 years ago

Our lab's queue definitely varies from day to day, and is behind today... absolutely feel free to change it back if you'd like.