univieCUBE / deepnog

Protein orthologous group assignment with deep learning
BSD 3-Clause "New" or "Revised" License

Training facilities #16

Closed VarIr closed 4 years ago

VarIr commented 4 years ago

So far, deepnog only supports inference with models that we, the developers, trained in separate Jupyter notebooks. This does not scale to a larger number of models for more levels of EggNOG, or even other orthology databases.

This PR introduces components that allow users to train custom models. The primary use case is taking the DeepNOG (=DeepEncoding) architecture and training it on additional levels of EggNOG. Additionally, different architectures can be introduced and trained with reasonable effort.

At the heart of this PR is the new training.py, which runs training and validation epochs. The Dataset and DataLoader classes now support labels, and a ShuffledDataset is introduced that still iterates over the FASTA file but can shuffle the input data (subject to a user-defined buffer size). EDIT: An additional ProteinDataset class is introduced that features random access. This enables complete shuffling of sequences, ensuring that minibatches differ between epochs, which might affect training. This comes at the cost of first loading the complete FASTA file and storing it in memory. ProteinDataset might replace ShuffledDataset.
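
To illustrate the trade-off between the two shuffling strategies, here is a minimal sketch in plain Python. It is not the code from this PR: `buffered_shuffle` and `full_shuffle` are hypothetical helper names, and the real ShuffledDataset and ProteinDataset classes may work differently. The first function shuffles a stream while holding only `buffer_size` items, the second loads everything first and shuffles completely.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(items: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield items of a stream in approximately shuffled order.

    Hypothetical helper for illustration only (not deepnog's API).
    Memory use is bounded by ``buffer_size`` items.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be >= 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and reuse its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: drain the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer


def full_shuffle(items: Iterable[T], seed: int = 0) -> List[T]:
    """Load all items into memory, then shuffle them completely.

    Hypothetical helper: mirrors the random-access idea, at the cost
    of holding the whole input in memory.
    """
    records = list(items)
    random.Random(seed).shuffle(records)
    return records
```

With a small buffer, an item can move only a limited distance forward from its original position, so minibatches stay correlated with file order. A full shuffle removes that limit but needs the whole input in memory.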

The client now uses two subparsers: train and infer.
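
A two-subcommand client like this is commonly built with `argparse` subparsers. The following is a sketch under assumptions, not the actual deepnog CLI: only the subcommand names `train` and `infer` come from this PR, while every argument and option shown (`training_sequences`, `--n-epochs`, `sequences`, `--out`) is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with 'train' and 'infer' subcommands.

    All arguments below are hypothetical placeholders.
    """
    parser = argparse.ArgumentParser(prog="deepnog")
    # dest stores the chosen subcommand name; required=True rejects
    # an invocation without any subcommand.
    subparsers = parser.add_subparsers(dest="command", required=True)

    train = subparsers.add_parser("train", help="train a custom model")
    train.add_argument("training_sequences", help="FASTA file (hypothetical argument)")
    train.add_argument("--n-epochs", type=int, default=1,
                       help="number of epochs (hypothetical option)")

    infer = subparsers.add_parser("infer", help="predict with a trained model")
    infer.add_argument("sequences", help="FASTA file (hypothetical argument)")
    infer.add_argument("--out", default=None,
                       help="output file (hypothetical option)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.command)
```

Each subparser owns its arguments, so options that only make sense for training do not appear in the help text of `infer`.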

The general package structure was reworked to offer more intuitive modularity.

deepnog now uses a YAML configuration file, which includes the supported databases (for inference), architectures (for training) and possibly more.

lgtm-com[bot] commented 4 years ago

This pull request introduces 4 alerts when merging 3c0934d941d290e5fa2d17e7123cd59c36a75df3 into 905ae06d31d0492ba6b7675f4bf463f0295742fa - view on LGTM.com

new alerts:

lgtm-com[bot] commented 4 years ago

This pull request introduces 3 alerts when merging 5ca4e1a3432f9a8c672df47feefeef942cb610ce into 905ae06d31d0492ba6b7675f4bf463f0295742fa - view on LGTM.com

new alerts:

codecov[bot] commented 4 years ago

Codecov Report

Merging #16 into master will increase coverage by 1.17%. The diff coverage is 96.37%.

@@            Coverage Diff             @@
##           master      #16      +/-   ##
==========================================
+ Coverage   95.48%   96.66%   +1.17%     
==========================================
  Files          14       33      +19     
  Lines         576     2036    +1460     
==========================================
+ Hits          550     1968    +1418     
- Misses         26       68      +42     

Impacted Files                       Coverage Δ
deepnog/learning/training.py         85.30% <85.30%> (ø)
deepnog/utils/io_utils.py            96.36% <89.47%> (ø)
deepnog/learning/inference.py        96.00% <93.10%> (ø)
deepnog/utils/network.py             94.00% <94.00%> (ø)
deepnog/models/deepfam.py            95.75% <95.75%> (ø)
deepnog/utils/tests/test_utils.py    96.19% <96.19%> (ø)
deepnog/models/deepnog.py            96.92% <96.92%> (ø)
deepnog/client/client.py             97.54% <97.54%> (ø)
deepnog/data/dataset.py              97.80% <97.80%> (ø)
deepnog/client/tests/test_cli.py     99.38% <99.38%> (ø)
... and 47 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 905ae06...59ef6b7.

lgtm-com[bot] commented 4 years ago

This pull request introduces 1 alert when merging 3c2883c67a7ef16afa065a65e8fb8aa652f8160f into 905ae06d31d0492ba6b7675f4bf463f0295742fa - view on LGTM.com

new alerts:

lgtm-com[bot] commented 4 years ago

This pull request introduces 1 alert when merging 7932833ab163fe736394ecff593de585e9759150 into 905ae06d31d0492ba6b7675f4bf463f0295742fa - view on LGTM.com

new alerts:

VarIr commented 4 years ago

Closes #18

lgtm-com[bot] commented 4 years ago

This pull request introduces 1 alert when merging f5363593291536d343eeda333cf9ce6ed1e414e6 into 905ae06d31d0492ba6b7675f4bf463f0295742fa - view on LGTM.com

new alerts: