openkim / kliff

KIM-based Learning-Integrated Fitting Framework for interatomic potentials.
https://kliff.readthedocs.io
GNU Lesser General Public License v2.1

Design Review 1: KLIFF Trainer framework #172

Closed: ipcamit closed this 3 months ago

ipcamit commented 5 months ago

Idea: a coherent, fully reproducible framework to train all supported OpenKIM models (including ML ones), with the ability to archive the exact training methodology so it can be reused and made available from OpenKIM.

The easiest approach is a yaml file that can be hashed to ensure integrity; the hash can then be included in kimspec.edn etc. for provenance. This PR contains a first draft of the framework, along with its implementation for KIM physics-based models. I think it is better to have a look at it now rather than reviewing huge changes all at once. An example yaml file is included at the bottom.
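For concreteness, a minimal sketch of how such a manifest hash could be computed (this is illustrative, not necessarily the exact scheme in the PR):

```python
# Minimal sketch: hash a training manifest for provenance. The exact
# canonicalization used by KLIFF may differ; this is illustrative only.
import hashlib
import yaml

with open("manifest.yaml") as f:
    manifest = yaml.safe_load(f)

# Re-dump with sorted keys so the digest does not depend on key order
# or incidental formatting in the original file.
canonical = yaml.safe_dump(manifest, sort_keys=True)
digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
print(digest)  # this value could then be recorded in kimspec.edn
```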

Core changes and contributions:

  1. Dataset: the dataset object can now load weights from a file (I think this was also on one of the todo lists). The weights file has to be a 4-column, whitespace-separated ASCII file that numpy can load directly. The four columns are the configuration, energy, forces, and stress weights. The file must have either 1 row (all configurations get the same weights) or n rows (where n is the number of configurations to be weighted). These per-configuration weights will be used in both ML and KIM-based training; see the sketch after this list.
  2. _exceptions: this is just a proposal, and not a very strong one at that, but I think it might be useful to collect all of the file-based exceptions under a single file. This gives the flexibility to use them anywhere in KLIFF more freely. Currently we might run into circular imports if two related modules want to use the same exception (for example, the Trainer base class and KIMTrainer should both raise a TrainerError). Let me know your thoughts.
  3. kim_residuals: a simple collection of the most used loss functions. Another idea is to save a pickled version of the loss function for reproducibility.
  4. kliff_trainer: the base class of the trainer framework; all trainers must inherit from it. It provides some basic functionality:
     a. initialize the class members from yaml / dict;
     b. seed all modules: numpy, scipy, torch, etc.;
     c. set up the workspace and model folders, where each independent run is automatically timestamped and saved;
     d. set up the dataset: load the data, apply any property transformations (e.g. energy normalization), compute fingerprint objects, hash the dataset based on path, transforms, and weights, and save the processed dataset as a dill pickle object, so that the next time any trainer needs the same dataset configuration it can be reloaded directly;
     e. set up the test/train split, based on index files, ranges, etc., splitting the dataset into test and train data;
     f. call the trainer-specific functions setup_model, setup_parameter_transforms, and setup_optimizer;
     g. hash and archive the training configuration.
  5. kim_trainer: the first example of subclassing the trainer class, for KIM physics-based models. It basically extends setup_model, setup_parameter_transforms, setup_optimizer, and save_kim_model.
  6. enumerations: an enumerated list of the supported options.
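To make item 1 concrete, here is a minimal sketch of loading such a weights file with numpy (the file name and broadcasting logic are illustrative, not the PR's exact implementation):

```python
import numpy as np

# Hypothetical weights file "weights.dat": whitespace-separated ASCII,
# columns = configuration, energy, forces, stress weights, e.g.
#   1.0  1.0  0.5  0.0
#   1.0  2.0  0.5  0.0
weights = np.loadtxt("weights.dat", ndmin=2)  # shape (1, 4) or (n, 4)

n_configs = 10  # number of configurations in the dataset
if weights.shape[0] == 1:
    # a single row means every configuration gets the same weights
    weights = np.tile(weights, (n_configs, 1))
assert weights.shape == (n_configs, 4), "weights file must have 1 or n rows"
config_w, energy_w, forces_w, stress_w = weights.T
```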

Tar model: a KIM model tarred into a single file, for future-proofing against LLNL/KIMKit.

Question: is there an easy way for kimpy to tell me which model driver a model is using?

Future direction: I would like to eventually have a Trainer for each model that can be submitted to OpenKIM. For v1 we will have Torch and KIM; the Torch trainer will also have a companion Lightning-based trainer in v1. For future versions, I think it would make sense to have Trainers that wrap around libraries like QUIP, to provide a complete training surface for newer QUIP-driver-based models, and so on. I would also like a Trainer built around the bootstrapping and MCMC sampling that Yonathan contributed, but that will take a little more time, so perhaps v1.2. For v1 the UC part will remain unchanged.

Let me know your thoughts and questions. Sorry for the huge wall of text, but had I delayed it any longer, it would only have grown!

Example yaml file:

```yaml
# Example yaml for the kliff trainer
workspace:
  name: test_run # Name of the base workspace folder, where all the runs will be stored
  seed: 12345 # Seed for the random number generators of all modules
  resume: False # Resume training from a previous run
  walltime: 2:00:00 # Walltime for the run

dataset:
  type: ase # ase or path or colabfit
  path: Si.xyz # Path to the dataset
  save: False # Save processed dataset to a file
  shuffle: False # Shuffle the dataset
  keys:
    energy: Energy # Key for energy, if ase dataset is used
    forces: forces # Key for forces, if ase dataset is used
  training_dataset:
    train_size: 3 # Number of training samples
    train_indices: # file with indices [optional]
  val_dataset:
    val_size: 1 # Number of validation samples
    val_indices: # file with indices [optional]
  test_dataset:
    test_size:
    test_indices:
  colabfit_dataset:
    dataset_name:
    database_name:
    database_url:

model:
  model_type: kim # kim or torch or tar
  model_path: ./
  model_name: SW_StillingerWeber_1985_Si__MO_405512056662_006

transforms:
  property: # optional: path to property transform file
    name: NormalizedPropertyTransform
    instance: # optional: instance of custom property transform
    property_key: energy # optional: key for the property to transform
  parameter:
    instance:
    parameter_list: # optional for KIM models, list of parameters to optimize
      - A: # a dict means the parameter is transformed
          name: LogParameterTransform
          value: 0.1
          bounds: [[-5, 5]]
      - B # these are the parameters that are not transformed
      - sigma
  configuration: # optional: generate fingerprints from the configuration
    name: Descriptor
    kwargs:
      cutoff: 3.7
      species: ["Si"]
      descriptor: "SymmetryFunctions"
      hyperparameters: set51
    instance:

training:
  loss:
    function: MSE
    weights: # optional: path to weights file
      energy: 1.0
      forces: 1.0
      config: 1.0
      stress: 0.0
    normalize_per_atom: true
  optimizer:
    name: L-BFGS-B
    provider: scipy
    learning_rate:
    kwargs:
      tol: 1.e-6 # the "1." is necessary; 1e-6 would be treated as a string
  batch_size: 1
  epochs: 1
  device: cpu
  num_workers: 2
  chkpt_interval: 1
  stop_condition:
  verbose: True

export: # optional: export the trained model
  model_type: tar # kim or tar
  model_path: ./
  model_name: SW_StillingerWeber_trained_1985_Si__MO_405512056662_006
```
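The `tol` comment above reflects a real PyYAML (YAML 1.1) quirk worth remembering:

```python
import yaml

# Without the dot, PyYAML's YAML 1.1 resolver does not recognize the
# scientific-notation float and hands back a string instead.
print(yaml.safe_load("tol: 1e-6"))   # {'tol': '1e-6'}  (a string!)
print(yaml.safe_load("tol: 1.e-6"))  # {'tol': 1e-06}   (a float)
```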
codecov-commenter commented 5 months ago

Codecov Report

Attention: Patch coverage is 2.31548%, with 675 lines in your changes missing coverage. Please review.

Project coverage is 55.43%. Comparing base (9ee3ce8) to head (2f57f56).

:exclamation: Current head 2f57f56 differs from the pull request's most recent head d3050a6. Consider uploading reports for the commit d3050a6 to get more accurate results.

| Files | Patch % | Lines |
|---|---|---|
| kliff/trainer/kliff_trainer.py | 0.00% | 346 Missing :warning: |
| kliff/trainer/kim_trainer.py | 0.00% | 182 Missing :warning: |
| kliff/trainer/option_enumerations.py | 0.00% | 87 Missing :warning: |
| kliff/dataset/dataset.py | 23.63% | 42 Missing :warning: |
| kliff/utils.py | 27.27% | 8 Missing :warning: |
| kliff/trainer/kim_residuals.py | 0.00% | 5 Missing :warning: |
| kliff/_exceptions.py | 0.00% | 3 Missing :warning: |
| kliff/trainer/__init__.py | 0.00% | 2 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##               v1     #172      +/-   ##
==========================================
- Coverage   63.14%   55.43%    -7.71%
==========================================
  Files          50       56       +6
  Lines        4718     5401     +683
==========================================
+ Hits         2979     2994      +15
- Misses       1739     2407     +668
```


mjwen commented 4 months ago

Sorry for the late reply!

Question: is there an easy way for kimpy to tell me which model driver a model is using?

I cannot remember. Better check with Ryan.

I only have major comments for the trainer.

  1. I don't think the trainer needs to deal with the dataset and such (like dataset initialization and splitting). Instead, the train/val/test datasets should be set up outside the trainer, and their instances passed to the `fit` / `test` functions of the trainer. This follows the lightning approach, and I think it is a good one.

  2. Related to 1, it might be better to pass only the config dict info related to training to the trainer; everything else (model, dataset, transforms) should be passed in as an instance. Basically, I am thinking of a trainer like

    class Trainer:
        def __init__(self, model: Union[KIMModel, TorchModel, TarModel], transform: Transform, training_config: dict):
            ...

        def fit(self, train_loader, val_loader):
            ...

        def test(self, test_loader):
            ...

        # other necessary functions, such as save_kim_model

    As you can see, this is again modeled after lightning. Moving the instantiation of the model, dataset/loader, and transform out of the Trainer makes it much simpler, which should make it easier to maintain.

  3. There exist good packages for dealing with configs that we can take advantage of. We don't need to do all the checking and such in the trainer for all hyperparameters, which would be very difficult to maintain in the long run. Specifically, I am considering omegaconf and hydra (the latter uses the former). The nice feature of hydra is that you can define a base config that is shared by all, and customize it for various specific cases. Also, it is composable, making it much easier to separate the configs for the various parts. schnetpack uses it: https://github.com/atomistic-machine-learning/schnetpack/blob/master/src/schnetpack/configs/train.yaml and I've used it in one of my projects as well: https://github.com/mjwen/rxnrep/tree/main/configs, which may give you a feel for what it looks like. hydra also has utility functions, e.g. for instantiating classes (https://hydra.cc/docs/advanced/instantiate_objects/overview/), which we can take advantage of instead of using importlib. A sketch is given below.
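For example, a hedged sketch of hydra's `instantiate` utility (the `_target_` path assumes kliff exposes `KIMModel` under `kliff.models` with a `model_name` keyword; adjust as needed):

```python
# Sketch of hydra-style instantiation from a config, in place of importlib.
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {
        "_target_": "kliff.models.KIMModel",
        "model_name": "SW_StillingerWeber_1985_Si__MO_405512056662_006",
    }
)
model = instantiate(cfg)  # equivalent to KIMModel(model_name=...)
```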

ipcamit commented 4 months ago

I only have major comments for the trainer.

1. I don't think the trainer needs to deal with the dataset and such (like dataset initialization and splitting). Instead, the train/val/test datasets should be set up outside the trainer, and their instances passed to the `fit` / `test` functions of the trainer. This follows the lightning approach, and I think it is a good one.

I can look into the hydra library, but for datasets, models, etc., do you think it would be better to have different utility functions to set up datasets and models? The whole idea is to make sure that, using this single yaml file, you get the same results in terms of dataset splits etc.

mjwen commented 4 months ago

Yes, I'm still imagining a single yaml file to configure all the components, but the single yaml file can be composed from a couple of separate files. If one does not like to use multiple files, all the parts can still be put in a single file.

do you think it would be better to have different utility functions to set up datasets and models?

I guess we can provide these low-level parts for more granularity; they would be extremely useful if one wants to build a new trainer for a new model and such. At the same time, we can use the low-level functionality to achieve one-yaml-file-style setups for reproducibility. This can be done, e.g., via an example training script that first instantiates all the components (e.g. dataset and model) and then passes them to the trainer to train the model, log metrics, and so on. I guess the major difference is whether the classes are instantiated inside the Trainer or outside; I think the latter is slightly better.

mjwen commented 4 months ago

I have mixed feelings about hydra: it can simplify config management, but tracking and debugging can be a bit more involved. I am not sure whether we want to use it here; perhaps you can explore it a bit and we can then decide. But even if we don't use hydra, I think we should still use OmegaConf to read, merge, and write config files, instead of the vanilla yaml module. Something like the sketch below.
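A minimal sketch of what that would look like (file names are illustrative):

```python
from omegaconf import OmegaConf

# Load a base config and merge user overrides on top of it; later
# configs win on conflicting keys.
base = OmegaConf.load("base.yaml")
user = OmegaConf.load("user.yaml")
cfg = OmegaConf.merge(base, user)

# Write the fully resolved config back out, e.g. for archiving.
OmegaConf.save(cfg, "resolved.yaml")
```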

ipcamit commented 4 months ago

Yes, I'm still imagining a single yaml file to configure all the components, but the single yaml file can be composed from a couple of separate files. If one does not like to use multiple files, all the parts can still be put in a single file.

This condition is already satisfied to some extent. The configuration is split into 6 independent blocks:

  1. workspace: local folder/cache/other information directory.
  2. dataset: instance of a KLIFF dataset.
  3. model: instance of a KIMModel or TorchScript model.
  4. transforms: instances of property, parameter, and configuration transforms.
  5. training: loss function, optimizer, verbosity.
  6. export: export the trained model as a KIM-API model or tarball.

I guess I can try splitting it into 3 modules:

  1. Dataset: set up the dataset, along with the test/train split.
  2. Model: set up KIM, Torch, Tar models, etc. It will also consume the export block.
  3. Trainer: workspace management (base class), training loop (child class), optimizer (child class).

The optimizer could be a separate module, but it would be difficult to organize as one owing to scipy's different API (see the sketch below). We can make a private optimizer module for torch.
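The mismatch in a nutshell (a toy illustration, not KLIFF code): scipy drives the whole minimization through a single call, while torch expects the training loop to drive the optimizer step by step.

```python
import numpy as np
import torch
from scipy.optimize import minimize

# scipy: hand the optimizer a callable and it runs the whole minimization.
f = lambda x: float(np.sum((x - 1.0) ** 2))
result = minimize(f, x0=np.zeros(3), method="L-BFGS-B", tol=1e-6)

# torch: the training loop calls the optimizer one step at a time.
x = torch.zeros(3, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = torch.sum((x - 1.0) ** 2)
    loss.backward()
    opt.step()
```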

Thoughts, suggestions?

Reasons for the above division:

  1. workspace: can't be independent, as this is specifically set up and used by the trainer.
  2. dataset: can be independent; currently set up using setup_dataset.
  3. model: can be independent; currently set up using setup_model.
  4. transforms: parameter transforms cannot be independent, and configuration transforms are only used by torch, so a general module might not be that coherent. So they are included in the Trainer.
  5. training: trainer module.
  6. export: can be merged with model.

mjwen commented 4 months ago

Yes, the previous yaml file does serve the purpose of separating them into different parts. We just need to organize the internal code in that direction a bit.

I agree with the three-module approach. Particularly, we want to integrate the optimizer with a trainer for the exact reason you've mentioned.

Also agreed that workspace settings need to be separate, and not associated with a trainer.

ipcamit commented 3 months ago

  1. Simplified the trainer.
  2. Initialization methods for model and datasets.

A note on terminology: everywhere, the configuration dictionary is called a "manifest", as "configuration" was becoming too confusing; for example, "dataset_from_configuration" creates confusion about whether it uses the kliff.Configuration object. Hence I suggest we use "manifest" as the word of choice for yaml configurations. So the KLIFF trainer takes in a training manifest, splits it into model, dataset, and transform manifests, and passes those to appropriate static functions in Dataset and KIMModel, which return the required objects; a sketch of the flow is below.
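A hedged sketch of that flow (the `..._from_manifest`-style names here are placeholders, not necessarily the PR's actual function names):

```python
import yaml

with open("training_manifest.yaml") as f:
    training_manifest = yaml.safe_load(f)

# The trainer splits the manifest into per-component blocks...
dataset_manifest = training_manifest["dataset"]
model_manifest = training_manifest["model"]
transform_manifest = training_manifest.get("transforms", {})

# ...and hands each block to the component that owns it, e.g.
# dataset = Dataset.get_dataset_from_manifest(dataset_manifest)  # hypothetical
# model = KIMModel.get_model_from_manifest(model_manifest)       # hypothetical
```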

Regarding OmegaConf: I did integrate it, but had to remove it afterwards, as it adds a lot of needless complexity with no clear advantage. The biggest issue is that it only supports "primitive" datatypes in Python; something as basic as a numpy array needs custom solutions. I think a plain dict is flexible enough to be the better solution.
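For reference, the kind of failure meant here (behavior as of recent OmegaConf versions; the exact exception type may vary):

```python
import numpy as np
from omegaconf import OmegaConf

plain = {"weights": np.ones(3)}    # fine as a plain dict
try:
    cfg = OmegaConf.create(plain)  # OmegaConf rejects non-primitive values
except Exception as err:
    print(type(err).__name__, err)  # e.g. UnsupportedValueType
```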

There might be some rough edges that will be ironed out as we add more functionality, like the ML trainer.

ipcamit commented 3 months ago

Disregard this for the time being; I need to work a bit more on it before opening the PR again.