Closed — ipcamit closed this pull request 3 months ago.
Attention: Patch coverage is 2.31548%, with 675 lines in your changes missing coverage. Please review. Project coverage is 55.43%, comparing base (9ee3ce8) to head (2f57f56).
:exclamation: Current head 2f57f56 differs from the pull request's most recent head d3050a6. Consider uploading reports for commit d3050a6 to get more accurate results.
:umbrella: View full report in Codecov by Sentry.
Sorry for the late reply!
Question: Is there an easy way for kimpy to tell me what model driver a model is using?
I cannot remember. Better check with Ryan.
I only have major comments for the trainer.
I don't think the trainer needs to deal with the dataset and such (like dataset initialization and splitting). Instead, the train/val/test datasets should be set up outside the trainer, and their instances passed to the `fit` / `test` functions of the trainer. This follows the lightning approach, and I think it is a good one.
Related to 1, it might be better to pass only the config-dict info related to training to the trainer; other things (like the model, dataset, and transforms) should all be passed in as instances. Basically, I am thinking of a trainer like
```python
from typing import Union

class Trainer:
    def __init__(self, model: Union[KIMModel, TorchModel, TarModel],
                 transform: Transform, training_config: dict): ...

    def fit(self, train_loader, val_loader): ...

    def test(self, test_loader): ...

    # other necessary functions, such as saving the KIM model
```
As you can see, this again models after lightning. Moving the instantiation of the model, dataset/loader, and transform out of the Trainer makes it much simpler, which should make it easier to maintain.
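To make the pattern concrete, here is a minimal runnable sketch of the "instantiate outside, pass in" style. The classes here are dummy stand-ins, not the real KLIFF types:

```python
# Dummy stand-ins; real KLIFF types (KIMModel, etc.) would replace them.
class DummyModel:
    def __call__(self, x):
        return 2 * x

class Trainer:
    # The model is built outside and passed in; the trainer only
    # receives instances plus a training-related config dict.
    def __init__(self, model, training_config: dict):
        self.model = model
        self.config = training_config

    def fit(self, train_data):
        # A real trainer would run an optimization loop here.
        return [self.model(x) for x in train_data]

trainer = Trainer(DummyModel(), {"epochs": 10})
print(trainer.fit([1, 2, 3]))  # [2, 4, 6]
```

The point is only the shape of the API: the trainer never constructs its collaborators itself.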
There exist good packages for dealing with configs that we can take advantage of. We don't need to do all the checking and such in the trainer for all hyperparameters, which would be very difficult to maintain in the long run. Specifically, I am considering omegaconf and hydra (the latter uses the former). The nice feature of hydra is that you can define a base config that is shared by all cases and customize it for specific ones. Also, it is composable, making it much easier to separate the configs for the various parts. schnetpack uses it (https://github.com/atomistic-machine-learning/schnetpack/blob/master/src/schnetpack/configs/train.yaml), and I've used it in one of my projects as well (https://github.com/mjwen/rxnrep/tree/main/configs), which may give you a feeling for what it looks like. hydra also has utility functions, e.g. for instantiating classes (https://hydra.cc/docs/advanced/instantiate_objects/overview/), which we can take advantage of instead of using importlib.
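For reference, the class-instantiation utility mentioned above (hydra's `instantiate`) boils down to something like the following stdlib-only sketch. The `instantiate` function here is a hand-rolled illustration of the idea, not hydra's actual implementation:

```python
import importlib

def instantiate(config: dict):
    """Minimal stand-in for hydra.utils.instantiate: build an object
    from a config dict whose "_target_" key is a dotted class path,
    with the remaining keys passed as constructor kwargs."""
    module_path, _, cls_name = config["_target_"].rpartition(".")
    cls = getattr(importlib.import_module(module_path), cls_name)
    kwargs = {k: v for k, v in config.items() if k != "_target_"}
    return cls(**kwargs)

# Hypothetical entry, as it might appear after loading a yaml config.
cfg = {"_target_": "fractions.Fraction", "numerator": 1, "denominator": 3}
frac = instantiate(cfg)
print(frac)  # 1/3
```

Hydra's real version adds recursion, validation, and partial instantiation on top of this core idea.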
> I only have major comments for the trainer.
>
> 1. I don't think the trainer need to deal with dataset and such (like dataset initializing and split). Instead, the train/val/test datasets should be set up outside the trainer, and their instances be passed to the `fit` \ `test` functions of the trainer. This is following the lightning approach and I think it is a good approach.
I can look into the hydra library, but for datasets, models, etc., do you think it would be better to have different utility functions to set up datasets and models? The whole idea is to make sure that, using this single yaml file, you can get the same results in terms of dataset splits etc.
Yes, I'm still imagining a single yaml file to configure all the components, but the single yaml file can be composed from a couple of separate files. If one does not like to use multiple files, all the parts can still be put in a single file.
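As an illustration of the composition idea, a hydra-style top-level file might look like this (all file names and keys here are hypothetical, just to show the shape):

```yaml
# configs/train.yaml -- top-level config composed from per-component files
defaults:
  - model: kim_sw          # configs/model/kim_sw.yaml
  - dataset: si_diamond    # configs/dataset/si_diamond.yaml
  - optimizer: lbfgs       # configs/optimizer/lbfgs.yaml

epochs: 100
seed: 42
```

The same keys could equally live in one file; composition just lets each part be swapped or overridden independently.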
> do you think it will be better to have different utility functions etc to setup datasets and models?
I guess we can provide these low-level parts for more granularity; these would be extremely useful if one wants to build a new trainer for a new model and such. But at the same time, we can use the low-level functionality to achieve one-yaml-file-style setups for reproducibility. This can be done, e.g., via an example training script that first instantiates all the components (e.g. dataset and model) and then passes them to the trainer to train the model, log metrics, and so on. I guess the major difference is instantiating the classes inside the Trainer or outside; I think the latter is slightly better.
I have mixed feelings about hydra: it can simplify config management, but tracking and debugging can be a bit more involved. I am not sure whether we want to use it here; perhaps you can explore it a bit and then we can decide. But even if we don't use hydra, I think we still need to use OmegaConf to read, merge, and write config files, instead of the vanilla yaml module.
> Yes, I'm still imagining a single yaml file to configure all the components, but the single yaml file can be composed from a couple of separate files. If one does not like to use multiple files, all the parts can still be put in a single file.
This condition is already satisfied to some extent: the configuration is split into 6 independent blocks. I guess I can try splitting it into 3 modules:

- `setup_dataset`
- `setup_model`
- the trainer itself

Reason for the above division: the optimizer could be a separate module, but it would be difficult to organize as one owing to scipy's different API. We can make a private optimizer module for torch.

Thoughts, suggestions?
Yes, the previous yaml file does serve the purpose of separating them into different parts. We just need to organize the internal code in that direction a bit.
I agree with the three-module approach. In particular, we want to integrate the optimizer with the trainer, for the exact reason you've mentioned.
Also agreed that workspace settings need to be separate, and not associated with a trainer.
A note on glossary: everywhere, the configuration dictionary is called a "manifest", as "configuration" was becoming too confusing; for example, "dataset_from_configuration" creates confusion about whether it uses the kliff.Configuration object. Hence I suggest we use "manifest" as the word of choice for yaml configurations. So the KLIFF trainer takes in a training manifest, splits it into model, dataset, and transform manifests, and passes these to the appropriate static functions in Dataset and KIMModel, which return the required objects.
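A minimal sketch of that manifest flow (all keys and values here are hypothetical; the real manifest layout may differ):

```python
# Hypothetical illustration: the trainer receives one training manifest
# (a plain dict loaded from yaml) and hands sub-manifests to components.
training_manifest = {
    "model": {"name": "SW_StillingerWeber_1985_Si__MO_405512056662_006"},
    "dataset": {"path": "Si_training_set.xyz", "weights_file": "weights.dat"},
    "transforms": {"parameters": ["A", "B", "sigma"]},
    "training": {"optimizer": "lbfgs", "max_iter": 100},
}

# Each component only ever sees its own slice of the manifest.
model_manifest = training_manifest["model"]
dataset_manifest = training_manifest["dataset"]
transform_manifest = training_manifest["transforms"]
print(sorted(training_manifest))  # ['dataset', 'model', 'training', 'transforms']
```

This keeps the single-yaml-file reproducibility story while letting Dataset and KIMModel expose plain functions that consume only their own manifest.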
Regarding OmegaConf: I did integrate it, but had to remove it afterwards, as it adds a lot of needless complexity with no clear advantage. The biggest issue was that it only supports "primitive" datatypes in Python; something as basic as a numpy array needs custom solutions. A plain dict is flexible enough to be the better solution, I think.
There might be some rough edges that will clear up as we add more functionality, like the ML trainer.
Disregard for the time being, I need to work a bit more on this before opening the PR again.
Idea: a coherent, fully reproducible framework to train all supported OpenKIM models (including ML ones), able to archive the exact training methodology for reuse and make it available from OpenKIM.
So the easiest approach is to create a yaml file which can be hashed to ensure integrity; this hash can then be included in `kimspec.edn` etc. for provenance. This PR contains the first draft of this framework, along with its implementation for KIM physics-based models. I think it is better to have a look at it now, before having to review huge changes all at once. An example yaml file is at the bottom.

Core changes and contributions:
- `Dataset`: the dataset object can now load weights from a file (I think this was also on one of the todo lists). The weights file has to be a 4-column, whitespace-separated ascii file that numpy can load directly. The 4 columns represent the configuration, energy, forces, and stress weights. The length of the file has to be either 1 (all weights are the same) or n (where n is the number of configurations the weights are to be set for). In both ML- and KIM-based training, these per-configuration weights will be used.
- `_exceptions`: this is just a proposal, and not a very strong one at that, but I think it might be useful to collect all of the file-based exceptions under a single file. This gives the flexibility to use them anywhere in KLIFF more freely. Currently we might run into circular imports if two related modules want to use the same exception (for example, the Trainer base class and KIMTrainer should both raise a Trainer error). Let me know your thoughts.
- `kim_residuals`: a simple collection of the most-used loss functions. Another idea is to save a pickled version of the loss function for reproducibility.
- `kliff_trainer`: the base class for the trainer framework; all trainers must inherit from it. It includes some basic functionality: a. initialize the class members from yaml / dict; b. seed all modules (numpy, scipy, torch, etc.); c. set up the workspace and model folders, where each independent run is automatically timestamped and saved; d. set up the dataset: load the data, apply any transformation of properties (like energy normalization), compute fingerprint objects, hash the dataset based on path, transforms, and weights, and save the processed dataset as a dill pickle object, so that the next time any trainer needs the same dataset configuration it can be reloaded directly; e. set up the train/test split, either based on index files or on ranges, splitting the dataset into test and train data; f. call the trainer-specific functions `setup_model`, `setup_parameter_transforms`, and `setup_optimizer`; g. hash and archive the training configuration.
- `kim_trainer`: the first example of subclassing the trainer class, for KIM physics-based models. It basically extends `setup_model`, `setup_parameter_transforms`, `setup_optimizer`, and `save_kim_model`.
- Tar model: a KIM model tarred into a single file, for future-proofing against LLNL/KIMKit.
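The hash-and-archive step (and the yaml-hash-for-provenance idea above) could be sketched as follows; `manifest_hash` is a hypothetical helper, not the actual KLIFF code:

```python
import hashlib
import json

def manifest_hash(manifest: dict) -> str:
    """Hash a training manifest for provenance (hypothetical helper).
    Serializing to canonical JSON with sorted keys makes the hash
    independent of dict key ordering."""
    blob = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

m1 = {"dataset": {"path": "data.xyz"}, "optimizer": {"name": "lbfgs"}}
m2 = {"optimizer": {"name": "lbfgs"}, "dataset": {"path": "data.xyz"}}
assert manifest_hash(m1) == manifest_hash(m2)  # order-independent
```

The resulting hex digest is what could be recorded in `kimspec.edn` (or similar) to tie an archived model to the exact manifest that produced it.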
Question: is there an easy way for kimpy to tell me what model driver a model is using?
Future direction: I would eventually like to have a Trainer for each model type that can be submitted to OpenKIM. For v1 we will have Torch and KIM; the Torch trainer will also have a companion Lightning-based trainer in v1. For future versions, I think it would make sense to have Trainers that wrap around libraries like QUIP, to provide a complete training surface for newer QUIP-driver-based models and so on. I would also like a Trainer built around the bootstrapping and MCMC sampling that Yonathan contributed, but that will take a little more time, so perhaps v1.2. For v1 the UC part will remain unchanged.
Let me know your thoughts and questions. Sorry for the huge wall of text, but I think had I delayed it any longer, it would only have grown!
Example yaml file: