openkim / kliff

KIM-based Learning-Integrated Fitting Framework for interatomic potentials.
https://kliff.readthedocs.io
GNU Lesser General Public License v2.1
34 stars · 20 forks

WIP: Implement bootstrap #107

Closed yonatank93 closed 1 year ago

yonatank93 commented 1 year ago

We want to implement a bootstrap method of UQ. In general, this works by taking the list of compute arguments and sampling from it with replacement. We then train the model on this sample of compute arguments; the resulting optimal parameters give one point in the ensemble.
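The resampling loop described above can be sketched as follows. This is a minimal illustration, not the KLIFF implementation; `compute_arguments` and `train` are hypothetical stand-ins for the list of compute arguments and the training routine:

```python
import numpy as np

def bootstrap_ensemble(compute_arguments, train, nsamples, seed=None):
    """Generate a bootstrap ensemble of optimal parameters.

    Each iteration resamples the compute arguments with replacement,
    retrains the model on the resampled set, and records the optimum
    as one member of the ensemble.
    """
    rng = np.random.default_rng(seed)
    n = len(compute_arguments)
    ensemble = []
    for _ in range(nsamples):
        # Sample n indices with replacement (the bootstrap step)
        idx = rng.integers(0, n, size=n)
        sample = [compute_arguments[i] for i in idx]
        # Train on the resampled set; the optimum is one ensemble member
        ensemble.append(train(sample))
    return ensemble
```

For example, with a toy `train` that just averages its sample, `bootstrap_ensemble(list(range(10)), lambda s: sum(s) / len(s), 5)` returns a list of five ensemble members.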

TODO:

yonatank93 commented 1 year ago

Note that this is still a very rough draft, though functional. The intended use of the class, at least for empirical models, is illustrated in examples/examples_bootstrap_SW_Si.py. I want to have a similar workflow/syntax for NN models.

mjwen commented 1 year ago

thanks @yonatank93! Let me know when ready and I'll take a look.

yonatank93 commented 1 year ago

I will @mjwen. Sorry that it is still very premature. By the way, the example for running bootstrap that I currently have reflects the workflow I intend users to use. That is, I want users to get an ensemble of parameters first before propagating the error to some other predictions, or in other words, the set of parameters is fixed. This is to answer your previous email.
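The fixed-ensemble workflow described here, where the set of parameters is generated once and then pushed through other predictions, might look roughly like this (a sketch with hypothetical names, not the actual KLIFF interface):

```python
import numpy as np

def propagate_uncertainty(param_ensemble, predict):
    """Push a fixed parameter ensemble through a prediction function.

    The ensemble is generated once (e.g., by bootstrapping) and held
    fixed; each member is evaluated, and the spread of the resulting
    predictions quantifies the propagated uncertainty.
    """
    preds = np.array([predict(p) for p in param_ensemble])
    return preds, preds.mean(axis=0), preds.std(axis=0)
```

The key design point is that `param_ensemble` is an input, not something regenerated per prediction, so every downstream quantity is computed from the same fixed set of parameters.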

mjwen commented 1 year ago

@yonatank93 totally ok! I just wanted to make sure I get notified when it is ready.

yonatank93 commented 1 year ago

@mjwen The bootstrap implementation is ready for you to look at. I haven't added anything in kliff/doc, I will work on this once we are ok with this implementation.

mjwen commented 1 year ago

hi @yonatank93 Looks great in general. I've added a few comments. Another general comment: can you please add type hints for the new stuff you've added? I saw some classes/functions have it but some do not.

yonatank93 commented 1 year ago

Hi @mjwen, I have applied your minor feedback. There are several things that I want to discuss with you.

mjwen commented 1 year ago


  • Currently I have bootstrap_compute_arguments for both BootstrapEmpiricalModel and BootstrapNeuralNetworkModel. For the neural network model in particular, we don't use compute arguments; we use the fingerprints to generate the bootstrap dataset. Do you have a preference: keep it as is, or call it bootstrap_compute_arguments for the empirical model and bootstrap_fingerprints for the neural network model? Another option is to call both bootstrap_dataset. What are your thoughts?

Let's keep what we have now. Even for the original neural networks, we do not have compute arguments (in the sense of KIM models), but fingerprints. We specifically named the fingerprints compute arguments to keep a uniform interface for the optimizer. In retrospect, it might not have been a good idea, but let's keep it consistent for now.

  • In your comment, I don't follow your point about creating compute arguments. I decided to export the bootstrap compute arguments as a JSON file containing only the identifiers, so that I export as little information as possible. I assume that when we instantiate Loss, which is a required argument for Bootstrap, the compute arguments are generated. In load_bootstrap_compute_arguments, my intention is to use this dictionary of identifiers together with the compute arguments in Loss to regenerate the bootstrap compute arguments by querying. Please let me know your thoughts on this.

You are right. I misunderstood. I was thinking you read in the bootstrapped identifiers and recreate the compute arguments, which is not ideal. But in fact, you only use them to index the compute arguments. This is all good.
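The save/load scheme discussed above, exporting only identifiers and later using them to index compute arguments that already exist (e.g., the ones held by Loss), can be sketched as follows; the helper names are hypothetical:

```python
import json

def save_bootstrap_identifiers(bootstrap_ids, path):
    """Export only the identifiers of each bootstrap sample.

    bootstrap_ids maps sample index -> list of configuration
    identifiers, so the file stays small: no compute arguments
    are serialized, only the names needed to look them up.
    """
    with open(path, "w") as f:
        json.dump(bootstrap_ids, f)

def load_bootstrap_samples(path, compute_arguments):
    """Rebuild bootstrap samples by indexing existing compute arguments.

    compute_arguments maps identifier -> compute argument; nothing is
    recreated, the identifiers are only used for lookup.
    """
    with open(path) as f:
        bootstrap_ids = json.load(f)
    # JSON keys are strings; convert sample indices back to int
    return {
        int(k): [compute_arguments[i] for i in ids]
        for k, ids in bootstrap_ids.items()
    }
```
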

A minor note: for docs, we've moved from the NumPy style to the Google style (see here for an example), so can you please update yours? It has some minor benefits: because we are using type hints now, we do not need to write the type in the docstring, and we use a package to parse the types from the hints and add them to the generated HTML docs.
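For reference, the combination looks like this: with type hints on the signature, a Google-style docstring omits the types entirely. The function below is purely illustrative, not from the codebase:

```python
import random

def resample(identifiers: list, nsamples: int) -> list:
    """Draw bootstrap samples of configuration identifiers.

    Args:
        identifiers: Identifiers of the configurations in the dataset.
        nsamples: Number of bootstrap samples to draw.

    Returns:
        A list of resampled identifier lists, one per bootstrap sample.
    """
    return [
        random.choices(identifiers, k=len(identifiers))
        for _ in range(nsamples)
    ]
```
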

yonatank93 commented 1 year ago

@mjwen I have updated the documentation to use google style. Hopefully, I didn't miss any. Another thing that I am planning to work on is writing a documentation page about bootstrapping, similar to what we have for MCMC.

Also, in the current version, there is no option to use parallelization when running bootstrap, aside from parallelization at the model evaluation level.

yonatank93 commented 1 year ago

Hi @mjwen. I just added the documentation page for bootstrapping.

mjwen commented 1 year ago

@yonatank93 thanks! We are almost there. Can you please address the comments?

yonatank93 commented 1 year ago

@mjwen Thanks for the feedback. I have addressed most of them. There are a couple that I want to discuss with you:

mjwen commented 1 year ago

@yonatank93 sorry for the late response. I just added comments on your two questions.

yonatank93 commented 1 year ago

@mjwen Thank you for the comments and clarification. I addressed your comments above. Please let me know your thoughts.

mjwen commented 1 year ago

@yonatank93 Once you've updated the last bit and made it pass the checks, I can merge it.

Note, you may need to pull the latest master branch to get the new pre-commit versions to pass the linting. Or you can find the versions here: https://github.com/openkim/kliff/blob/master/.pre-commit-config.yaml

yonatank93 commented 1 year ago

Ok. I might also want to make some changes to the weight module related to the issue mentioned in #116.

yonatank93 commented 1 year ago

@mjwen I applied what we just discussed. Now I see that adding a seed argument, as you suggested, is more elegant. I can also see how it automatically leads to reproducible UQ results, even if the user doesn't pay attention to it. I use the seed to create a local random number generator, which I think also addresses some of my reasoning above.
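The local-generator approach mentioned here, seeding a generator owned by the object rather than touching NumPy's global state, might look like this (a sketch with a hypothetical class name, not the merged code):

```python
import numpy as np

class BootstrapSampler:
    """Owns a local random number generator seeded at construction.

    Using np.random.default_rng(seed) keeps the global NumPy state
    untouched and makes results reproducible for any fixed seed,
    even if the user never thinks about seeding.
    """

    def __init__(self, seed=None):
        self._rng = np.random.default_rng(seed)

    def sample_indices(self, n):
        # Draw n indices with replacement from range(n)
        return self._rng.integers(0, n, size=n)
```

Two samplers built with the same seed produce identical index sequences, while code elsewhere that uses `np.random` is unaffected.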

I also ran pre-commit and built the documentation, but you might want to double-check it. It is ready for you to review again.

mjwen commented 1 year ago

We have CI checking the formatting, using the same pre-commit configs. So as long as it passes the checks, it is good.

Everything is good now. I've merged it. Thanks!

yonatank93 commented 1 year ago

Thanks for your help.