joeloskarsson opened this issue 3 weeks ago
These are some very important considerations. I myself have angered some colleagues by making old checkpoints unusable. Now I am also looking at #49, which would introduce much more flexibility for the user with respect to model choices. Mostly for that reason, and because I don't think we have the human-power to assure backwards compatibility, I am leaning towards option 1. Maybe in the future, with a more stable repo + more staff, we can implement 3? What I would do now is very solid logging with wandb:
With such information every checkpoint should be usable for a long time. Maybe I am very much overestimating how much time 3 would require. If that is the case I will gladly change my opinion.
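As a sketch of what such logging could record (the exact fields are an assumption, not specified above), one could store the code version and all model options together with each run:

```python
# Hypothetical sketch: metadata worth logging (e.g. as wandb config) so that
# any old checkpoint can later be loaded with the exact code version and
# options it was trained with. Field names here are illustrative assumptions.

def build_run_metadata(neural_lam_commit, model_class, model_kwargs):
    """Collect run information that pins down how to re-load a checkpoint."""
    return {
        "neural_lam_commit": neural_lam_commit,  # git SHA of the code used
        "model_class": model_class,              # which model was trained
        "model_kwargs": model_kwargs,            # every option affecting weights
    }

metadata = build_run_metadata(
    neural_lam_commit="9d558d1",
    model_class="GraphLAM",
    model_kwargs={"hidden_dim": 64, "processor_layers": 4},
)
# With wandb this could then be stored via wandb.config.update(metadata)
```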
I am a bit unsure myself about how much work it would really be. As long as we only rename members or change the hierarchy of nn.Modules, it just boils down to renaming keys in the state dict. We already have an implementation of this here:
https://github.com/mllam/neural-lam/blob/9d558d1f0d343cfe6e0babaa8d9e6c45b852fe21/neural_lam/models/ar_model.py#L584-L596
It just has to be generalized to handle more than `g2m_gnn.grid_mlp.0.weight`.
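A generalized version could be a small helper that applies a table of key renames to the whole state dict (a sketch; the rename table shown is hypothetical, not an actual rename in neural-lam):

```python
# Hypothetical sketch of generalized state-dict key renaming for old
# checkpoints. Works on plain dicts, so it applies directly to the
# "state_dict" entry of a PyTorch Lightning checkpoint.

def rename_state_dict_keys(state_dict, renames):
    """Return a new state dict with old key prefixes replaced by new ones."""
    new_state_dict = {}
    for key, value in state_dict.items():
        new_key = key
        for old_prefix, new_prefix in renames.items():
            if new_key.startswith(old_prefix):
                new_key = new_prefix + new_key[len(old_prefix):]
        new_state_dict[new_key] = value
    return new_state_dict

# Example: a (hypothetical) member rename g2m_gnn -> g2m_edge_gnn
old_sd = {"g2m_gnn.grid_mlp.0.weight": 1.0, "encoder.0.bias": 2.0}
new_sd = rename_state_dict_keys(old_sd, {"g2m_gnn.": "g2m_edge_gnn."})
# new_sd keys: "g2m_edge_gnn.grid_mlp.0.weight", "encoder.0.bias"
```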
Where things can get tricky is if we reorder input features or change the dimensionality of something. But thinking about this a bit more now, I realize: if we change a dimensionality somewhere (e.g. an MLP that maps latent_dim -> 2 x latent_dim -> latent_dim instead of latent_dim -> latent_dim -> latent_dim), then we can anyhow not re-use any old checkpoint, because there are new entries in the weight matrices. If we were to reduce some dimensionality we would also not be able to convert an old checkpoint, as we would get a different output from that part of the model. But these changes can to a very large extent be made optional, and then you just have to set these options correctly for old checkpoints to be compatible. This points towards keeping 1 as an option for such cases, rather than saying "we will never introduce a change that breaks checkpoints in a way that they can not be converted to the new version".
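The point about new entries in the weight matrices can be made concrete by just comparing weight shapes (a minimal illustration, not neural-lam code): an nn.Linear stores its weight as (out_features, in_features), so widening a hidden layer changes the shapes on both sides of it:

```python
# Illustration: why changing a hidden dimensionality breaks checkpoint re-use.
# An MLP latent_dim -> latent_dim -> latent_dim stores two weight matrices of
# shape (latent_dim, latent_dim); widening the hidden layer to 2 * latent_dim
# changes both shapes, so the old tensors cannot fill the new matrices.

latent_dim = 64

old_shapes = [(latent_dim, latent_dim), (latent_dim, latent_dim)]
new_shapes = [(2 * latent_dim, latent_dim), (latent_dim, 2 * latent_dim)]

compatible = all(o == n for o, n in zip(old_shapes, new_shapes))
print(compatible)  # False: every weight matrix has new entries
```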
Background
As we make more changes to the code there will be points where checkpoints from saved models cannot be directly loaded in a newer version of neural-lam. This happens in particular if we start changing variable names of nn.Module attributes or the overall structure of the model classes. It would be good to have a policy for how we want to handle such breaking changes. This issue is for discussing this.
Proposals
I see three main options:
ARModel: https://github.com/mllam/neural-lam/blob/9d558d1f0d343cfe6e0babaa8d9e6c45b852fe21/neural_lam/models/ar_model.py#L576-L596

Considerations for point 2 and 3
My view

Keeping all of this in `on_load_checkpoint` would get unnecessarily complicated, and I'd rather just do the conversion once and have a set of new checkpoint files. It is also easy to do both 2 and 3: if you try to load an old checkpoint, you just convert it before loading.

Tagging @leifdenby and @sadamov to get your input.
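Combining 2 and 3 as described could look roughly like this (a sketch under the assumption of some version field in the checkpoint dict, here called `format_version`, which is not part of the current code):

```python
# Hypothetical sketch of combining options 2 and 3: loading detects an old
# checkpoint and runs the one-off conversion before use, so the conversion
# logic lives in one place instead of in on_load_checkpoint.

CURRENT_FORMAT = 2  # assumed version number for the current checkpoint layout

def convert_checkpoint(ckpt):
    """One-off conversion of an old checkpoint dict to the current format."""
    converted = dict(ckpt)
    # ... rename state-dict keys, adjust options, etc. ...
    converted["format_version"] = CURRENT_FORMAT
    return converted

def load_checkpoint(ckpt):
    """Convert on load only if needed; new checkpoints pass through as-is."""
    if ckpt.get("format_version", 1) < CURRENT_FORMAT:
        ckpt = convert_checkpoint(ckpt)
    return ckpt

old = {"state_dict": {}, "format_version": 1}
print(load_checkpoint(old)["format_version"])  # 2
```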