arushi-08 opened 1 year ago
I want to resume training my model from a checkpoint file (*.pt), but I am facing a pykeen.training.training_loop.CheckpointMismatchError. Full stack trace:
I have realised that the issue is a checksum mismatch, i.e., the checkpoint file was created with a different configuration: https://github.com/pykeen/pykeen/blob/d1222b7c18d494290d285525b76c3f22e30db467/src/pykeen/training/training_loop.py#L1182-L1188
However, I am not sure how to load the same configuration as the one stored in the checkpoint file. I feel this is what the "Word of Caution and Possible Errors" section of the documentation (https://pykeen.readthedocs.io/en/stable/tutorial/checkpoints.html#word-of-caution-and-possible-errors) highlights, but it is still unclear what the next steps are. How do we resume training from the previous checkpoint?
Hi @arushi-08,
the checkpoint files are "just" normal torch archives, i.e., you can load them via torch.load, as done in the code snippet you linked (or more precisely, just one line above; I have updated your text above to include it). The checksum was calculated from the string representations of the model and the optimizer, cf. https://github.com/pykeen/pykeen/blob/d1222b7c18d494290d285525b76c3f22e30db467/src/pykeen/training/training_loop.py#L203-L209
I would suggest that you load the checkpoint file via torch.load and carefully compare it with your configuration. If you still think that everything is sane, you can manually override the checkpoint file's checksum and write the result to a new checkpoint file:
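To recompute the checksum, here is a minimal sketch based on the linked lines; the use of md5 and the names model and optimizer (the components of the freshly set-up training loop) are assumptions on my part:

import hashlib

# Assumption: the checksum is the md5 digest over the string
# representations of the model and the optimizer, cf. the linked lines.
h = hashlib.md5()
h.update(str(model).encode("utf-8"))
h.update(str(optimizer).encode("utf-8"))
checksum = h.hexdigest()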
import torch

d = torch.load(path)  # path: the existing checkpoint file
d["checksum"] = checksum  # overwrite with the recomputed checksum
torch.save(d, new_path)  # write the patched checkpoint to a new file
Hi,
I was having the same error. I believe the problem arises when using a learning rate scheduler object from PyTorch: in the scheduler's constructor, whenever last_epoch=-1, the initial_lr of the optimizer is updated. This makes str(self.optimizer).encode("utf-8") differ, given that we have not yet reloaded the optimizer nor the scheduler. I believe the issue can be solved by moving the checksum comparison to the end of the method.
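To make the mechanism concrete, here is a minimal, self-contained sketch (the optimizer and scheduler choices are arbitrary, just for illustration): constructing a scheduler with the default last_epoch=-1 writes initial_lr into the optimizer's param groups, so the optimizer's string representation changes.

import torch

# A throwaway parameter so the optimizer has something to manage.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.Adam(params, lr=0.01)
before = str(optimizer)

# With the default last_epoch=-1, the scheduler's constructor sets
# 'initial_lr' in every param group of the optimizer.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
after = str(optimizer)

assert before != after  # 'initial_lr' now shows up in the optimizer's repr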
@pablo-sanchez-sony, would you mind opening a PR with the changes you suggest?
Sure!
I am facing this checkpoint mismatch error in the same training loop for the RotatE KGE model. The following log messages show that rotate-checkpoint.pt is created at an early epoch, and that after 30 epochs the loop tries to read from this checkpoint and raises the error:
INFO:pykeen.training.training_loop:=> no checkpoint found at '/afs/ars539/.data/pykeen/checkpoints/rotate-checkpoint.pt'. Creating a new file.
Training epochs on cuda:0: 2%|▏ | 9/500 [07:47<6:22:12, 46.71s/epoch, loss=0.123, prev_loss=0.123]
INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
Saved model weights to /afs/ars539/.data/pykeen/checkpoints/best-model-weights-0aa7f269-26c4-4e84-8d47-a55fc43911c6.pt
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 10.
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 10.
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 20.
INFO:pykeen.stoppers.early_stopping:Stopping early at epoch 30. The best result 0.14622531740871292 occurred at epoch 10.
INFO:pykeen.stoppers.early_stopping:Re-loading weights from best epoch from /afs/ars539/.data/pykeen/checkpoints/best-model-weights-0aa7f269-26c4-4e84-8d47-a55fc43911c6.pt
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 30.
INFO:pykeen.evaluation.evaluator:Evaluation took 547.88s seconds
Best is trial 0 with value: 0.06680432707071304.
INFO:pykeen.pipeline.api:loaded random seed 42 from checkpoint.
INFO:pykeen.pipeline.api:Using device: None
INFO:pykeen.stoppers.early_stopping:Inferred checkpoint path for best model weights: /afs/ars539/.data/pykeen/checkpoints/best-model-weights-ea7a231a-d250-422a-a747-49f6b3a70e2f.pt
INFO:pykeen.training.training_loop:=> loading checkpoint '/afs/ars539/.data/pykeen/checkpoints/rotate-checkpoint.pt'
[W 2023-09-22 18:37:16,297] Trial 1 failed with parameters: {'model.embedding_dim': 200, 'loss.margin': 1.0271124464019343, 'optimizer.lr': 0.026733931043720773, 'negative_sampler.num_negs_per_pos': 3, 'training.batch_size': 64} because of the following error: CheckpointMismatchError("The checkpoint file '/afs/ars539/.data/pykeen/checkpoints/rotate-checkpoint.pt' that was provided already exists, but seems to be from a different training loop setup.").
My training script is:
result = hpo_pipeline(
    study_name='rotate_hpo',
    training=training,
    testing=testing,
    validation=validation,
    pruner="MedianPruner",
    sampler="tpe",
    model='RotatE',
    model_kwargs={
        "random_seed": 42,
    },
    model_kwargs_ranges=dict(
        embedding_dim=dict(type=int, low=100, high=300, q=100),
    ),
    negative_sampler_kwargs_ranges=dict(
        num_negs_per_pos=dict(type=int, low=1, high=100),
    ),
    stopper='early',
    n_trials=30,
    training_loop="sLCWA",
    training_kwargs=dict(
        num_epochs=500,
        checkpoint_name='rotate-checkpoint.pt',
        checkpoint_frequency=10,
    ),
    evaluator_kwargs={"filtered": True, "batch_size": 128},
)
Kindly suggest how to resolve this: I am not explicitly trying to resume training; rather, the hpo_pipeline itself is reloading from the checkpoint.
When setting a checkpoint name, e.g.
checkpoint_name='rotate-checkpoint.pt',
it seems to be used for all trials => the second trial thinks it is a continuation of the first one, but the model hyperparameters do not match.
Here is a smaller script to reproduce the error:
from pykeen.hpo import hpo_pipeline

result = hpo_pipeline(
    study_name="rotate_hpo",
    dataset="nations",
    model="RotatE",
    model_kwargs_ranges=dict(
        embedding_dim=dict(type=int, low=8, high=24, q=8),
    ),
    stopper="early",
    n_trials=2,
    training_loop="sLCWA",
    training_kwargs=dict(
        num_epochs=2,
        checkpoint_name="rotate-checkpoint.pt",
        checkpoint_frequency=1,
    ),
)
@arushi-08, what is your use case for providing a checkpoint name? Do you want to save each trial's model? If so, there is an explicit save_model_directory parameter for that, which takes care of creating one sub-directory per trial.
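As a minimal sketch of that alternative (reusing the reproduction script from above; the directory name is arbitrary): drop the shared checkpoint_name and pass save_model_directory instead, so each trial's model lands in its own sub-directory.

from pykeen.hpo import hpo_pipeline

result = hpo_pipeline(
    study_name="rotate_hpo",
    dataset="nations",
    model="RotatE",
    n_trials=2,
    training_loop="sLCWA",
    training_kwargs=dict(num_epochs=2),
    # One sub-directory per trial is created under this path.
    save_model_directory="rotate_hpo_models",
)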
I have opened a small PR (#1324) to fail fast on the first trial with an error message about how to fix it 🙂
@pablo-sanchez-sony , would this resolve your issue, too?