mlverse / luz

Higher Level API for torch
https://mlverse.github.io/luz/

Cannot resume from checkpoint with a callback #149

Open jiho opened 2 months ago

Hi,

I have a simple CNN that looks like this:

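library("torch")
library("torchvision")

# n_classes (the number of image classes) is assumed to be defined beforehand
mobnet <- nn_module(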
  initialize = function() {
    # get MobileNet
    self$model <- model_mobilenet_v2(pretrained=TRUE)
    # replace MobileNet's classifier with our own
    self$model$classifier <- nn_sequential(
      nn_dropout(p=0.2),
      nn_linear(in_features=1280, out_features=64),
      nn_relu(),
      nn_dropout(p=0.2),
      nn_linear(in_features=64, out_features=64),
      nn_relu(),
      nn_linear(in_features=64, out_features=n_classes)
    )
  },
  forward = function(x) {
    self$model(x)
  }
)

I have training and validation datasets and dataloaders, built with image_folder_dataset. I train with:

library("luz")
checkpoint <- luz_callback_model_checkpoint(
  path = "checkpoints/",
  monitor = "train_loss"
)
resume <- luz_callback_resume_from_checkpoint(path = "checkpoints/")
mobnet_fit <- mobnet |>
  setup(
    loss = nn_cross_entropy_loss(),
    optimizer = optim_adam,
    metrics = list(luz_metric_accuracy())
  ) |>
  set_opt_hparams(lr = 0.003) |>
  fit(
    dl_train,
    epochs = 20,
    valid_data = dl_valid,
    callbacks = list(resume, checkpoint)
  )

This correctly saves checkpoints, but if I interrupt training before the end and re-run the last portion of the code (from mobnet_fit <- ...), training systematically restarts from scratch. What am I doing wrong?
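For comparison, here is a minimal, self-contained sketch of an interrupt-then-resume cycle. It uses a different callback, luz_callback_auto_resume(), and assumes the installed luz version provides it. It is not a diagnosis of the behaviour described above. The synthetic tensors stand in for dl_train, and the interrupt_once callback is a hypothetical helper whose only job is to simulate a crash.

```r
library(torch)
library(luz)

# Synthetic regression data standing in for the real dataloader (assumption).
x <- torch_randn(200, 10)
y <- torch_randn(200, 1)

model <- nn_linear |>
  setup(optimizer = optim_sgd, loss = nnf_mse_loss) |>
  set_hparams(in_features = 10, out_features = 1) |>
  set_opt_hparams(lr = 0.01)

# Hypothetical callback: raises an error once, at the end of epoch 3.
interrupt_once <- luz_callback(
  "interrupt_once",
  failed = FALSE,
  on_epoch_end = function() {
    if (ctx$epoch == 3 && !self$failed) {
      self$failed <- TRUE
      stop("Simulated interruption at epoch 3")
    }
  }
)

# Training state is written to this file while fitting.
state_path <- tempfile(fileext = ".pt")
autoresume <- luz_callback_auto_resume(path = state_path)
interrupt <- interrupt_once()

# First run: stops with the simulated error.
first_run <- try(
  model |> fit(
    list(x, y),
    epochs = 5,
    callbacks = list(autoresume, interrupt),
    verbose = FALSE
  ),
  silent = TRUE
)

# Second run: the same call, re-executed after the failure.
results <- model |> fit(
  list(x, y),
  epochs = 5,
  callbacks = list(autoresume, interrupt),
  verbose = FALSE
)

get_metrics(results)
```

Comparing the epochs reported by get_metrics(results) with the epoch at which the first run stopped shows whether the second run continued or started over.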

PS: more generally, is there a forum, mailing list, or similar place where such questions could be asked? This may not be a bug but rather a misunderstanding on my part.