marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

Model checkpoint bug #95

Closed basavaraj29 closed 2 years ago

basavaraj29 commented 2 years ago
  1. I think the earlier code logged all optimizers to the model_state_file as flat top-level keys, as shown below, which led to optimizer entries being overwritten.
    num_steps: <>
    op1_attr1: <val>
    ...
    op2_attr1: <val>

Changed it to log each optimizer under its own index, as follows.

"0" : {
  num_steps: <>
  op1_attr1: <val>
},
"1" : {
  num_steps: <>
  op2_attr1: <val>
}
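
To illustrate why the flat layout loses data, here is a minimal Python sketch. It is not the Marius implementation: the function names `save_flat` and `save_keyed`, and the `learning_rate` attribute, are hypothetical and only stand in for the `opN_attr` placeholders above.

```python
def save_flat(optimizers):
    """Flat layout: every optimizer writes into the same top-level dict."""
    state = {}
    for opt in optimizers:
        # Keys shared between optimizers (e.g. "num_steps") are overwritten
        # by whichever optimizer is written last.
        state.update(opt)
    return state


def save_keyed(optimizers):
    """Keyed layout: one sub-dict per optimizer, keyed by its index as a string."""
    return {str(i): dict(opt) for i, opt in enumerate(optimizers)}


# Hypothetical optimizer states used only for this illustration.
opt_states = [
    {"num_steps": 100, "learning_rate": 0.1},
    {"num_steps": 250, "learning_rate": 0.01},
]

flat = save_flat(opt_states)    # only the last optimizer's values survive
keyed = save_keyed(opt_states)  # both optimizers are preserved under "0" and "1"
```

With the keyed layout, each optimizer's `num_steps` can be restored independently on load.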
  2. Another bug: in the load phase, optimizers were being pushed into the queue twice, once as part of initModelFromConfig and again in Model::load. Changed Model::load to work on the previously pushed optimizer objects instead.

  3. If resume_training is set to true, the existing model_dir is overwritten with the new model. If resume_from_checkpoint is set, the new model is written to model_dir (either user-specified or a newly created model_x, where x belongs to [0,10]).

  4. Tested resume_training with and without model_dir/resume_from_checkpoint specified.
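
The double-push bug in the load phase can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the actual C++ code: the `Optimizer` and `Model` classes and the methods `init_from_config`, `load_buggy`, and `load` are hypothetical stand-ins for initModelFromConfig and Model::load.

```python
class Optimizer:
    """Hypothetical stand-in for an optimizer with restorable state."""

    def __init__(self, name):
        self.name = name
        self.num_steps = 0

    def load_state(self, state):
        self.num_steps = state["num_steps"]


class Model:
    def __init__(self):
        self.optimizers = []

    def init_from_config(self, names):
        # Construction from the config is the only place optimizers are appended.
        for name in names:
            self.optimizers.append(Optimizer(name))

    def load_buggy(self, saved):
        # Old behavior (as described): new optimizers are appended on top of
        # the ones already created from the config, so each appears twice.
        for key in sorted(saved, key=int):
            opt = Optimizer(f"loaded_{key}")
            opt.load_state(saved[key])
            self.optimizers.append(opt)

    def load(self, saved):
        # Fixed behavior: restore state into the optimizers that already exist,
        # looking each one up by its index in the keyed state file.
        for i, opt in enumerate(self.optimizers):
            opt.load_state(saved[str(i)])
```

For a saved state with two entries, `load_buggy` leaves four optimizers on the model, while `load` keeps two and restores each one's step count.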

shivaram commented 2 years ago

Also, a minor nit: it would be good to write a more descriptive PR title and description. For example, in this case: what is the bug, when is it triggered, and how did you fix it?