ZQS1943 closed this issue 4 years ago.
So sorry. I manually deleted all the checkpoints and ran the training again, and the loss dropped. I don't know why.
I recently ran into a similar problem. Looking at your logs, it might be the same issue: when the checkpoint from the first step (model_checkpoint-00000001) is loaded, the model just won't optimize. Later checkpoints work fine, though. This is really weird...
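To be concrete, by "loaded" I mean roughly the usual PyTorch-style restore pattern below. This is only a simplified sketch: the model/optimizer objects, the file name, and the checkpoint keys ("model", "optimizer", "step") are placeholders, not this repo's actual code.

```python
import torch
import torch.nn as nn

# Placeholder model/optimizer just to make the sketch runnable; not this repo's classes.
model = nn.Linear(16, 1)
optimizer = torch.optim.Adam(model.parameters())

checkpoint = torch.load("model_checkpoint-00000001.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])          # restore weights
optimizer.load_state_dict(checkpoint["optimizer"])  # restore optimizer state as well
start_step = checkpoint.get("step", 0)              # resume from the saved step
```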
Also, I modified the code to support multi-GPU training, and when I set batch accumulation to 1 (no accumulation, i.e. one big batch of 24 instead of 4 accumulations of 6), the model won't optimize. Setting it to 6x4 or 12x2 works.
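For clarity, by "batch accumulation" I mean the standard gradient-accumulation loop, roughly like this minimal PyTorch-style sketch (dummy model, loss, and data just to make it runnable; none of these names come from the repo):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the repo's model and data loader.
model = nn.Linear(16, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
data = [(torch.randn(6, 16), torch.randn(6, 1)) for _ in range(24)]  # micro-batches of 6

accum_steps = 4  # 4 micro-batches of 6 ~ one effective batch of 24
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    # Scale the loss so the accumulated gradient matches the average over the big batch.
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```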
I'm tearing my hair out trying to find the cause of this issue, but to no avail.
Hi,
I'm trying to run your model, but the loss does not drop during training.
Here is part of the loss log:
The loss remains unchanged across the two eval_on_train runs, as if the model is not being updated.
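In case it is useful, a quick way to confirm this would be to snapshot the parameters before an optimizer step and compare afterwards (a hypothetical PyTorch-style check, not code from this repo; `model`, `optimizer`, and `loss` in the usage comment are placeholders):

```python
import torch

def params_changed(model: torch.nn.Module, before: dict) -> bool:
    """Return True if any parameter differs from the snapshot taken before the step."""
    return any(not torch.equal(p.detach().cpu(), before[name])
               for name, p in model.named_parameters())

# Usage sketch around one training step:
# before = {name: p.detach().cpu().clone() for name, p in model.named_parameters()}
# loss.backward(); optimizer.step()
# print("weights updated:", params_changed(model, before))
```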
I did not use Docker; I installed the related dependencies manually instead. Could this be the cause of the error? Thank you!