ghaddarAbs opened this issue 3 years ago
This never happened to us before, so I am not quite sure what is happening. A few suggestions:
(1) Can you plot the loss curve and see if it is going down before it collapses? (See the rough sketch after this list.)
(2) If you run it multiple times, does it always happen?
(3) Another option is to run without fp16 and see if it runs successfully.
(4) Maybe you could try lowering the learning rate and see if it works.
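For (1), here is a rough sketch of what I mean by logging the loss so you can plot it and catch the first non-finite value (generic PyTorch, not our actual training loop; model, loss_fn, loader, and optimizer are placeholders):

```python
import math

def train_with_loss_logging(model, loss_fn, loader, optimizer, log_path="loss_curve.tsv"):
    """Log the loss at every step so the curve can be plotted afterwards,
    and stop at the first non-finite value to localize the collapse."""
    with open(log_path, "w") as f:
        for step, (inputs, targets) in enumerate(loader):
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            f.write(f"{step}\t{loss.item()}\n")
            if not math.isfinite(loss.item()):
                raise RuntimeError(f"loss became {loss.item()} at step {step}")
            loss.backward()
            optimizer.step()
```

Plotting the resulting TSV should show whether the loss was trending down before the collapse.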
Please keep us updated.
Thanks
Thank you @intersun for your response.
(1) I will plot the loss curve to see what happened.
(2) Yes, we tried it multiple times and it always happens.
(3) We ran it in fp32 (without fp16) and still get a lot of Inf/NaN in the loss before it crashes.
(4) We tried a lower learning rate, but it didn't work.
Also, I get these logs just before loading the data (pretrain.py); I don't know if they are related to the issue we are getting.
[1,0]<stderr>:Weights of BertEncoder not initialized from pretrained model: ['encode_proj.0.weight', 'encode_proj.0.bias', 'encode_proj.2.weight', 'encode_proj.2.bias', 'encode_proj.3.weight', 'encode_proj.3.bias']
[1,0]<stderr>:Weights from pretrained model not used in BertEncoder: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
(the same two warnings are printed by ranks [1,1], [1,2], and [1,3])
@ChenRocks Can you help @ghaddarAbs verify this ZeroDivisionError? Did this also happen in UNITER pretraining? It never happened in my pre-training :(
You should not see the apex loss scaler reducing the loss scale to less than 1.
[1,0]<stdout>:Gradient overflow. Skipping step, loss scaler 5 reducing loss scale to 4.3601508761683463e-106
The training probably went wrong way earlier than the ZeroDivisionError.
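If you want to localize where it first goes wrong, one option (just a generic suggestion, not something pretrain.py already does) is PyTorch's anomaly detection, which makes the backward pass raise on the op that first produces NaN/Inf:

```python
import torch

# Enable once before the training loop. It slows training down noticeably,
# so only keep it on while debugging the first few hundred steps.
torch.autograd.set_detect_anomaly(True)

# With this enabled, a backward pass that produces NaN/Inf raises an error
# whose traceback points at the forward operation responsible, instead of the
# loss scaler silently shrinking the scale step after step.
```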
The data downloaded from UNITER should be compatible with this repo; the only difference is the name change. In UNITER/LightningDOT you should never see this loss scaler error if you follow the original code/config. In my other projects, I have seen this issue because I used an fp16-unsafe layer (nn.BCELoss), and changing it to the fp16-safe variant (nn.BCEWithLogitsLoss) fixed it.
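For reference, this is the kind of swap I mean (generic PyTorch, not code from this repo):

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 1)               # raw, unbounded model outputs
targets = torch.empty(8, 1).uniform_()   # binary targets in [0, 1]

# fp16-unsafe pattern: sigmoid followed by BCELoss. Under half precision the
# probabilities can round to exactly 0 or 1, the log() inside BCELoss returns
# inf, and that propagates as NaN gradients.
unsafe_loss = nn.BCELoss()(torch.sigmoid(logits), targets)

# fp16-safe variant: BCEWithLogitsLoss fuses the sigmoid and the log using a
# log-sum-exp trick, so it stays finite even in half precision.
safe_loss = nn.BCEWithLogitsLoss()(logits, targets)
```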
Hi,
Thank you very much for the great work, and for making your code publicly available. I am trying to run the code to reproduce the results; however, the pre-training datasets are missing from the download script. Is it possible to upload the pretraining data, similar to what you did for the fine-tuning ones last week?
In fact, I tried to use the coco and vg datasets distributed by the UNITER code, while adjusting the train/val datasets in ./config/pretrain-alldata-base.json as follows:
Surprisingly, the pretraining code worked, but I ran into another issue: gradient overflow at the beginning of training, and then this error at 3%: ZeroDivisionError: float division by zero
Here are some logs of the gradient overflow, and here is the log of the error:
I understand why this error is happening: the loss scale gradually gets smaller until it becomes 0. However, I can't figure out how to fix it. I looked at the issues in apex, and it seems that some bad input is causing the problem, so my conclusion was that I am not using the correct pretraining dataset.
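To double-check my understanding of the mechanism, here is a toy simulation of dynamic loss scaling under constant overflow (my own sketch, not apex code): the scale is halved on every overflowing step until it underflows to 0.0, and the next division by the scale then fails exactly like in my run.

```python
scale = 2.0 ** 16            # typical starting loss scale for dynamic scaling
steps = 0
while scale > 0.0:
    steps += 1
    scale *= 0.5             # each "Gradient overflow. Skipping step ..." halves the scale
print(steps, scale)          # after ~1090 consecutive overflows the scale underflows to 0.0

try:
    unscaled_grad = 1.0 / scale   # unscaling the gradients divides by the scale
except ZeroDivisionError as e:
    print("same failure mode as in my pretraining run:", e)
```

In a healthy run the scale recovers after a few skipped steps, so a scale around 1e-106 already means essentially every step is overflowing.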
Can you please share the pretraining data?
Thanks