ghaddarAbs opened this issue 3 years ago
This never happened to us before, so I am not quite sure what is happening. A few suggestions:
(1) Can you plot the loss curve and see if it is going down before it collapses? (See the rough sketch after this list.)
(2) If you run it multiple times, does it always happen?
(3) Another option is to run without fp16 and see if it runs successfully.
(4) Maybe you could try lowering the learning rate and see if it works.
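For (1), here is a rough sketch of what I mean by logging the loss so you can plot it and catch the first non-finite value (generic PyTorch, not our actual training loop; model, loss_fn, loader, and optimizer are placeholders):

```python
import math

def train_with_loss_logging(model, loss_fn, loader, optimizer, log_path="loss_curve.tsv"):
    """Log the loss at every step so the curve can be plotted afterwards,
    and stop at the first non-finite value to localize the collapse."""
    with open(log_path, "w") as f:
        for step, (inputs, targets) in enumerate(loader):
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            f.write(f"{step}\t{loss.item()}\n")
            if not math.isfinite(loss.item()):
                raise RuntimeError(f"loss became {loss.item()} at step {step}")
            loss.backward()
            optimizer.step()
```

Plotting the resulting TSV should show whether the loss was trending down before the collapse.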
Please keep us updated.
Thanks
Thank you @intersun for your response.
(1) I will plot the loss curve to see what happened.
(2) Yes, we tried it multiple times and it always happens.
(3) We ran it in fp32 (without fp16) and still get a lot of Inf/NaN in the loss before it crashes.
(4) We tried a lower learning rate, but it didn't work.
Also, I get these logs just before loading the data (pretrain.py); I don't know if they are related to the issue we are getting.
[1,0]<stderr>:Weights of BertEncoder not initialized from pretrained model: ['encode_proj.0.weight', 'encode_proj.0.bias', 'encode_proj.2.weight', 'encode_proj.2.bias', 'encode_proj.3.weight', 'encode_proj.3.bias']
[1,0]<stderr>:Weights from pretrained model not used in BertEncoder: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
(the same two warnings are printed by ranks [1,1], [1,2], and [1,3])
@ChenRocks Can you help @ghaddarAbs verify this ZeroDivisionError? Did this also happen in UNITER pretraining? It never happened in my pre-training :(
You should not see the apex loss scaler reducing the loss scale to less than 1.
[1,0]<stdout>:Gradient overflow. Skipping step, loss scaler 5 reducing loss scale to 4.3601508761683463e-106
The training probably went wrong way earlier than the ZeroDivisionError.
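If you want to localize where it first goes wrong, one option (just a generic suggestion, not something pretrain.py already does) is PyTorch's anomaly detection, which makes the backward pass raise on the op that first produces NaN/Inf:

```python
import torch

# Enable once before the training loop. It slows training down noticeably,
# so only keep it on while debugging the first few hundred steps.
torch.autograd.set_detect_anomaly(True)

# With this enabled, a backward pass that produces NaN/Inf raises an error
# whose traceback points at the forward operation responsible, instead of the
# loss scaler silently shrinking the scale step after step.
```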
The data downloaded from UNITER should be compatible with this repo; the only difference is the name change. In UNITER/LightningDOT you should never see this loss scaler error if you follow the original code/config. In my other projects, I have seen this issue because I used an fp16-unsafe layer (nn.BCELoss), and changing it to the fp16-safe variant (nn.BCEWithLogitsLoss) fixed it.
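For reference, this is the kind of swap I mean (generic PyTorch, not code from this repo):

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 1)               # raw, unbounded model outputs
targets = torch.empty(8, 1).uniform_()   # binary targets in [0, 1]

# fp16-unsafe pattern: sigmoid followed by BCELoss. Under half precision the
# probabilities can round to exactly 0 or 1, the log() inside BCELoss returns
# inf, and that propagates as NaN gradients.
unsafe_loss = nn.BCELoss()(torch.sigmoid(logits), targets)

# fp16-safe variant: BCEWithLogitsLoss fuses the sigmoid and the log using a
# log-sum-exp trick, so it stays finite even in half precision.
safe_loss = nn.BCEWithLogitsLoss()(logits, targets)
```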
Hi,
Thank you very much for the great work, and for making your code publicly available. I am trying to run the code to reproduce the results; however, the pre-training datasets are missing from the download script. Is it possible to upload the pretraining data, similar to what you did for the fine-tuning ones last week?
In fact, I tried to use the coco and vg datasets distributed by the UNITER code, while adjusting the train/val datasets in ./config/pretrain-alldata-base.json as follows:
Surprisingly, the pretraining code worked, but I ran into another issue: gradient overflow at the beginning of training, and then this error at 3%: ZeroDivisionError: float division by zero
Here are some logs of the gradient overflow, and here is the log of the error:
I understand why this error is happening: the loss scale gradually gets smaller until it becomes 0. However, I can't figure out how to fix it. I looked at the issues in apex, and it seems that some bad input is causing the problem, so my conclusion was that I am not using the correct pretraining dataset.
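To double-check my understanding of the mechanism, here is a toy simulation of dynamic loss scaling under constant overflow (my own sketch, not apex code): the scale is halved on every overflowing step until it underflows to 0.0, and the next division by the scale then fails exactly like in my run.

```python
scale = 2.0 ** 16            # typical starting loss scale for dynamic scaling
steps = 0
while scale > 0.0:
    steps += 1
    scale *= 0.5             # each "Gradient overflow. Skipping step ..." halves the scale
print(steps, scale)          # after ~1090 consecutive overflows the scale underflows to 0.0

try:
    unscaled_grad = 1.0 / scale   # unscaling the gradients divides by the scale
except ZeroDivisionError as e:
    print("same failure mode as in my pretraining run:", e)
```

In a healthy run the scale recovers after a few skipped steps, so a scale around 1e-106 already means essentially every step is overflowing.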
Can you please share the pretraining data?
Thanks