google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Apache License 2.0

No decreasing loss when pre-train for xxlarge #29

Closed jwkim912 closed 4 years ago

jwkim912 commented 4 years ago

Hi, I'm pre-training the xxlarge model on my own language. I trained on a TPU v2-256, but the loss is not decreasing. Below is the training log.

```
I1211 08:56:02.464132 140024623753024 tpu_estimator.py:1201] Found small feature: next_sentence_labels [2, 1]
I1211 08:56:02.510196 140024623753024 run_pretraining.py:457] <DatasetV1Adapter shapes: {input_ids: (2, 512), input_mask: (2, 512), masked_lm_ids: (2, 77), masked_lm_positions: (2, 77), masked_lm_weights: (2, 77), next_sentence_labels: (2, 1), segment_ids: (2, 512)}, types: {input_ids: tf.int32, input_mask: tf.int32, masked_lm_ids: tf.int32, masked_lm_positions: tf.int32, masked_lm_weights: tf.float32, next_sentence_labels: tf.int32, segment_ids: tf.int32}>
[the "Found small feature: next_sentence_labels [2, 1]" line repeats many more times]
2019-12-11 08:56:02.673414: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2019-12-11 08:56:02.673472: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2019-12-11 08:56:02.673496: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (instance-2): /proc/driver/nvidia/version does not exist
I1211 08:56:02.704437 140024623753024 run_pretraining.py:150] Features
  name = input_ids, shape = (2, 512)
  name = input_mask, shape = (2, 512)
  name = masked_lm_ids, shape = (2, 77)
  name = masked_lm_positions, shape = (2, 77)
  name = masked_lm_weights, shape = (2, 77)
  name = next_sentence_labels, shape = (2, 1)
  name = segment_ids, shape = (2, 512)
```

```
I1211 08:56:04.239879 140024623753024 run_pretraining.py:220] Trainable Variables
  name = bert/embeddings/word_embeddings:0, shape = (33001, 128)
  name = bert/embeddings/token_type_embeddings:0, shape = (2, 128)
  name = bert/embeddings/position_embeddings:0, shape = (512, 128)
  name = bert/embeddings/LayerNorm/beta:0, shape = (128,)
  name = bert/embeddings/LayerNorm/gamma:0, shape = (128,)
  name = bert/encoder/embedding_hidden_mapping_in/kernel:0, shape = (128, 4096)
  name = bert/encoder/embedding_hidden_mapping_in/bias:0, shape = (4096,)
  name = bert/encoder/transformer/group_0/inner_group_0/attention_1/self/query/kernel:0, shape = (4096, 4096)
  name = bert/encoder/transformer/group_0/inner_group_0/attention_1/self/query/bias:0, shape = (4096,)
  name = bert/encoder/transformer/group_0/inner_group_0/attention_1/self/key/kernel:0, shape = (4096, 4096)
  name = bert/encoder/transformer/group_0/inner_group_0/attention_1/self/key/bias:0, shape = (4096,)
  name = bert/encoder/transformer/group_0/inner_group_0/attention_1/self/value/kernel:0, shape = (4096, 4096)
  name = bert/encoder/transformer/group_0/inner_group_0/attention_1/self/value/bias:0, shape = (4096,)
  name = bert/encoder/transformer/group_0/inner_group_0/attention_1/output/dense/kernel:0, shape = (4096, 4096)
  name = bert/encoder/transformer/group_0/inner_group_0/attention_1/output/dense/bias:0, shape = (4096,)
  name = bert/encoder/transformer/group_0/inner_group_0/LayerNorm/beta:0, shape = (4096,)
  name = bert/encoder/transformer/group_0/inner_group_0/LayerNorm/gamma:0, shape = (4096,)
  name = bert/encoder/transformer/group_0/inner_group_0/ffn_1/intermediate/dense/kernel:0, shape = (4096, 16384)
  name = bert/encoder/transformer/group_0/inner_group_0/ffn_1/intermediate/dense/bias:0, shape = (16384,)
  name = bert/encoder/transformer/group_0/inner_group_0/ffn_1/intermediate/output/dense/kernel:0, shape = (16384, 4096)
  name = bert/encoder/transformer/group_0/inner_group_0/ffn_1/intermediate/output/dense/bias:0, shape = (4096,)
  name = bert/encoder/transformer/group_0/inner_group_0/LayerNorm_1/beta:0, shape = (4096,)
  name = bert/encoder/transformer/group_0/inner_group_0/LayerNorm_1/gamma:0, shape = (4096,)
  name = bert/pooler/dense/kernel:0, shape = (4096, 4096)
  name = bert/pooler/dense/bias:0, shape = (4096,)
  name = cls/predictions/transform/dense/kernel:0, shape = (4096, 128)
  name = cls/predictions/transform/dense/bias:0, shape = (128,)
  name = cls/predictions/transform/LayerNorm/beta:0, shape = (128,)
  name = cls/predictions/transform/LayerNorm/gamma:0, shape = (128,)
  name = cls/predictions/output_bias:0, shape = (33001,)
  name = cls/seq_relationship/output_weights:0, shape = (2, 4096)
  name = cls/seq_relationship/output_bias:0, shape = (2,)
```

```
I1211 09:12:03.138811 140024623753024 basic_session_run_hooks.py:262] loss = 10.181114, step = 1000
I1211 09:26:09.008900 140024623753024 basic_session_run_hooks.py:260] loss = 7.6005945, step = 2000 (845.870 sec)
I1211 09:40:12.286720 140024623753024 basic_session_run_hooks.py:260] loss = 7.645055, step = 3000 (843.278 sec)
I1211 09:54:16.299396 140024623753024 basic_session_run_hooks.py:260] loss = 7.6258326, step = 4000 (844.013 sec)
I1211 10:08:19.825035 140024623753024 basic_session_run_hooks.py:260] loss = 7.363482, step = 5000 (843.526 sec)
I1211 10:22:25.123742 140024623753024 basic_session_run_hooks.py:260] loss = 6.8203845, step = 6000 (845.299 sec)
I1211 10:36:29.082039 140024623753024 basic_session_run_hooks.py:260] loss = 6.5194592, step = 7000 (843.958 sec)
I1211 10:50:31.896788 140024623753024 basic_session_run_hooks.py:260] loss = 6.854472, step = 8000 (842.815 sec)
I1211 11:04:36.726402 140024623753024 basic_session_run_hooks.py:260] loss = 7.0283566, step = 9000 (844.830 sec)
I1211 11:19:29.132026 140024623753024 basic_session_run_hooks.py:260] loss = 6.5989375, step = 10000 (892.406 sec)
I1211 11:33:32.866184 140024623753024 basic_session_run_hooks.py:260] loss = 6.550018, step = 11000 (843.734 sec)
...
I1211 13:41:01.039676 140024623753024 basic_session_run_hooks.py:260] loss = 6.5004697, step = 20000 (894.206 sec)
...
I1211 16:02:31.998177 140024623753024 basic_session_run_hooks.py:260] loss = 7.100818, step = 30000 (892.416 sec)
...
I1211 18:24:15.941736 140024623753024 basic_session_run_hooks.py:260] loss = 6.5937705, step = 40000 (896.439 sec)
...
I1211 20:45:50.533722 140024623753024 basic_session_run_hooks.py:260] loss = 5.950697, step = 50000 (895.989 sec)
...
I1211 23:07:25.169874 140024623753024 basic_session_run_hooks.py:260] loss = 6.789865, step = 60000 (893.845 sec)
...
I1212 01:28:58.518174 140024623753024 basic_session_run_hooks.py:260] loss = 6.453152, step = 70000 (892.751 sec)
...
I1212 03:50:25.943136 140024623753024 basic_session_run_hooks.py:260] loss = 6.7387037, step = 80000 (889.578 sec)
```

What's wrong?

lanzhzh commented 4 years ago

For xxlarge, you will need to start from a network with a smaller number of layers, as described in the paper, or reduce the initializer_range to 0.01.
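The config tweak suggested above can be sketched as follows. This is a minimal illustration, not the project's tooling; the config fragment only includes a few of the fields quoted later in this thread.

```python
import json

# Illustrative fragment of an ALBERT xxlarge config (values taken from
# this thread); a real config file has more fields.
config = {
    "hidden_size": 4096,
    "num_hidden_layers": 12,
    "initializer_range": 0.02,  # default truncated-normal stddev
}

# Shrink the initializer range as suggested above before pre-training.
config["initializer_range"] = 0.01

print(json.dumps(config, indent=2))
```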

lanzhzh commented 4 years ago

Can you change the title to be "No decreasing loss when pre-train for xxlarge" so that other people who see the same problem can find the answer here? Thanks!

jwkim912 commented 4 years ago

@lanzhzh Thanks! I'll try it!

jwkim912 commented 4 years ago

@lanzhzh I trained model using this configuration.

```json
{
  "attention_probs_dropout_prob": 0,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0,
  "embedding_size": 128,
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 512,
  "num_attention_heads": 64,
  "num_hidden_layers": 12,
  "num_hidden_groups": 1,
  "net_structure_type": 0,
  "layers_to_keep": [],
  "gap_size": 0,
  "num_memory_blocks": 0,
  "inner_group_num": 1,
  "down_scale_factor": 1,
  "type_vocab_size": 2,
  "vocab_size": 33001
}
```

Should I use fewer than 12 layers?

Danny-Google commented 4 years ago

How about setting the initializer_range to 0.01 first?

jwkim912 commented 4 years ago

@lanzhzh @Danny-Google Thanks! I set the initializer_range to 0.01 first. Fortunately, the loss fell.

However, the loss starts increasing at around step 220k. Do I have to re-train from step 220k?

[image: loss curve]

peregilk commented 4 years ago

@rmard90 I am pre-training "large" from scratch on a single TPU v3-128. My corpus is smaller (1B words). Basic settings are the same as described in https://tfhub.dev/google/albert_large/2, with "initializer_range" set to 0.02 and no dropout.

Even though the settings are very different, the loss is developing very similarly to your example, with a sudden increase and then no further improvement:

[image: loss curve]

seongwook-ham commented 4 years ago

It seems that the batch size is different from the original paper, so the number of training steps and warmup steps should be adapted accordingly, and the learning rate should be scaled as well. See "Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes" (https://arxiv.org/abs/1904.00962). I think in this case the learning rate should be scaled by 1/(2^1.5), the warmup steps stay the same, and the training steps scaled 8 times.
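The square-root scaling rule from that paper can be sketched as below. The batch sizes are assumptions for illustration: the ALBERT paper pre-trains with batch size 4096, and the log above shows a per-core batch of 2, which on a v2-256 would give an effective batch of 512, an 8x reduction, reproducing the 1/(2^1.5) factor mentioned.

```python
import math

def scale_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    # Square-root learning-rate scaling, the heuristic used with
    # large-batch optimizers such as LAMB (arXiv:1904.00962).
    return base_lr * math.sqrt(new_batch / base_batch)

# 4096 -> 512 is an 8x reduction, so the LR shrinks by sqrt(1/8) = 1/(2^1.5).
factor = scale_lr(1.0, base_batch=4096, new_batch=512)
print(factor)  # ~0.3536
```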

Danny-Google commented 4 years ago

@rmard90 This is likely caused by having too large of a learning rate at step 220k. Decreasing the learning rate, or using a more aggressive scaling rule like the one given by @seongwook-ham, should help. It is difficult to see progress from the training loss alone; having a small validation set would be helpful.

peregilk commented 4 years ago

Decreasing the learning rate seems to have solved my problem of "exploding" loss. Progress is better now.

jwkim912 commented 4 years ago

@peregilk Did you resume from the step where the loss increased, or did you retrain from the beginning with the reduced learning rate?

peregilk commented 4 years ago

@rmard90 I applied the reduced learning rate from the start. Compare the current development of the loss with the graph I posted above.

[image: loss curve]

jwkim912 commented 4 years ago

@peregilk Thanks, I'll apply it.

0x0539 commented 4 years ago

Looks like this was resolved. Please reopen if not.

008karan commented 4 years ago

@peregilk Can you share the parameters... or describe how you chose them? Batch size, learning rate, warmup steps, total steps, and anything else you'd like to mention. Did you use https://arxiv.org/pdf/1904.00962.pdf? I am using total steps 200000, LR 0.00088, bs 1200, warmup steps = 3000.
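For reference, the interaction between the warmup and decay settings quoted above can be sketched with a simplified BERT-style schedule (linear warmup to the peak LR, then linear decay toward zero over the full run). This is an approximation for intuition, not the exact implementation in run_pretraining.py.

```python
def lr_at_step(step: int,
               base_lr: float = 0.00088,
               warmup_steps: int = 3000,
               total_steps: int = 200_000) -> float:
    # Simplified BERT-style schedule: linear warmup to base_lr,
    # then linear decay toward zero over the whole run.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / total_steps

print(lr_at_step(1500))     # mid-warmup: half of base_lr
print(lr_at_step(100_000))  # halfway through training
print(lr_at_step(200_000))  # end of training: 0.0
```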

happy-nlp commented 2 years ago

This issue is really helpful.