laiguokun / Funnel-Transformer


Pretraining Issues #6

Open nemani opened 4 years ago

nemani commented 4 years ago

Hey, I am trying to train Funnel-Transformer with the following hparams. The CPU usage on my TPUv3-8 has not gone above 4% in the 90 hours the code has been running, and training seems very slow: it took approximately 90 hours for 9,000 steps.

Do you think something is wrong here, or is this time expected?

{
    "block_size": "6_6_6",
    "d_embed": 1024,
    "d_head": 64,
    "d_inner": 4096,
    "d_model": 1024,
    "decoder_size": "2",
    "dropact": 0.0,
    "dropatt": 0.1,
    "dropout": 0.1,
    "ff_activation": "gelu",
    "init": "truncated_normal",
    "init_range": 0.1,
    "init_std": 0.02,
    "n_head": 16,
    "pool_q_only": true,
    "pooling_size": 2,
    "pooling_type": "mean",
    "rel_attn_type": "factorized",
    "separate_cls": true,
    "vocab_size": 32000
}
zihangdai commented 4 years ago

For TPUv3-8, the model could be a bit too large (you are trying to train a model 2x larger than BERT-base with 1/4 of the original computation). But the reported speed is still way too slow (90 hours for 9K steps). So, to help you, could you provide more information on the sequence length, batch size, your data (storage, processing), and other experimental settings?
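
For a rough sense of scale, here is a back-of-the-envelope parameter estimate under the hparams above (a sketch only; it ignores biases, layer norms, relative-attention parameters, and the 2-layer decoder, so the exact count will differ):

# Rough parameter estimate for B6-6-6, d_model=1024, d_inner=4096, 32k vocab.
d_model, d_inner, vocab_size = 1024, 4096, 32000
n_layers = 6 + 6 + 6  # encoder blocks only

attn_per_layer = 4 * d_model * d_model   # Q, K, V, and output projections
ffn_per_layer = 2 * d_model * d_inner    # two feed-forward matrices
embedding = vocab_size * d_model

total = n_layers * (attn_per_layer + ffn_per_layer) + embedding
print(f"~{total / 1e6:.0f}M parameters (BERT-base is ~110M)")

This prints roughly 259M parameters, which is consistent with the "2x larger than BERT-base" comparison above.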

nemani commented 4 years ago

I have 100 TFRecord data files, each around 700 MB; I know it's a lot of data. The files are stored in a Google Cloud Storage bucket.

Exact command I ran:

python3 pretrain.py \
  --use_tpu=True \
  --tpu=funnel1 \
  --model_dir=gs://{path_to_model_dir} \
  --use_bfloat16=True \
  --num_hosts=1 \
  --num_core_per_host=8 \
  --loss_type=mlm \
  --num_passes=5 \
  --record_dir=gs://{path_to_tfrecord_dir} \
  --train_batch_size=256 \
  --learning_rate=1e-4 \
  --seq_len=512 \
  --num_predict=85 \
  --train_steps=1000000 \
  --warmup_steps=10000 \
  --block_size="6_6_6" \
  --decoder_size="2" \
  --d_model=1024 \
  --d_embed=1024 \
  --n_head=16 \
  --tokenizer_path={path_to_vocab_file} \
  --d_head=64 \
  --d_inner=4096 \
  --pool_q_only=True \
  --verbose=False
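
One thing worth ruling out is the input pipeline: with TPU utilization this low, the TPU may simply be starved by the GCS reads. Below is a minimal sketch for timing raw TFRecord throughput outside of training (assuming TF 2.x eager execution; the glob pattern is a placeholder for the actual record_dir layout):

import time
import tensorflow as tf

# Placeholder pattern; substitute the real gs://{path_to_tfrecord_dir} layout.
files = tf.io.gfile.glob("gs://{path_to_tfrecord_dir}/*.tfrecord")

ds = tf.data.TFRecordDataset(files, num_parallel_reads=8)
ds = ds.batch(256).prefetch(tf.data.experimental.AUTOTUNE)

start, n = time.time(), 0
for batch in ds.take(100):  # time 100 batches of 256 raw records
    n += int(batch.shape[0])
print(f"{n} records in {time.time() - start:.1f}s")

If this alone is slow, the bottleneck is reading from the bucket rather than the model itself.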
nemani commented 4 years ago

Not sure if this is related, but the logs are filled with lines like the following:

transport.py:157] Attempting refresh to obtain initial access_token
zihangdai commented 4 years ago

What I would recommend right now is to decrease d_model and d_inner. You are currently using 1024/4096, which is too large for TPUv3-8. But I currently don't have access to TPUs and cannot try the exact setting myself. From my own experience with TPUv3-8, I would probably try B6-6-6H512 with loss_type=electra and seq_len=128 first. The performance won't be as good as a large model's, but under this setting the training will finish within 1 or 2 days.

nemani commented 4 years ago

Okay, thanks, I'll try this config!


nemani commented 4 years ago

If I decrease the seq_len I will have to create the TFRecords again, so I am keeping it the same for now. Do you think I should increase the batch size? The Google documentation seems to suggest using 1024.

https://cloud.google.com/tpu/docs/troubleshooting#batch-too-small

Running with the following:

  ...
  --num_predict=85 \
  --train_steps=1000000 \
  --warmup_steps=10000 \
  --loss_type=electra \
  --tpu=funnel4 \
  --num_passes=5 \
  --learning_rate=1e-4 \
  --seq_len=512 \
  --block_size="6_6_6" \
  --decoder_size="2" \
  --d_model=512 \
  --d_embed=512 \
  --n_head=8 \
  --d_head=64 \
  --d_inner=2048 \
  --train_batch_size=512
zihangdai commented 4 years ago

That document was written for training convnets on images. With TPUv3-8 you only have 8 TPU cores on a single host, so 256 is actually a very demanding batch size (32 per core). Similarly, using seq_len=512 will also make things quite slow.

As a comparison, think about the recommended batch size per core for just fine-tuning pretrained models on 8 16GB GPUs with length-512 sequences.
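
To make the per-core numbers concrete (a rough sketch; actual memory use and speed also depend on the model size and bfloat16 settings):

# Per-core batch size and a rough relative attention cost for the settings discussed.
num_cores = 8  # TPUv3-8: 8 cores on a single host

for global_bsz, seq_len in [(256, 512), (512, 512), (256, 128)]:
    per_core = global_bsz // num_cores
    # Self-attention work grows roughly with seq_len**2 per example.
    rel_cost = (per_core * seq_len ** 2) / (32 * 128 ** 2)
    print(f"bsz={global_bsz:4d}, seq_len={seq_len:3d} -> {per_core} examples/core, "
          f"~{rel_cost:.0f}x attention work vs. 32 examples/core at length 128")

By this rough measure, the original run (256 x 512) does about 16x the attention work per core of the suggested 128-length setting, and the 512 x 512 run about 32x.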