huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Running run_translation.py with mt5 model, but loss is always 0.0 #22467

Closed SefaZeng closed 1 year ago

SefaZeng commented 1 year ago

System Info

transformers version 4.28.0.dev

Who can help?

No response

Information

Tasks

Reproduction

  1. Training script:
    python3 -m torch.distributed.launch --nproc_per_node=8 \
    --nnodes=${WORLD_SIZE} --node_rank=${RANK} --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT ${code_dir}/run_translation.py \
    --model_name_or_path ${work_dir}/../pretrain_models/mt0-base \
    --train_file ${data_dir}/ja2zh.json \
    --validation_file ${data_dir}/ja2zh-head10.json \
    --source_lang ja \
    --target_lang zh \
    --source_prefix "translate Japanese to Chinese: " \
    --warmup_ratio 0.1 \
    --save_total_limit 10 \
    --save_steps 5000 \
    --logging_steps 1 \
    --weight_decay 0.001 \
    --adam_beta2 0.98 \
    --learning_rate 2e-4 \
    --num_train_epochs 1 \
    --gradient_accumulation_steps 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --cache_dir ${data_dir}/cache/ \
    --do_train \
    --do_eval \
    --fp16 \
    --output_dir ${ckpt_dir}/hf \
    --preprocessing_num_workers 40 \
    2>&1 |tee ${LOG_FILE}

    mt0-base is cloned from the Hugging Face Hub, and the loss is always 0.0:

    [INFO|trainer.py:598] 2023-03-30 09:56:13,151 >> Using cuda_amp half precision backend
    /home/user/miniconda/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
    warnings.warn(
    /home/user/miniconda/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
    warnings.warn(
    [INFO|trainer.py:1743] 2023-03-30 09:56:13,677 >> ***** Running training *****
    [INFO|trainer.py:1744] 2023-03-30 09:56:13,677 >>   Num examples = 31729970
    [INFO|trainer.py:1745] 2023-03-30 09:56:13,677 >>   Num Epochs = 1
    [INFO|trainer.py:1746] 2023-03-30 09:56:13,677 >>   Instantaneous batch size per device = 8
    [INFO|trainer.py:1747] 2023-03-30 09:56:13,677 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
    [INFO|trainer.py:1748] 2023-03-30 09:56:13,677 >>   Gradient Accumulation steps = 1
    [INFO|trainer.py:1749] 2023-03-30 09:56:13,677 >>   Total optimization steps = 991562
    [INFO|trainer.py:1750] 2023-03-30 09:56:13,680 >>   Number of trainable parameters = 1229581312
    [WARNING|logging.py:280] 2023-03-30 09:56:19,819 >> You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
    [WARNING|logging.py:280] 2023-03-30 09:56:20,010 >> You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
    [W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
    [W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
    {'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
    {'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
    {'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
    {'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
    {'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
    {'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
    {'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
    {'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}

    But if I train an mT5 model from scratch on my MT data, the loss looks fine. Did I miss something? Any advice is appreciated! Thanks in advance!
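
    A quick way to narrow this down (a minimal sketch, assuming the public bigscience/mt0-base checkpoint; substitute the local clone used above) is to compare a single-batch loss in fp32 and in fp16:

    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    name = "bigscience/mt0-base"  # assumption: the public mt0-base checkpoint
    device = "cuda" if torch.cuda.is_available() else "cpu"

    tok = AutoTokenizer.from_pretrained(name)
    enc = tok("translate Japanese to Chinese: こんにちは", return_tensors="pt").to(device)
    labels = tok(text_target="你好", return_tensors="pt").input_ids.to(device)

    # The fp16 pass is only meaningful on a GPU; half-precision matmuls may not run on CPU.
    dtypes = (torch.float32, torch.float16) if device == "cuda" else (torch.float32,)
    for dtype in dtypes:
        model = AutoModelForSeq2SeqLM.from_pretrained(name, torch_dtype=dtype).to(device)
        with torch.no_grad():
            loss = model(**enc, labels=labels).loss
        print(dtype, loss.item())

    If the fp32 loss is finite but the fp16 loss comes out as nan/inf, the problem is numerical overflow in half precision rather than the data or the training script.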

Expected behavior

The loss is larger than 0.0 and the model parameters are updated.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Victordongy commented 1 year ago

It seems this issue still persists. Maybe we could consider re-opening it. I have a similar issue: the loss becomes 0 after running one iteration with 8-bit or fp16; the transformers version is 4.32.0. @younesbelkada, could you take a look at this issue?

My system info is as follows:

ArthurZucker commented 1 year ago

Inviting you to read #10956, which has a very detailed explanation and a potential solution for you 😉
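
For anyone landing here: the workaround usually suggested for T5/mT5 checkpoints that overflow in half precision is to avoid pure fp16, e.g. pass --bf16 instead of --fp16 to run_translation.py on hardware that supports bfloat16, or drop the flag entirely for full fp32. A minimal sketch of the equivalent Seq2SeqTrainingArguments (the values mirror the command above; the output_dir is a placeholder):

    from transformers import Seq2SeqTrainingArguments

    # Mirrors the command-line run above; only the mixed-precision flag changes.
    args = Seq2SeqTrainingArguments(
        output_dir="ckpt/hf",            # placeholder
        learning_rate=2e-4,
        per_device_train_batch_size=8,
        bf16=True,                       # instead of fp16=True; needs bf16-capable hardware
    )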

Victordongy commented 1 year ago

> Inviting you to read #10956, which has a very detailed explanation and a potential solution for you 😉

Hi @ArthurZucker, quoting from https://github.com/huggingface/transformers/pull/10956#issuecomment-1238724959, it seems the experimental change has not been merged, and there are not many related performance experiments either. However, from PR #20760 I noticed that the 8-bit workaround first converts some of the modules to fp16 while leaving the others unchanged. I wonder whether this might also be a feasible solution for fp16 training?
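
A small sketch of how to check whether that mechanism already kicks in for an fp16 load (assuming the public bigscience/mt0-base checkpoint): load the model with torch_dtype=torch.float16 and list whichever parameters were left in fp32. Whether keeping those modules in fp32 is enough to stabilise full fp16 training is exactly the open question here.

    import torch
    from transformers import AutoModelForSeq2SeqLM

    # Assumed checkpoint id; substitute the local mt0 clone used in the original command.
    model = AutoModelForSeq2SeqLM.from_pretrained(
        "bigscience/mt0-base", torch_dtype=torch.float16
    )

    # Report which parameters, if any, were kept in fp32 by the loading code.
    fp32 = [n for n, p in model.named_parameters() if p.dtype == torch.float32]
    print(f"{len(fp32)} parameters kept in fp32; first few: {fp32[:3]}")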