togethercomputer / OpenChatKit


Ran into an issue when training the model on a single GPU #135

Open · yxy123 opened this issue 1 year ago

yxy123 commented 1 year ago

I ran bash training/finetune_RedPajama-INCITE-Chat-3B-v1.sh with the configuration changed as below:

    --lr 1e-5 --seq-length 2048 --batch-size 8 --micro-batch-size 1 --gradient-accumulate-step 1 \
    --num-layers 2 --embedding-dim 2560 \
    --world-size 1 --pipeline-group-size 1 --data-group-size 1 \

    (trap 'kill 0' SIGINT; \
    python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 0 --rank 0 \
        & \

My environment:

    GPU: NVIDIA RTX A4000
    Graphics card memory: 16 GB
    Number of CPUs available for use: 8
    Memory: 60 GB
    Free space: 200 GB

error log:

    Rank 0 node forward pass 0/1 takes 1.84s
    {'loss': 16.892578125, 'lr': 1e-05}
    Rank 0 node backward pass 0/1 takes 1.09s
    cuda:0 cuda:0 cuda:0
    !!! Warning: find inf in fp16 optimizer-step() !!!
    /root/miniconda3/envs/myconda/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
      warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
    after cuda sync 0
    Rank 0 node optimizer step takes 0.07s
    Rank 0 node whole iteration takes 3.00s

    Rank 0 node forward pass 0/1 takes 0.36s
    {'loss': 16.744140625, 'lr': 9e-06}

Despite this error log, a model still seems to have been produced in model_ckpts/redpajama-incite-chat-3b-sample/checkpoint_10/. I don't know whether this model is correct?
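
One rough way to sanity-check the saved checkpoint is to open the shard files and list the tensors and their shapes; for this run the embedding dimension should come out as 2560. This is only a sketch, assuming checkpoint_10 holds PyTorch .pt files containing plain tensor dicts (the exact file names may differ):

```bash
# Rough sanity check: print the tensors and shapes stored in each .pt shard
# (assumes checkpoint_10 holds plain PyTorch state-dict shards).
for f in model_ckpts/redpajama-incite-chat-3b-sample/checkpoint_10/*.pt; do
    echo "== $f"
    python - "$f" <<'EOF'
import sys
import torch

shard = torch.load(sys.argv[1], map_location="cpu")
for name, value in shard.items():
    if hasattr(value, "shape"):
        print(f"  {name}: {tuple(value.shape)} {value.dtype}")
EOF
done
```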

yxy123 commented 1 year ago

Then, when converting the weights to Hugging Face format, I got an error that the tensor dimension and the target size of the expand operation do not match:

    (myconda) root@ZaodV6:/mnt/tet/OpenChatKit-main# mkdir huggingface_models \
        && python tools/convert_to_hf_gptneox.py \
           --config-name EleutherAI/pythia-6.9b-deduped \
           --ckpt-path model_ckpts/redpajama-incite-chat-3b-sample/checkpoint_10 \
           --save-path huggingface_models/Pythia-Chat-Base-3B \
           --n-stages 4 \
           --n-layer-per-stage 8 \
           --fp16

    loading config...
    loaded config.
    loading tokenizer...
    loaded tokenizer.
    creating empty model...
    created empty model.
    loading model ckpt...
    loading stage 0
    Traceback (most recent call last):
      File "tools/convert_to_hf_gptneox.py", line 123, in <module>
        load_decentralized_checkpoint(
      File "tools/convert_to_hf_gptneox.py", line 48, in load_decentralized_checkpoint
        model.gpt_neox.embed_in.weight.data[:] = _tmp['embed_in.weight']
    RuntimeError: The expanded size of the tensor (4096) must match the existing size (2560) at non-singleton dimension 1. Target sizes: [50432, 4096]. Tensor sizes: [50432, 2560]
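
The 4096-vs-2560 mismatch in the traceback corresponds to the hidden sizes of the two models involved: EleutherAI/pythia-6.9b-deduped (the --config-name passed to the converter) uses a hidden size of 4096, while the fine-tuned checkpoint was trained with --embedding-dim 2560. A quick, hedged way to confirm the two hidden sizes from the published Hugging Face configs:

```bash
# Compare the hidden size of the config given to the converter with that of
# the model that was actually fine-tuned (requires `transformers`).
python - <<'EOF'
from transformers import AutoConfig

for name in ("EleutherAI/pythia-6.9b-deduped",
             "togethercomputer/RedPajama-INCITE-Chat-3B-v1"):
    cfg = AutoConfig.from_pretrained(name)
    print(f"{name}: hidden_size = {cfg.hidden_size}")
EOF
```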

ChengYen-Tang commented 1 year ago


This model has 32 layers, so if you only have one GPU, --num-layers must be set to 32: https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3B-v1/blob/main/config.json#L16
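
For reference, a sketch of how the changed flags might look for a single-GPU run, with the single pipeline stage owning all 32 layers; the other values are kept from the configuration quoted above, and a 3B model may still be tight in 16 GB of GPU memory:

```bash
# Sketch (assumed): the changed flags inside the ARGS block of
# training/finetune_RedPajama-INCITE-Chat-3B-v1.sh for a single-GPU run.
# With --pipeline-group-size 1 the single stage must own all 32 layers.
--lr 1e-5 --seq-length 2048 --batch-size 8 --micro-batch-size 1 --gradient-accumulate-step 1 \
--num-layers 32 --embedding-dim 2560 \
--world-size 1 --pipeline-group-size 1 --data-group-size 1 \
```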

ChengYen-Tang commented 1 year ago


For the weight-conversion error, see: https://github.com/togethercomputer/OpenChatKit/issues/86#issuecomment-1667192363
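
In general, the converter's --config-name and stage split need to match the checkpoint that was actually trained. A hedged sketch for the single-GPU, 32-layer run above; the --save-path name is only illustrative, and whether the GPT-NeoX converter accepts the RedPajama config directly is an assumption:

```bash
# Sketch only: use a config whose hidden size (2560) matches the fine-tuned
# checkpoint and mirror the single-stage training layout (1 stage x 32 layers).
python tools/convert_to_hf_gptneox.py \
    --config-name togethercomputer/RedPajama-INCITE-Chat-3B-v1 \
    --ckpt-path model_ckpts/redpajama-incite-chat-3b-sample/checkpoint_10 \
    --save-path huggingface_models/RedPajama-INCITE-Chat-3B-finetuned \
    --n-stages 1 \
    --n-layer-per-stage 32 \
    --fp16
```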