huggingface / optimum-habana

Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)

GPT-NeoX fine-tuning does not work (segmentation fault) since 1.7.0 #403

Closed ZhaiFeiyue closed 1 year ago

ZhaiFeiyue commented 1 year ago

System Info

optimum-habana version >1.7.0
deepspeed 1.11.0

Reproduction

python3 /root/repos/optimum-habana/examples/gaudi_spawn.py \
    --hostfile /root/repos/hostsfile --world_size 8 --use_deepspeed \
    /root/repos/optimum-habana/examples/language-modeling/run_clm.py \
    --deepspeed /root/repos/optimum-habana/tests/configs/deepspeed_zero_2.json \
    --model_name_or_path 'EleutherAI/gpt-neox-20b' \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --num_train_epochs 1 \
    --do_train \
    --output_dir ~/gpt-neox-20b \
    --gaudi_config_name Habana/gpt2 \
    --gradient_checkpointing \
    --use_habana \
    --use_lazy_mode \
    --throughput_warmup_steps 3 \
    --overwrite_output_dir \
    --use_hpu_graphs_for_inference
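
For reference, the file passed with --hostfile is expected to follow DeepSpeed's standard hostfile format, one "<hostname> slots=<num_devices>" entry per node. The actual contents of /root/repos/hostsfile are not shown in this issue; the sketch below is only an assumption based on the two node IPs that appear in the crash log.

# hypothetical /root/repos/hostsfile for the two nodes seen in the log (8 HPUs each)
10.233.250.163 slots=8
10.233.168.102 slots=8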

Crash log:

10.233.250.163: Loading extension module utils...
10.233.250.163: [INFO|trainer.py:680] 2023-09-12 09:11:29,269 >> ***** Running training *****
10.233.250.163: [INFO|trainer.py:681] 2023-09-12 09:11:29,269 >>   Num examples = 2,334
10.233.250.163: [INFO|trainer.py:682] 2023-09-12 09:11:29,269 >>   Num Epochs = 1
10.233.250.163: [INFO|trainer.py:683] 2023-09-12 09:11:29,269 >>   Instantaneous batch size per device = 2
10.233.250.163: [INFO|trainer.py:686] 2023-09-12 09:11:29,269 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
10.233.250.163: [INFO|trainer.py:687] 2023-09-12 09:11:29,269 >>   Gradient Accumulation steps = 1
10.233.250.163: [INFO|trainer.py:688] 2023-09-12 09:11:29,269 >>   Total optimization steps = 73
10.233.250.163: [INFO|trainer.py:689] 2023-09-12 09:11:29,274 >>   Number of trainable parameters = 20,554,567,680
10.233.168.102: Time to load utils op: 0.0013871192932128906 seconds
10.233.168.102: Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
10.233.168.102: No modifications detected for re-loaded extension module utils, skipping build step...
10.233.168.102: Loading extension module utils...
10.233.168.102: Time to load utils op: 0.00066375732421875 seconds
10.233.168.102: Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
10.233.168.102: No modifications detected for re-loaded extension module utils, skipping build step...
10.233.168.102: Loading extension module utils...
10.233.168.102: Time to load utils op: 0.0005764961242675781 seconds
10.233.168.102: Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
10.233.168.102: No modifications detected for re-loaded extension module utils, skipping build step...
10.233.168.102: Loading extension module utils...
10.233.250.163: [2023-09-12 09:11:30,324] [INFO] [checkpointing.py:605:forward] Activation Checkpointing Information
10.233.250.163: [2023-09-12 09:11:30,324] [INFO] [checkpointing.py:606:forward] ----Partition Activations False, CPU CHECKPOINTING False
10.233.250.163: [2023-09-12 09:11:30,324] [INFO] [checkpointing.py:607:forward] ----contiguous Memory Checkpointing False with None total layers
10.233.250.163: [2023-09-12 09:11:30,324] [INFO] [checkpointing.py:609:forward] ----Synchronization False
10.233.250.163: [2023-09-12 09:11:30,324] [INFO] [checkpointing.py:610:forward] ----Profiling time in checkpointing False
10.233.168.102: Internal Error: Received signal - Segmentation fault
10.233.168.102: Internal Error: Received signal - Segmentation fault
10.233.250.163: 
  0%|          | 0/73 [00:00<?, ?it/s]Internal Error: Received signal - Segmentation fault
10.233.250.163: Internal Error: Received signal - Segmentation fault
10.233.250.163: Internal Error: Received signal - Segmentation fault
10.233.250.163: Internal Error: Received signal - Segmentation fault
10.233.250.163: Internal Error: Received signal - Segmentation fault
10.233.250.163: Internal Error: Received signal - Segmentation fault
10.233.250.163: Internal Error: Received signal - Segmentation fault
10.233.168.102: Internal Error: Received signal - Segmentation fault
10.233.168.102: Internal Error: Received signal - Segmentation fault
10.233.168.102: Internal Error: Received signal - Segmentation fault
10.233.168.102: Internal Error: Received signal - Segmentation fault
10.233.250.163: Internal Error: Received signal - Segmentation fault
10.233.168.102: Internal Error: Received signal - Segmentation fault 

Expected behavior

Training should run without crashing, as it does with optimum-habana 1.6.1.

ZhaiFeiyue commented 1 year ago

@ankurhabana will follow up on this issue.

asharmahabana commented 1 year ago

@regisss Please let us know whether the fix will be available in 1.7.5. We need it for the SynapseAI 1.12.0 release. Thanks!

regisss commented 1 year ago

@asharmahabana I don't have an ETA for this fix; I won't have the bandwidth to work on it before next week.

regisss commented 1 year ago

Okay, after a quick look at the changes between v1.6.1 and v1.7.0, I suspect this issue comes from using FusedRoPE during training. @ZhaiFeiyue @asharmahabana Could you try #410 and let me know if it works on your side? You'll need a batch size of 1 if you use a single Gaudi2 node.

We probably need the same fix for Llama; I'll add it in this PR once it's approved.
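
For reference, a minimal sketch of what "only use FusedRoPE outside of training" could look like. This is not the actual diff in #410; the helper name fused_rope_fn, the tensor shapes, and the fallback indexing are illustrative assumptions, not optimum-habana's real API.

import torch


def rotate_half(x):
    # Standard rotary-embedding helper: swap and negate the two halves of the last dim.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope(q, k, cos, sin, position_ids, fused_rope_fn=None, training=False):
    # Hypothetical gating: use the fused HPU kernel only at inference time and fall
    # back to plain PyTorch math while training, where the segfault was observed.
    if fused_rope_fn is not None and not training:
        return fused_rope_fn(q, cos, sin, position_ids), fused_rope_fn(k, cos, sin, position_ids)

    # Eager fallback, same math as transformers' apply_rotary_pos_emb (shapes assumed:
    # q/k are [batch, num_heads, seq_len, head_dim], cos/sin are [max_seq_len, head_dim]).
    cos = cos[position_ids].unsqueeze(1)
    sin = sin[position_ids].unsqueeze(1)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

Gating on the module's training flag would keep HPU-graph inference on the fused path while training stays on eager RoPE, which appears to be the behavior #410 targets.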

mandy-li commented 1 year ago

@regisss @ZhaiFeiyue, our customer ran Llama 2 fine-tuning without any problem. Disabling fused RoPE will cause a performance drop, which our customer is currently benchmarking. @schoi-habana, please help them debug what went wrong.

regisss commented 1 year ago

@mandy-li Okay, I'll enable it again then. Was this Llama fine-tuning done with DeepSpeed?

schoi-habana commented 1 year ago

@regisss Yes, the exact same command worked for Llama fine-tuning with DeepSpeed.

mandy-li commented 1 year ago

@regisss, the customer used LoRA fine-tuning for Llama 2 and didn't hit any problem. @schoi-habana is debugging to see whether DeepSpeed caused the issue.

regisss commented 1 year ago

I just opened #413 to revert these changes for Llama and Falcon.