Open · premmotgi opened this issue 1 month ago
### System Info
Total number of GPUs: 8 x Gaudi3
Dataset: databricks-dolly-15k

### Tasks
An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

### Reproduction
When I use the k8s sample example for LoRA fine-tuning, the Llama 3 8B model works fine, but the 70B model fails with an out-of-memory error.

Error (rank 6 fails with the same traceback as rank 7):

```
Downloading shards: 100%|██████████| 30/30 [1:32:34<00:00, 185.16s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [00:02<00:00, 12.13it/s]
gaudi-llm-ds-ft-worker-0: [2024-09-11 23:01:54,538] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2831
gaudi-llm-ds-ft-worker-0: [2024-09-11 23:01:54,700] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2832
Map: 100%|██████████| 14411/14411 [00:04<00:00, 3323.03 examples/s]
Map: 100%|██████████| 600/600 [00:00<00:00, 4528.23 examples/s]
Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 9.71MB/s]
gaudi-llm-ds-ft-worker-0: [2024-09-11 23:02:10,224] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2833
gaudi-llm-ds-ft-worker-0: trainable params: 16,384,000 || all params: 70,570,090,496 || trainable%: 0.0232
gaudi-llm-ds-ft-worker-0: [2024-09-11 23:02:24,041] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2834
gaudi-llm-ds-ft-worker-0: [2024-09-11 23:02:39,007] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2835
gaudi-llm-ds-ft-worker-0: [2024-09-11 23:02:39,008] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2836
gaudi-llm-ds-ft-worker-0: [rank7]: Traceback (most recent call last):
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/optimum-habana/examples/language-modeling/run_lora_clm.py", line 935, in <module>
gaudi-llm-ds-ft-worker-0: [rank7]:     main()
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/optimum-habana/examples/language-modeling/run_lora_clm.py", line 891, in main
gaudi-llm-ds-ft-worker-0: [rank7]:     trainer = GaudiTrainer(
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 216, in __init__
gaudi-llm-ds-ft-worker-0: [rank7]:     super().__init__(
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 535, in __init__
gaudi-llm-ds-ft-worker-0: [rank7]:     self._move_model_to_device(model, args.device)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 299, in _move_model_to_device
gaudi-llm-ds-ft-worker-0: [rank7]:     model = model.to(device)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 179, in wrapped_to
gaudi-llm-ds-ft-worker-0: [rank7]:     result = self.original_to(*args, **kwargs)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1176, in to
gaudi-llm-ds-ft-worker-0: [rank7]:     return self._apply(convert)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
gaudi-llm-ds-ft-worker-0: [rank7]:     module._apply(fn)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
gaudi-llm-ds-ft-worker-0: [rank7]:     module._apply(fn)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
gaudi-llm-ds-ft-worker-0: [rank7]:     module._apply(fn)
gaudi-llm-ds-ft-worker-0: [rank7]:   [Previous line repeated 4 more times]
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 804, in _apply
gaudi-llm-ds-ft-worker-0: [rank7]:     param_applied = fn(param)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1162, in convert
gaudi-llm-ds-ft-worker-0: [rank7]:     return t.to(
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 57, in __torch_function__
gaudi-llm-ds-ft-worker-0: [rank7]:     return super().__torch_function__(func, types, new_args, kwargs)
gaudi-llm-ds-ft-worker-0: [rank7]: RuntimeError: [Rank:7] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::469762048 (448)MB
gaudi-llm-ds-ft-worker-0: Internal Error: Received signal - Segmentation fault
gaudi-llm-ds-ft-worker-0: [rank6]: RuntimeError: [Rank:6] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::469762048 (448)MB
gaudi-llm-ds-ft-worker-0: Internal Error: Received signal - Segmentation fault
```

### Expected behavior
Successfully run model customization of the Llama 3.1 70B model.
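For what it's worth, the failure point in the traceback is `_move_model_to_device`, i.e. each rank tries to copy a complete replica of the model onto its own card before any sharding happens. A back-of-the-envelope check suggests a 70B bf16 replica simply cannot fit on one card (a sketch; it assumes bf16 weights and Gaudi3's 128 GB of device memory — adjust if your setup differs):

```python
# Rough capacity check for one rank holding a full model replica.
# Assumptions: bf16 (2 bytes/param) weights, 128 GB HBM per Gaudi3 card.
# The parameter count comes from the log line "all params: 70,570,090,496".
total_params = 70_570_090_496
bytes_per_param = 2  # bf16

replica_gib = total_params * bytes_per_param / 1024**3
hbm_gib = 128e9 / 1024**3  # 128 GB marketed capacity ~= 119 GiB

print(f"full bf16 replica : {replica_gib:.1f} GiB")  # ~131.4 GiB
print(f"per-card HBM      : {hbm_gib:.1f} GiB")      # ~119.2 GiB
# The replica alone exceeds the card, which matches the PT_DEVMEM failure
# on the final ~448 MB allocation during model.to(device).
```

Under the same assumptions, an 8B bf16 replica is only about 15 GiB, which would explain why the identical recipe succeeds for Llama 3 8B.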
@premmotgi If I understand correctly, you're trying to replicate https://github.com/huggingface/optimum-habana/blob/main/examples/kubernetes/ci/multi-card-lora-clm-values.yaml with Llama 3.1 70B, right?
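If that's the setup being replicated, one thing worth checking is whether the DeepSpeed config uses ZeRO stage 3. With stage 1/2, every rank still keeps a full copy of the parameters, which (per the arithmetic above) cannot fit for 70B; stage 3 instead partitions the parameters across the 8 ranks at initialization, so no rank ever materializes a full replica. Below is a minimal sketch of such a config — the keys are standard DeepSpeed options, but the actual file and values shipped with the k8s example may differ, so treat this as an assumption to adapt rather than the tested configuration:

```python
import json

# Sketch of a ZeRO-3 DeepSpeed config (standard DeepSpeed keys; values here
# are illustrative assumptions, not the k8s example's tested settings).
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                  # partition params/grads/optimizer state
        "overlap_comm": False,
        "contiguous_gradients": True,
        # gather full 16-bit weights only when saving a checkpoint
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    # "auto" lets the HF Trainer fill these in from its own arguments
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}

with open("ds_config_zero3.json", "w") as f:  # file name is illustrative
    json.dump(ds_config, f, indent=2)
```

With stage 3 in the config, the `transformers` DeepSpeed integration creates the model under `deepspeed.zero.Init`, so each rank should hold roughly 1/8 of the bf16 weights (~16 GiB) plus the small LoRA optimizer state, instead of hitting the OOM in `model.to(device)`.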