Closed: hiprince closed this issue 3 months ago.
Hi,
For full-parameter fine-tuning of the Llama3-8B model, you will need at least 8 GPUs. Thanks.
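For context, here is a rough back-of-the-envelope estimate of why full-parameter fine-tuning needs that much hardware (a minimal sketch using the commonly cited byte counts for mixed-precision Adam; the numbers are an assumption, not taken from this thread):

```python
# Rough GPU-memory estimate for full-parameter fine-tuning of an
# 8B-parameter model with Adam in mixed precision. Activations and
# framework overhead come on top of this.
params = 8e9  # Llama3-8B

bytes_per_param = (
    2    # bf16 weights
    + 2  # bf16 gradients
    + 4  # fp32 master weights
    + 4  # fp32 Adam first moment
    + 4  # fp32 Adam second moment
)  # 16 bytes/param in total

total_gib = params * bytes_per_param / 1024**3
print(f"model + optimizer states: ~{total_gib:.0f} GiB")          # ~119 GiB
print(f"sharded across 8 GPUs (ZeRO): ~{total_gib / 8:.0f} GiB per GPU")
```

Even fully sharded, the remaining per-GPU budget still has to hold activations, which is why fewer cards can be tight.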
Hi,
I tried 4 H800 GPUs and it runs. However, I got this error when saving the final checkpoint. It doesn't sound like a VRAM capacity issue.
Start saving model checkpoint with running states
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/serialization.py", line 619, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/serialization.py", line 853, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:588] . PytorchStreamWriter failed writing file data/1: file write failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/ColossalAI/applications/Colossal-LLaMA/train.py", line 426, in <module>
    main()
  File "/root/ColossalAI/applications/Colossal-LLaMA/train.py", line 386, in main
    save_checkpoint(
  File "/root/ColossalAI/applications/Colossal-LLaMA/colossal_llama/utils/ckpt_io.py", line 56, in save_checkpoint
    booster.save_optimizer(optimizer, os.path.join(save_dir, "optimizer"), shard=True)
  File "/root/.local/lib/python3.9/site-packages/colossalai/booster/booster.py", line 307, in save_optimizer
    self.checkpoint_io.save_optimizer(optimizer, checkpoint, shard, gather_dtensor, prefix, size_per_shard)
  File "/root/.local/lib/python3.9/site-packages/colossalai/checkpoint_io/checkpoint_io_base.py", line 197, in save_optimizer
    self.save_sharded_optimizer(optimizer, checkpoint, gather_dtensor, prefix, size_per_shard)
  File "/root/.local/lib/python3.9/site-packages/colossalai/booster/plugin/low_level_zero_plugin.py", line 140, in save_sharded_optimizer
    save_state_dict(shard, checkpoint_file_path, use_safetensors=False)
  File "/root/.local/lib/python3.9/site-packages/colossalai/checkpoint_io/utils.py", line 328, in save_state_dict
    torch.save(state_dict_cpu, checkpoint_file_path)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/serialization.py", line 620, in save
    return
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/serialization.py", line 466, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:424] . unexpected pos 234881728 vs 234881616
Epoch 0: 100%|██████████| 30/30 [02:53<00:00, 5.77s/it, Loss=0.8085]
[2024-06-12 13:49:44,094] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 124 closing signal SIGTERM
[2024-06-12 13:49:44,094] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 125 closing signal SIGTERM
[2024-06-12 13:49:44,095] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 126 closing signal SIGTERM
[2024-06-12 13:49:45,026] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 123) of binary: /opt/conda/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.2', 'console_scripts', 'torchrun')())
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-06-12_13:49:44
  host : notebook-f355a3d5-1612-4d34-bf27-53bfb3de17d4-0.notebook-f355a3d5-1612-4d34-bf27-53bfb3de17d4.colossal-ai.svc.cluster.local
Emmm, /root is full.
(pytorch) root@notebook-f355a3d5-1612-4d34-bf27-53bfb3de17d4-0:~/ColossalAI/applications/Colossal-LLaMA# df -h /root
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/nvme-pvc--65c2fb0a--4125--4acd--842a--cd22388cf10a 49G 49G 0 100% /root
I see. You will need more disk space to save checkpoints. If you want to save intermediate results including the optimizer states, you will need even more.
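For future runs, a cheap pre-flight check before saving can fail fast instead of corrupting the checkpoint mid-write (a minimal sketch, not part of Colossal-LLaMA; the helper name and the size threshold below are hypothetical example values):

```python
import shutil

def ensure_free_space(path: str, required_gb: float) -> None:
    """Raise before torch.save() starts, rather than failing mid-write.

    For a sharded fp32 optimizer checkpoint of an 8B model, expect on
    the order of params * 12 bytes (~90 GB) across all shards, so a
    49 GB volume like the one above fills up quickly.
    """
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < required_gb:
        raise RuntimeError(
            f"Only {free_gb:.1f} GB free at {path}, "
            f"but ~{required_gb:.1f} GB is needed for this checkpoint."
        )

# Hypothetical usage before the failing save_optimizer() call:
# ensure_free_space(save_dir, required_gb=90)
```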
Is there an existing issue for this bug?
🐛 Describe the bug
I applied for an H100 instance from Colossal cloud, following this document to run fine-tuning of Llama3: https://cloud.luchentech.com/doc/docs/examples/llama/
I got this CUDA out of memory error.
These are the commands that I use.
Environment