hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: Running the fine-tuning demo hits CUDA Out of Memory on an H800 node in an hpcaitech cloud instance #5798

Closed · hiprince closed this issue 3 months ago

hiprince commented 3 months ago

Is there an existing issue for this bug?

๐Ÿ› Describe the bug

I applied for an H800 instance from the Colossal-AI cloud and followed this document to fine-tune LLaMA-3: https://cloud.luchentech.com/doc/docs/examples/llama/

I got this CUDA out-of-memory error:

 File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 230, in forward
    outputs = run_function(*args)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 917, in custom_forward
    return module(*inputs, past_key_value, output_attentions, padding_mask=padding_mask)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 649, in forward
    hidden_states = self.mlp(hidden_states)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 247, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/root/.local/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 91, in __torch_function__
    ret = func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB. GPU 0 has a total capacty of 79.14 GiB of which 1.42 GiB is free. Process 14109 has 77.71 GiB memory in use. Of the allocated memory 64.54 GiB is allocated by PyTorch, and 10.81 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epoch 0:   0%|          | 0/3 [00:05<?, ?it/s]
[2024-06-11 15:47:31,410] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2339) of binary: /opt/conda/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.2', 'console_scripts', 'torchrun')())
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
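
As a side note, the allocator setting that the error message points to can be exported before launching; a minimal sketch (it only reduces fragmentation of reserved-but-unallocated memory and cannot make up for a real capacity shortfall):

# Optional: apply the allocator hint from the OOM message before running the launch command.
# This limits fragmentation only; it does not add capacity.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128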

These are the commands I used:

TRAIN_HOME=/root
cd $TRAIN_HOME/ColossalAI
BUILD_EXT=1 pip install .

mkdir $TRAIN_HOME/training_outputs
mkdir $TRAIN_HOME/training_outputs/checkpoints
mkdir $TRAIN_HOME/training_outputs/configs
mkdir $TRAIN_HOME/training_outputs/tensorboards

cd $TRAIN_HOME/ColossalAI/applications/Colossal-LLaMA/
echo '127.0.0.1' > hostfile

PROJECT_NAME="LLaMA-3-8B-cpt"
PARENT_SAVE_DIR="${TRAIN_HOME}/training_outputs/checkpoints/" # Path to a folder to save checkpoints
PARENT_TENSORBOARD_DIR="${TRAIN_HOME}/training_outputs/tensorboards/" # Path to a folder to save logs
PARENT_CONFIG_FILE="${TRAIN_HOME}/training_outputs/configs/" # Path to a folder to save training config logs
PRETRAINED_MODEL_PATH="/root/commonData/Meta-Llama-3-8B" # huggingface or local model path

# ไปฅ้ข„็ฝฎๅทฒๅค„็†ๆ•ฐๆฎ้›†ไธบไพ‹
declare -a dataset=(
    /root/commonData/tokenized-cpt-data/arrow/part-00000
    /root/commonData/tokenized-cpt-data/arrow/part-00001
    /root/commonData/tokenized-cpt-data/arrow/part-00002
)

TIMESTAMP=$(date +%Y-%m-%d-%H-%M-%S)
FULL_PROJECT_NAME="${PROJECT_NAME}-${TIMESTAMP}"
SAVE_DIR="${PARENT_SAVE_DIR}${FULL_PROJECT_NAME}"
CONFIG_FILE="${PARENT_CONFIG_FILE}${FULL_PROJECT_NAME}.json"

colossalai run --nproc_per_node 1 --hostfile hostfile --master_port 31312 train.py \
    --pretrained $PRETRAINED_MODEL_PATH \
    --dataset ${dataset[@]} \
    --plugin "zero2" \
    --save_interval 400 \
    --save_dir $SAVE_DIR \
    --tensorboard_dir $PARENT_TENSORBOARD_DIR \
    --config_file $CONFIG_FILE \
    --num_epochs 1 \
    --micro_batch_size 2 \
    --lr 1e-4 \
    --mixed_precision "bf16" \
    --grad_clip 1.0 \
    --weight_decay 0.01 \
    --warmup_steps 100 \
    --use_grad_checkpoint \
    --use_flash_attn
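
For a single-GPU attempt, a lower-memory variant of the same launch is sketched below. The only changes are --micro_batch_size 1 and the plugin; whether this train.py accepts "zero2_cpu" (CPU offload of optimizer states) as a --plugin choice is an assumption, so check its --help first.

# Sketch of a lower-memory variant of the launch above.
# Assumption (not verified here): "zero2_cpu" is a valid --plugin choice for this train.py.
# Dropping --micro_batch_size from 2 to 1 also halves the per-step activation footprint.
colossalai run --nproc_per_node 1 --hostfile hostfile --master_port 31312 train.py \
    --pretrained $PRETRAINED_MODEL_PATH \
    --dataset ${dataset[@]} \
    --plugin "zero2_cpu" \
    --save_interval 400 \
    --save_dir $SAVE_DIR \
    --tensorboard_dir $PARENT_TENSORBOARD_DIR \
    --config_file $CONFIG_FILE \
    --num_epochs 1 \
    --micro_batch_size 1 \
    --lr 1e-4 \
    --mixed_precision "bf16" \
    --grad_clip 1.0 \
    --weight_decay 0.01 \
    --warmup_steps 100 \
    --use_grad_checkpoint \
    --use_flash_attn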

Environment

(environment details were attached as a screenshot)

TongLi3701 commented 3 months ago

Hi,

For full-parameter fine-tuning of the LLaMA-3-8B model, you will need at least 8 GPUs. Thanks.
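
A rough way to see why (assumed layout: 2 bytes/param of bf16 weights replicated on every rank, plus about 14 bytes/param of bf16 gradients, fp32 master weights and Adam moments that ZeRO-2 shards across ranks; activations come on top):

# Back-of-envelope per-GPU memory for full fine-tuning of an 8B-parameter model
# with Adam, bf16 mixed precision and ZeRO-2 sharding (activations not included).
PARAMS=8000000000
for N in 1 4 8; do
    echo "GPUs=$N: ~$(( (PARAMS*2 + PARAMS*14/N) / 1024/1024/1024 )) GiB per GPU + activations"
done

This prints roughly 119, 40 and 27 GiB for 1, 4 and 8 GPUs, which is consistent with a single 80 GB card overflowing once activations are added.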

hiprince commented 3 months ago

Hi,

I tried 4 H800 GPUs and it runs. However, I got this error when saving the final checkpoint. It doesn't sound like a VRAM capacity issue.

Start saving model checkpoint with running states
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/serialization.py", line 619, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/serialization.py", line 853, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:588] . PytorchStreamWriter failed writing file data/1: file write failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/ColossalAI/applications/Colossal-LLaMA/train.py", line 426, in <module>
    main()
  File "/root/ColossalAI/applications/Colossal-LLaMA/train.py", line 386, in main
    save_checkpoint(
  File "/root/ColossalAI/applications/Colossal-LLaMA/colossal_llama/utils/ckpt_io.py", line 56, in save_checkpoint
    booster.save_optimizer(optimizer, os.path.join(save_dir, "optimizer"), shard=True)
  File "/root/.local/lib/python3.9/site-packages/colossalai/booster/booster.py", line 307, in save_optimizer
    self.checkpoint_io.save_optimizer(optimizer, checkpoint, shard, gather_dtensor, prefix, size_per_shard)
  File "/root/.local/lib/python3.9/site-packages/colossalai/checkpoint_io/checkpoint_io_base.py", line 197, in save_optimizer
    self.save_sharded_optimizer(optimizer, checkpoint, gather_dtensor, prefix, size_per_shard)
  File "/root/.local/lib/python3.9/site-packages/colossalai/booster/plugin/low_level_zero_plugin.py", line 140, in save_sharded_optimizer
    save_state_dict(shard, checkpoint_file_path, use_safetensors=False)
  File "/root/.local/lib/python3.9/site-packages/colossalai/checkpoint_io/utils.py", line 328, in save_state_dict
    torch.save(state_dict_cpu, checkpoint_file_path)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/serialization.py", line 620, in save
    return
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/serialization.py", line 466, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:424] . unexpected pos 234881728 vs 234881616
Epoch 0: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 30/30 [02:53<00:00,  5.77s/it, Loss=0.8085]
[2024-06-12 13:49:44,094] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 124 closing signal SIGTERM
[2024-06-12 13:49:44,094] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 125 closing signal SIGTERM
[2024-06-12 13:49:44,095] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 126 closing signal SIGTERM
[2024-06-12 13:49:45,026] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 123) of binary: /opt/conda/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.2', 'console_scripts', 'torchrun')())
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main 
    run(args)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-12_13:49:44
  host      : notebook-f355a3d5-1612-4d34-bf27-53bfb3de17d4-0.notebook-f355a3d5-1612-4d34-bf27-53bfb3de17d4.colossal-ai.svc.cluster.local
hiprince commented 3 months ago

Emmm, the /root filesystem is full.

(pytorch) root@notebook-f355a3d5-1612-4d34-bf27-53bfb3de17d4-0:~/ColossalAI/applications/Colossal-LLaMA# df -h /root
Filesystem                                                      Size  Used Avail Use% Mounted on
/dev/mapper/nvme-pvc--65c2fb0a--4125--4acd--842a--cd22388cf10a   49G   49G     0 100% /root
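
A quick sketch for this situation, using the paths from earlier in the thread: check what is occupying the volume and whether a larger mount is available.

du -sh /root/training_outputs/checkpoints/*   # the half-written checkpoint from the failed save can be deleted
df -h                                         # look for a larger mounted volume to hold future checkpoints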
TongLi3701 commented 3 months ago

I see. You will need more disk space to save checkpoints. If you want to save intermediate states including the optimizer, you will need even more.
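
For scale (back-of-envelope, assuming bf16 weights and fp32 Adam states): the 8B model weights alone are about 15 GiB, and a full optimizer snapshot adds roughly 90 GiB across all shards, so a 49 GB volume cannot hold even one complete save. A hedged workaround is to point the output folders at a larger mount; /data below is a hypothetical path that must exist on your instance.

# Hypothetical workaround: keep training outputs on a larger mounted volume (/data is a placeholder).
# Rough sizes for an 8B model: weights ~ 8e9 * 2 B ~= 15 GiB; fp32 Adam states ~ 8e9 * 12 B ~= 90 GiB.
TRAIN_HOME=/data
mkdir -p $TRAIN_HOME/training_outputs/{checkpoints,configs,tensorboards}
PARENT_SAVE_DIR="${TRAIN_HOME}/training_outputs/checkpoints/"
PARENT_TENSORBOARD_DIR="${TRAIN_HOME}/training_outputs/tensorboards/"
PARENT_CONFIG_FILE="${TRAIN_HOME}/training_outputs/configs/"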