littletomatodonkey opened 1 month ago
I tested on another 8-GPU device and hit the same error.
When I train with batch size = 1 on a single A100 (80 GB), it reports out of memory. Do I need to set other configs? Thanks!
export CUDA_VISIBLE_DEVICES=0
export WANDB_PROJECT=consistency_llm
model_path="/mnt/bn/multimodel/models/official/cllm/GAIR--Abel-7B-001/model"
trajectory_file="data/collected_jacobi_trajectory/my_cleaned_gsm8k_jacobi_max_new_tokens16_augTrue_labels_True_max_seq_len_512.json"
output_path="./output_baseline"
n_token_seq_size=512
torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=101 --rdzv_endpoint='localhost:5666' \
--master_port 10000 \
cllm/train_cllm_global.py \
--target_model_path ${model_path} \
--data_path ${trajectory_file} \
--output_dir ${output_path} \
--max_new_tokens ${n_token_seq_size} \
--bf16 True \
--tf32 True \
--report_to wandb \
--do_train \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing True \
--evaluation_strategy "epoch" \
--save_strategy "steps" \
--save_steps 100 \
--save_total_limit 50 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 10 \
--model_max_length 2048 \
--lazy_preprocess True \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'
Error info
File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/optim/adamw.py", line 173, in step
self._init_group(
File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/optim/adamw.py", line 125, in _init_group
state["exp_avg_sq"] = torch.zeros_like(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacty of 79.35 GiB of which 158.19 MiB is free. Process 2239837 has 79.19 GiB memory in use. Of the allocated memory 78.24 GiB is allocated by PyTorch, and 305.59 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
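A rough back-of-envelope estimate shows why the `exp_avg_sq` allocation is what tips a single 80 GB card over. The numbers below are assumptions (a ~7B-parameter model, bf16 weights and gradients, fp32 AdamW master weights plus two moment buffers; the exact layout depends on the trainer), not measurements:

```python
# Approximate memory for full-parameter AdamW training of a ~7B model
# on one GPU with no sharding. All figures are estimates, before activations.
params = 7e9
weights_bf16 = params * 2          # bf16 model weights
grads_bf16 = params * 2            # bf16 gradients
adamw_fp32 = params * 4 * 3        # fp32 master weights + exp_avg + exp_avg_sq
total_gib = (weights_bf16 + grads_bf16 + adamw_fp32) / 2**30
print(f"~{total_gib:.0f} GiB before activations")  # well above 80 GiB
```

Since the optimizer state alone pushes the total past a single 80 GB card, FSDP needs more than one GPU to shard it across, which matches the resolution later in this thread.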
Hi, thank you for your interest in our work!
I checked your bash script: n_token_seq_size
should be set to 16 (notice that n_token_seq_size
is the sub-sequence length used for Jacobi iteration, while 512 is the max output sequence length used during the Jacobi trajectory collection process; the two arguments are different). Also, the prepared Jacobi dataset you downloaded is formatted to support batch size = 1
training only. For batch_size > 1
, you need to generate your own Jacobi datasets with batch size > 1, or do some data pre-processing on the dataset to train with batch size > 1.
We have also updated the example training script accordingly.
Notice that the provided Jacobi trajectory file reads:
cleaned_gsm8k_jacobi_max_new_tokens16_augTrue_labels_True_max_seq_len_512
which can be interpreted as:
- it has been post-processed to remove repetitive generation content, hence flagged as 'cleaned', and data augmentation is turned on (see the paper's data cleaning section and the Jacobi trajectory generation script)
- n_token_seq_size = 16 (max_new_tokens) used during the Jacobi trajectory collection process
- model_max_length = 512 (max_seq_len) used during the Jacobi trajectory collection process
For the OOM issue, please use more than 1 A100 80G GPU :)
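The interpretation above can be sketched as a small helper that recovers the collection settings from the trajectory filename. This is a hypothetical convenience function, not part of the repo; the regex simply mirrors the naming pattern of the provided file:

```python
import re

# Hypothetical helper: parse collection settings out of a Jacobi
# trajectory filename following the naming pattern shown above.
def parse_trajectory_name(name: str) -> dict:
    m = re.search(
        r"max_new_tokens(\d+)_aug(\w+?)_labels_(\w+?)_max_seq_len_(\d+)", name
    )
    return {
        "n_token_seq_size": int(m.group(1)),  # sub-sequence length for Jacobi iteration
        "aug": m.group(2) == "True",          # data augmentation flag
        "labels": m.group(3) == "True",       # labels flag
        "model_max_length": int(m.group(4)),  # max output length during collection
    }

print(parse_trajectory_name(
    "cleaned_gsm8k_jacobi_max_new_tokens16_augTrue_labels_True_max_seq_len_512"
))
```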
Thanks for your reply! I tried 4-GPU training with n_token_seq_size=16
and it trains normally.
For larger-batch training, I'll take a look. Would you consider providing a script to handle multi-sample training per batch? Thanks!
Dealing with multi-sample training per batch would require some modifications to the Jacobi trajectory preparation script, as well as minor modifications to the data preprocessing in the cllm/train_cllm_global.py
script, or post-processing of the current version of the Jacobi dataset so that each data entry can be collated into a batch (this requires removing redundant dimensionality from the collected token ids, etc.). Feel free to look into it, give it a try, and follow up on this thread. I would love to help out.
If there is enough interest, we will update the scripts accordingly to automate this process.
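The collation step described above might look something like the following. This is only a sketch under stated assumptions (each collected entry carries a redundant leading batch dimension, and PAD_ID stands in for the tokenizer's real pad token id); it is framework-agnostic plain Python, not the repo's actual preprocessing:

```python
# Hypothetical collator sketch for batch_size > 1: drop the redundant
# leading dimension of each collected token-id sequence, then right-pad
# every sequence to the longest one in the batch.
PAD_ID = 0  # assumption: replace with the tokenizer's pad token id

def collate_jacobi(entries):
    # each entry is assumed to look like [[t0, t1, ...]] (an extra batch dim)
    ids = [e[0] for e in entries]  # remove the redundant dimension
    max_len = max(len(t) for t in ids)
    input_ids = [t + [PAD_ID] * (max_len - len(t)) for t in ids]
    attention_mask = [[1] * len(t) + [0] * (max_len - len(t)) for t in ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

For example, collating two entries of lengths 3 and 2 pads the shorter one with PAD_ID and masks the padded position out.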
Hi, thanks for your great work! I want to reproduce the training process but ran into some errors as follows. Could you please help take a look? Thanks!
Training scripts (I only have 4xA100, so the node num is changed to 4 in
train_cllm.sh
). The errors are as follows.