littletomatodonkey opened 1 month ago
I tested on another 8-GPU device and hit the same error.
When I train with batch size = 1 on a single A100 (80 GB), it reports out of memory. Do I need to set other configs? Thanks!
export CUDA_VISIBLE_DEVICES=0
export WANDB_PROJECT=consistency_llm
model_path="/mnt/bn/multimodel/models/official/cllm/GAIR--Abel-7B-001/model"
trajectory_file="data/collected_jacobi_trajectory/my_cleaned_gsm8k_jacobi_max_new_tokens16_augTrue_labels_True_max_seq_len_512.json"
output_path="./output_baseline"
n_token_seq_size=512
torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=101 --rdzv_endpoint='localhost:5666' \
--master_port 10000 \
cllm/train_cllm_global.py \
--target_model_path ${model_path} \
--data_path ${trajectory_file} \
--output_dir ${output_path} \
--max_new_tokens ${n_token_seq_size} \
--bf16 True \
--tf32 True \
--report_to wandb \
--do_train \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing True \
--evaluation_strategy "epoch" \
--save_strategy "steps" \
--save_steps 100 \
--save_total_limit 50 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 10 \
--model_max_length 2048 \
--lazy_preprocess True \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'
Error info
File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/optim/adamw.py", line 173, in step
self._init_group(
File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/optim/adamw.py", line 125, in _init_group
state["exp_avg_sq"] = torch.zeros_like(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacty of 79.35 GiB of which 158.19 MiB is free. Process 2239837 has 79.19 GiB memory in use. Of the allocated memory 78.24 GiB is allocated by PyTorch, and 305.59 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
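A rough back-of-envelope estimate shows why the `exp_avg_sq` allocation is what tips a single 80 GB card over. The numbers below are assumptions (a ~7B-parameter model, bf16 weights and gradients, fp32 AdamW master weights plus two moment buffers; the exact layout depends on the trainer), not measurements:

```python
# Approximate memory for full-parameter AdamW training of a ~7B model
# on one GPU with no sharding. All figures are estimates, before activations.
params = 7e9
weights_bf16 = params * 2          # bf16 model weights
grads_bf16 = params * 2            # bf16 gradients
adamw_fp32 = params * 4 * 3        # fp32 master weights + exp_avg + exp_avg_sq
total_gib = (weights_bf16 + grads_bf16 + adamw_fp32) / 2**30
print(f"~{total_gib:.0f} GiB before activations")  # well above 80 GiB
```

Since the optimizer state alone pushes the total past a single 80 GB card, FSDP needs more than one GPU to shard it across, which matches the resolution later in this thread.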
Hi, thank you for your interest in our work!
I checked your bash script: n_token_seq_size
should be set to 16 (notice that n_token_seq_size
is the sub-sequence length used for Jacobi iteration, while 512 is the max output sequence length used during the Jacobi trajectory collection process; the two arguments are different). Also, the prepared Jacobi dataset you downloaded is formatted to support batch size = 1
training only. For batch_size > 1
, you need to generate your own Jacobi datasets with batch size > 1, or do some data pre-processing on the dataset to train with batch size > 1.
We have also updated the example training script accordingly.
Notice that the provided Jacobi trajectory file reads:
cleaned_gsm8k_jacobi_max_new_tokens16_augTrue_labels_True_max_seq_len_512
which can be interpreted as:
- it has been post-processed to remove repetitive generation content, hence flagged as 'cleaned', and data augmentation is turned on (see the paper's data cleaning section and the Jacobi trajectory generation script)
- n_token_seq_size = 16 (max_new_tokens) used during the Jacobi trajectory collection process
- model_max_length = 512 (max_seq_len) used during the Jacobi trajectory collection process
For the OOM issue, please use more than 1 A100 80G GPU :)
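The interpretation above can be sketched as a small helper that recovers the collection settings from the trajectory filename. This is a hypothetical convenience function, not part of the repo; the regex simply mirrors the naming pattern of the provided file:

```python
import re

# Hypothetical helper: parse collection settings out of a Jacobi
# trajectory filename following the naming pattern shown above.
def parse_trajectory_name(name: str) -> dict:
    m = re.search(
        r"max_new_tokens(\d+)_aug(\w+?)_labels_(\w+?)_max_seq_len_(\d+)", name
    )
    return {
        "n_token_seq_size": int(m.group(1)),  # sub-sequence length for Jacobi iteration
        "aug": m.group(2) == "True",          # data augmentation flag
        "labels": m.group(3) == "True",       # labels flag
        "model_max_length": int(m.group(4)),  # max output length during collection
    }

print(parse_trajectory_name(
    "cleaned_gsm8k_jacobi_max_new_tokens16_augTrue_labels_True_max_seq_len_512"
))
```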
Thanks for your reply! I tried 4-GPU training with n_token_seq_size=16
and it trains normally.
For larger-batch training, I'll take a look. Would you consider providing a script to handle multi-sample training per batch? Thanks!
Dealing with multi-sample training per batch would require some modifications to the Jacobi trajectory preparation script, as well as minor modifications to the data preprocessing in the cllm/train_cllm_global.py
script, or post-processing of the current version of the Jacobi dataset so that each data entry can be collated into a batch (this requires removing redundant dimensionality from the collected token ids, etc.). Feel free to look into it, give it a try, and follow up on this thread. I would love to help out.
If there is enough interest, we will update the scripts accordingly to automate this process.
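The collation step described above might look something like the following. This is only a sketch under stated assumptions (each collected entry carries a redundant leading batch dimension, and PAD_ID stands in for the tokenizer's real pad token id); it is framework-agnostic plain Python, not the repo's actual preprocessing:

```python
# Hypothetical collator sketch for batch_size > 1: drop the redundant
# leading dimension of each collected token-id sequence, then right-pad
# every sequence to the longest one in the batch.
PAD_ID = 0  # assumption: replace with the tokenizer's pad token id

def collate_jacobi(entries):
    # each entry is assumed to look like [[t0, t1, ...]] (an extra batch dim)
    ids = [e[0] for e in entries]  # remove the redundant dimension
    max_len = max(len(t) for t in ids)
    input_ids = [t + [PAD_ID] * (max_len - len(t)) for t in ids]
    attention_mask = [[1] * len(t) + [0] * (max_len - len(t)) for t in ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

For example, collating two entries of lengths 3 and 2 pads the shorter one with PAD_ID and masks the padded position out.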
Hi, thanks for your great work! I want to reproduce the training process but ran into some errors as follows. Could you please help take a look? Thanks!
Training scripts (I only have 4xA100, so the node num is changed to 4 in
train_cllm.sh
). The errors are as follows.