TrainingArguments should be initialized before the model. When using ZeRO-3, the from_pretrained method is what initializes and partitions the model across GPUs, so the DeepSpeed configuration has to be in place before that call. Refer to https://huggingface.co/docs/transformers/main/en/deepspeed#zero-configuration
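For reference, a minimal sketch of that ordering (not the poster's actual script): the ZeRO-3 DeepSpeed JSON is given to TrainingArguments first, so that from_pretrained can detect it and shard the weights while loading instead of materializing the full model on one GPU. The file name ds_zero3.json and the reduced argument list are assumptions for illustration.

import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# 1) Create TrainingArguments first; the ZeRO-3 config registered here is what
#    later tells from_pretrained to load and partition the model across ranks.
training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    bf16=True,
    deepspeed="ds_zero3.json",  # hypothetical path to a ZeRO stage-3 config
)

# 2) Only then load the model, so the weights are sharded at load time.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
)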
I also tried initializing TrainingArguments first, like below, but nothing changed.
training_args = TrainingArguments(
    output_dir=output_dir,
    max_steps=max_steps,
    num_train_epochs=num_train_epochs,
    logging_steps=logging_steps,
    eval_steps=eval_steps,
    save_steps=save_steps,
    evaluation_strategy='steps',
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    eval_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=gradient_checkpointing,
    learning_rate=learning_rate,
    lr_scheduler_type=lr_scheduler_type,
    warmup_ratio=warmup_ratio,
    weight_decay=weight_decay,
    # optim=optim,  # "You are using ZeRO with an untested optimizer"
    bf16=bf16,
    remove_unused_columns=remove_unused_columns,
    run_name=run_name,
    report_to=report_to,
    ddp_find_unused_parameters=False,  # RuntimeError: Expected to mark a variable ready only once.
    ddp_timeout=72000,  # RuntimeError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
Well, in my case that works :) so I am not sure whether a certain config causes your bug.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
trainer.train() throws RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!, which is confusing because I thought ZeRO-3 partitions model parameters, so it would be natural to have tensors across different devices (see the device-check sketch at the end of this report).
pretrain.py
zero3_config.yaml
command
accelerate launch --config_file zero3_config.yaml pretrain.py --num_processes=2 --multi_gpu
To be precise, I'm running this command with Kubeflow.
versions
error log
Expected behavior
train the model without the error above
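A hypothetical debugging sketch (not part of the original report) for the point above about partitioning: under ZeRO-3 each rank keeps only its own shard, placed on its own local GPU, so printing the parameter devices per rank right after from_pretrained can show where cuda:0 and cuda:1 start to mix inside one process. The helper name report_param_devices is made up for illustration.

import torch.distributed as dist

def report_param_devices(model):
    # A healthy ZeRO-3 setup prints exactly one device per rank,
    # e.g. rank 1 -> {device(type='cuda', index=1)}.
    rank = dist.get_rank() if dist.is_initialized() else 0
    devices = {p.device for p in model.parameters()}
    print(f"rank {rank}: parameter devices = {devices}")

# Usage inside pretrain.py, right after AutoModelForCausalLM.from_pretrained(...):
# report_param_devices(model)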