sammed-kamboj closed this issue 1 month ago
Just use the `Trainer`; you don't need to wrap it in `accelerator.prepare`. That is likely part of the issue (the Trainer already uses `accelerate` under the hood).
@muellerzr Thank you for your reply! I removed the wrapping of the Trainer in accelerate, but I still get the same error. To give more information: I have 2 machines in the cloud, each with 4 V100 GPUs of 32 GB. My data has a max token length of 4000, so I am trying to run with a max sequence length of 4096.
- The `fsdp_use_orig_params` argument while setting up the accelerate config: whenever I set it to True, it gave the above error. When I set it to False, the script threw an out-of-memory error.
- LoRA + `fsdp_use_orig_params=True` + `accelerate`: again the same error. `ValueError: Inconsistent compute device and device_id on rank 0: cuda:1 vs cuda:0`
- LoRA + `fsdp_use_orig_params=False` + `accelerate`: it threw an OOM error. It is strange given my 8 GPUs of 32 GB each, and with LoRA the trainable parameters are only 10M. This is what I can see on my terminal: `trainable params: 10,223,616 || all params: 8,040,484,864 || trainable%: 0.1272`
- `deepspeed` stage 3 in the accelerate config: again I was getting OOM for fp16. (I had to remove `device_map` while loading the model, since I got an error saying deepspeed is not compatible with a device map.)
- LoRA + `deepspeed`, launched with `accelerate`: I was getting some NCCL network errors.
- LoRA + `fsdp_use_orig_params=false` with the following accelerate config: the model is training. I can see all the GPUs being used up to 18GB/32GB on both machines, but I am still not sure about the original error when using `fsdp_use_orig_params=true`.
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_min_num_params: 10000000
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_process_ip: 15*****
main_process_port: 29500
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 8
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
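As a rough sanity check on the OOM reports above, the memory needed for the weights alone can be estimated from the parameter counts printed in the terminal. This is a back-of-the-envelope sketch, not a profiler: the helper names are hypothetical, it assumes 2 bytes per parameter in fp16 and 4 in fp32, and it ignores activations, gradients, optimizer state, and CUDA overhead, which at a sequence length of 4096 may well dominate.

```python
# Numbers taken from the terminal output quoted above.
ALL_PARAMS = 8_040_484_864
TRAINABLE_PARAMS = 10_223_616
NUM_GPUS = 8  # 2 nodes x 4 V100s
GIB = 1024 ** 3


def weight_gib(num_params: int, bytes_per_param: int) -> float:
    """Memory for one full copy of the weights, in GiB (weights only)."""
    return num_params * bytes_per_param / GIB


def sharded_weight_gib(num_params: int, bytes_per_param: int, world_size: int) -> float:
    """Per-GPU weight memory if the weights are split evenly across ranks."""
    return weight_gib(num_params, bytes_per_param) / world_size


if __name__ == "__main__":
    print(f"trainable%: {100 * TRAINABLE_PARAMS / ALL_PARAMS:.4f}")
    print(f"full fp32 copy: {weight_gib(ALL_PARAMS, 4):.2f} GiB")
    print(f"full fp16 copy: {weight_gib(ALL_PARAMS, 2):.2f} GiB")
    print(f"fp16 shard per GPU: {sharded_weight_gib(ALL_PARAMS, 2, NUM_GPUS):.2f} GiB")
```

Under these assumptions a single unsharded fp32 copy is close to 30 GiB, nearly a whole 32 GB V100, while an even fp16 shard is under 2 GiB per GPU. So how each process first loads the model may matter as much as the small LoRA parameter count.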
I encountered a similar issue trying to fine-tune Llama 3.1 70B using FSDP. I fixed it by setting `device_map="cpu"` when calling `AutoModelForCausalLM.from_pretrained`. The FSDP wrap then seems to move the model onto the GPUs and shard it. If the model is small enough to fit on one GPU (e.g., if you load it in bf16 precision), you could also try `device_map=f"cuda:{int(os.environ.get('LOCAL_RANK', 0))}"`. My guess is that the FSDP wrap doesn't like receiving a model that is already spread across GPUs, which is what `device_map="auto"` does.
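The two options described above can be sketched as a small helper. This is only an illustration: `pick_device_map` is a hypothetical name, and it assumes the launcher sets the `LOCAL_RANK` environment variable for each process, falling back to 0 when it is missing.

```python
import os


def pick_device_map(fits_on_one_gpu: bool = False) -> str:
    """Return a single-device device_map value instead of "auto".

    "cpu" loads the model on the host and leaves GPU placement to the FSDP
    wrap; "cuda:<LOCAL_RANK>" puts the whole model on this process's own GPU.
    """
    if fits_on_one_gpu:
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        return f"cuda:{local_rank}"
    return "cpu"


if __name__ == "__main__":
    device_map = pick_device_map(fits_on_one_gpu=False)
    print(device_map)
    # In the training script (needs transformers, so it is not run here):
    # model = AutoModelForCausalLM.from_pretrained(
    #     "meta-llama/Meta-Llama-3.1-8B-Instruct", device_map=device_map
    # )
```

Either value keeps the whole model on one device per process, which is the property the fix above relies on.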
@wizeng23 Your suggestion solved my problem! Thank you!
Hello,
I am trying to fine-tune Llama 3.1 on my custom dataset. I have access to a 2-node cluster with 4 GPUs on each node. I am pretty new to fine-tuning on a multi-node cluster. With whatever info I could find online, I ran the following code:
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    device_map="auto",
)
model.config.pad_token_id = model.config.eos_token_id
model.train()  # model in training mode (dropout modules are activated)

# enable gradient checkpointing
model.gradient_checkpointing_enable()
model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
)
tokenizer.pad_token = tokenizer.eos_token

my_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = accelerator.prepare(Trainer(
    model=model,
    args=training_args,
    data_collator=my_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
))
trainer.train()
I load my own dataset above this code and everything runs smoothly, but after the model is loaded the script returns this error:
rank0: Traceback (most recent call last):
rank0:   File "train.py", line 157, in
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1938, in train
rank0:     return inner_training_loop(
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2085, in _inner_training_loop
rank0:     self.model = self.accelerator.prepare(self.model)
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1311, in prepare
rank0:     result = tuple(
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1312, in <genexpr>
rank0:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1188, in _prepare_one
rank0:     return self.prepare_model(obj, device_placement=device_placement)
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1485, in prepare_model
rank0:     model = FSDP(model, **kwargs)
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 483, in __init__
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 102, in _auto_wrap
rank0:     _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
rank0:     wrapped_child, num_wrapped_params = _recursive_wrap(
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
rank0:     wrapped_child, num_wrapped_params = _recursive_wrap(
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
rank0:     wrapped_child, num_wrapped_params = _recursive_wrap(
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 562, in _recursive_wrap
rank0:     return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 491, in _wrap
rank0:     return wrapper_cls(module, **kwargs)
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 509, in __init__
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 587, in _init_param_handle_from_module
rank0:     state.compute_device = _get_compute_device(
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 1050, in _get_compute_device
rank0:     raise ValueError(
rank0: ValueError: Inconsistent compute device and `device_id` on rank 0: cuda:1 vs cuda:0

My config file looks like this:
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_process_ip: 15.****
main_process_port: 29500
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 8
rdzv_backend: c10d
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
On the other node, everything is the same except that machine_rank is 1.
On the other terminal I get the same error:
rank4: Traceback (most recent call last):
rank4:   File "train.py", line 157, in
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1938, in train
rank4:     return inner_training_loop(
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2085, in _inner_training_loop
rank4:     self.model = self.accelerator.prepare(self.model)
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1292, in prepare
rank4:     result = tuple(
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
rank4:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
rank4:     return self.prepare_model(obj, device_placement=device_placement)
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1459, in prepare_model
rank4:     model = FSDP(model, **kwargs)
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 483, in __init__
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 102, in _auto_wrap
rank4:     _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
rank4:     wrapped_child, num_wrapped_params = _recursive_wrap(
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
rank4:     wrapped_child, num_wrapped_params = _recursive_wrap(
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
rank4:     wrapped_child, num_wrapped_params = _recursive_wrap(
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 562, in _recursive_wrap
rank4:     return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 491, in _wrap
rank4:     return wrapper_cls(module, **kwargs)
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 509, in __init__
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 587, in _init_param_handle_from_module
rank4:     state.compute_device = _get_compute_device(
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 1050, in _get_compute_device
rank4:     raise ValueError(
rank4: ValueError: Inconsistent compute device and `device_id` on rank 4: cuda:1 vs cuda:0

Any suggestions on how to solve it?
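
The rank numbers in the two tracebacks can be related to the config with a little arithmetic. The sketch below assumes ranks are numbered contiguously node by node (a common launcher layout, but an assumption here), and the helper names are hypothetical.

```python
GPUS_PER_NODE = 4  # num_processes (8) / num_machines (2) from the config


def global_rank(machine_rank: int, local_rank: int, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Global rank of a process, assuming contiguous per-node numbering."""
    return machine_rank * gpus_per_node + local_rank


def locate(rank: int, gpus_per_node: int = GPUS_PER_NODE) -> tuple:
    """Inverse mapping: (machine_rank, local_rank) for a global rank."""
    return divmod(rank, gpus_per_node)


if __name__ == "__main__":
    for rank in (0, 4):
        machine_rank, local_rank = locate(rank)
        print(f"rank {rank}: machine_rank={machine_rank}, local GPU cuda:{local_rank}")
```

Under that assumption, rank 0 and rank 4 are both the first process on their node, which would be consistent with both errors ending in `cuda:0`.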