huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Multinode training ValueError: Inconsistent compute device and `device_id` on rank 0: cuda:1 vs cuda:0 #2963

Closed sammed-kamboj closed 1 month ago

sammed-kamboj commented 2 months ago

Hello,

I am trying to fine-tune Llama 3.1 on my custom dataset. I have access to a 2-node cluster with 4 GPUs on each node. I am pretty new to fine-tuning on a multi-node cluster. With whatever info I could find online, I ran the following code:

from accelerate import Accelerator
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
)

# accelerator, training_args, train_dataset and val_dataset are defined
# earlier in the script (not shown here).

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    device_map="auto",
)

model.config.pad_token_id = model.config.eos_token_id
model.train()  # model in training mode (dropout modules are activated)

# enable gradient checkpointing
model.gradient_checkpointing_enable()
model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
)
tokenizer.pad_token = tokenizer.eos_token

my_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = accelerator.prepare(
    Trainer(
        model=model,
        args=training_args,
        data_collator=my_collator,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        tokenizer=tokenizer,
    )
)

trainer.train()

I load my personal dataset above this code and everything runs smoothly, but after loading the model the script returns this error:

rank0: Traceback (most recent call last):
rank0:   File "train.py", line 157, in <module>
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1938, in train
rank0:     return inner_training_loop(
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2085, in _inner_training_loop
rank0:     self.model = self.accelerator.prepare(self.model)
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1311, in prepare
rank0:     result = tuple(
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1312, in <genexpr>
rank0:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1188, in _prepare_one
rank0:     return self.prepare_model(obj, device_placement=device_placement)
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1485, in prepare_model
rank0:     model = FSDP(model, **kwargs)
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 483, in __init__
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 102, in _auto_wrap
rank0:     _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
rank0:     wrapped_child, num_wrapped_params = _recursive_wrap(
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
rank0:     wrapped_child, num_wrapped_params = _recursive_wrap(
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
rank0:     wrapped_child, num_wrapped_params = _recursive_wrap(
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 562, in _recursive_wrap
rank0:     return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 491, in _wrap
rank0:     return wrapper_cls(module, **kwargs)
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 509, in __init__
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 587, in _init_param_handle_from_module
rank0:     state.compute_device = _get_compute_device(
rank0:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 1050, in _get_compute_device
rank0:     raise ValueError(
rank0: ValueError: Inconsistent compute device and `device_id` on rank 0: cuda:1 vs cuda:0

My config file looks like this:

compute_environment: LOCAL_MACHINE
debug: true
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_process_ip: 15.****
main_process_port: 29500
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 8
rdzv_backend: c10d
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

On the other node, everything is the same except that machine_rank is 1.
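
For context, I start the script on each node with accelerate launch, roughly like this (the config file name here is just illustrative):

accelerate launch --config_file fsdp_config.yaml train.py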

On the other node's terminal I am getting the same error:

rank4: Traceback (most recent call last):
rank4:   File "train.py", line 157, in <module>
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1938, in train
rank4:     return inner_training_loop(
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2085, in _inner_training_loop
rank4:     self.model = self.accelerator.prepare(self.model)
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1292, in prepare
rank4:     result = tuple(
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
rank4:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
rank4:     return self.prepare_model(obj, device_placement=device_placement)
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1459, in prepare_model
rank4:     model = FSDP(model, **kwargs)
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 483, in __init__
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 102, in _auto_wrap
rank4:     _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
rank4:     wrapped_child, num_wrapped_params = _recursive_wrap(
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
rank4:     wrapped_child, num_wrapped_params = _recursive_wrap(
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
rank4:     wrapped_child, num_wrapped_params = _recursive_wrap(
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 562, in _recursive_wrap
rank4:     return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 491, in _wrap
rank4:     return wrapper_cls(module, **kwargs)
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 509, in __init__
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 587, in _init_param_handle_from_module
rank4:     state.compute_device = _get_compute_device(
rank4:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 1050, in _get_compute_device
rank4:     raise ValueError(
rank4: ValueError: Inconsistent compute device and `device_id` on rank 4: cuda:1 vs cuda:0

Any suggestions on how to solve it?

muellerzr commented 2 months ago

Just use the Trainer; you don't need to wrap it in accelerator.prepare, since the Trainer already uses Accelerate under the hood. That is likely part of the issue.
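
Something like this (a minimal sketch reusing your existing model, training_args, datasets and collator; no Accelerator object or prepare call needed):

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=my_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
)
trainer.train()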

sammed-kamboj commented 2 months ago

@muellerzr Thank you for your reply! I removed the accelerator.prepare wrapping around the Trainer, but I am still getting the same error. To give more information: I have 2 machines in the cloud with 4 V100 GPUs on each node, and each GPU has 32 GB. My data has a max token length of about 4000, so I am running with a max sequence length of 4096.

  1. Since I am loading Llama 3.1 8B for fine-tuning and using fp16, I thought 2 nodes with 4 GPUs of 32 GB each would be sufficient compute for this setting. (Should I add one more node with 4 GPUs? I can go up to 7-9 nodes with 4 GPUs on each node.)
  2. I believe the problem has something to do with the fsdp_use_orig_params argument in the accelerate config. Whenever I set it to true, I got the above error; when I set it to false, the script threw an out-of-memory error.
  3. To work around it, I tried peft LoRA + fsdp_use_orig_params=true + accelerate, and got the same error again:
     ValueError: Inconsistent compute device and `device_id` on rank 0: cuda:1 vs cuda:0
  4. I tried LoRA + fsdp_use_orig_params=false + accelerate and it threw an OOM error. Isn't that strange, given my 8 GPUs of 32 GB each and that with LoRA only about 10M parameters are trainable? This is what I see in my terminal:
     trainable params: 10,223,616 || all params: 8,040,484,864 || trainable%: 0.1272
  5. I tried setting up DeepSpeed stage 3 in the accelerate config, but again got OOM with fp16. (I had to remove device_map when loading the model, since I got an error saying DeepSpeed is not compatible with device_map.)
  6. I tried LoRA + DeepSpeed launched with accelerate, and I got some NCCL network errors.
  7. Finally, I tried LoRA + fsdp_use_orig_params=false with the following accelerate config, and the model is training. I can see all the GPUs on both nodes being used at up to 18 GB/32 GB, but I still don't understand the original error when fsdp_use_orig_params is true. (The LoRA setup I used is sketched after the config below.)

compute_environment: LOCAL_MACHINE
debug: true
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_min_num_params: 10000000
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_process_ip: 15*****
main_process_port: 29500
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 8
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
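
The LoRA setup itself is applied to the model right after from_pretrained and looks roughly like this (a minimal sketch; the rank, alpha and target modules shown are illustrative, not necessarily the exact values that produced the numbers above):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                 # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # illustrative target modules
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints the "trainable params: ..." line shown above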

wizeng23 commented 1 month ago

I encountered a similar issue trying to fine-tune Llama 3.1 70B using FSDP. I fixed it by setting device_map="cpu" when calling AutoModelForCausalLM.from_pretrained; the FSDP wrap then brings the model onto the GPUs and shards it. If the model is small enough to fit on one GPU (e.g. if you load it in bf16 precision), you could also try device_map=f"cuda:{int(os.environ.get('LOCAL_RANK', 0))}". My guess for the error is that the FSDP wrap doesn't like getting a model that is already sharded across GPUs as input, which is what device_map="auto" does.
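
In your case that would look roughly like this (a sketch; only the from_pretrained call changes, the rest of the script stays the same):

import os

from transformers import AutoModelForCausalLM

local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set by accelerate launch / torchrun

# Option 1: load the full model on CPU and let the FSDP wrap (done inside the
# Trainer via Accelerate) move and shard it onto the right GPU for each rank.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    device_map="cpu",
)

# Option 2 (only if the model fits on a single GPU, e.g. loaded in bf16):
# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Meta-Llama-3.1-8B-Instruct",
#     torch_dtype=torch.bfloat16,  # needs `import torch`
#     device_map=f"cuda:{local_rank}",
# )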

sammed-kamboj commented 1 month ago

@wizeng23 Your suggestion solved my problem! Thank you!