pacman100 / LLM-Workshop

LLM Workshop by Sourab Mangrulkar

Problem training with FSDP #18

Open agokrani opened 9 months ago

agokrani commented 9 months ago

When I try to train a model with FSDP, I get the following error.

*** TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union

It happens on this specific line: trainer.model = trainer.model_wrapped = FSDP(trainer.model, **kwargs)

After a bit of debugging it feels like it has something to do with auto_wrap_policy. I am not really sure how to solve this. Do you have any suggestions? It was working fine until a few days ago.
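For context, the auto_wrap_policy is usually built along these lines (a rough sketch, not my exact script; LlamaDecoderLayer is only an example, the real class depends on the model being trained):

    # Rough sketch of a typical auto_wrap_policy setup for FSDP.
    # LlamaDecoderLayer is only an example; use the transformer block class
    # of whatever model is actually being trained.
    import functools

    from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
    from transformers.models.llama.modeling_llama import LlamaDecoderLayer

    auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={LlamaDecoderLayer},
    )

    # The TypeError above is what isinstance() raises when this class set
    # ends up containing something that is not a type, e.g. None:
    # isinstance(module, (LlamaDecoderLayer, None)) -> TypeError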

we1k commented 9 months ago

I have encountered the same problem, but it seems more like a problem with FSDP on a PEFT-wrapped model. When I run run_fsdp it works fine, but as soon as I add a LoRA config I get the same error. @pacman100 could you please look at this problem?
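To be concrete, by "add a LoRA config" I mean wrapping the model with PEFT roughly like this before handing it to FSDP (a minimal sketch; the checkpoint and hyperparameters here are just placeholders):

    # Minimal sketch of the PEFT/LoRA wrapping that triggers the error once
    # the resulting model is passed to FSDP. All values are illustrative.
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example checkpoint

    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    # Without LoRA, FSDP(model, **kwargs) works; with LoRA it fails as above.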

we1k commented 9 months ago

Found out this is caused by the module_wrap_policy function inside FSDP(trainer.model).

The PEFT-wrapped model passes None as the module class variable.

After I manually filter out the None module_class, another error occurs:

 File "train.py", line 197, in main
    trainer.model = trainer.model_wrapped = FSDP(trainer.model, **kwargs)
  File "/home/lzw/miniconda3/envs/Bert/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 487, in __init__
    _init_param_handle_from_module(
  File "/home/lzw/miniconda3/envs/Bert/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 519, in _init_param_handle_from_module
    _init_param_handle_from_params(state, managed_params, fully_sharded_module)
  File "/home/lzw/miniconda3/envs/Bert/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 531, in _init_param_handle_from_params
    handle = FlatParamHandle(
  File "/home/lzw/miniconda3/envs/Bert/lib/python3.8/site-packages/torch/distributed/fsdp/flat_param.py", line 537, in __init__
    self._init_flat_param_and_metadata(
  File "/home/lzw/miniconda3/envs/Bert/lib/python3.8/site-packages/torch/distributed/fsdp/flat_param.py", line 585, in _init_flat_param_and_metadata
    ) = self._validate_tensors_to_flatten(params)
  File "/home/lzw/miniconda3/envs/Bert/lib/python3.8/site-packages/torch/distributed/fsdp/flat_param.py", line 731, in _validate_tensors_to_flatten
    raise ValueError(
ValueError: Must flatten tensors with uniform `requires_grad` when `use_orig_params=False`

I'm not familiar with FSDP, so I still don't know how to figure it out.
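From the error message, my guess is that LoRA leaves the base weights frozen while the adapter weights stay trainable, so a single FSDP flat parameter ends up mixing requires_grad=True and requires_grad=False. One possible direction (just a sketch, I have not verified it against this repo) is to keep the original parameters with use_orig_params=True:

    # Possible workaround (unverified): use_orig_params=True lets FSDP flatten
    # parameters with mixed requires_grad, which is exactly what a LoRA model has.
    # trainer and kwargs are the objects from the training script above.
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    kwargs["use_orig_params"] = True
    trainer.model = trainer.model_wrapped = FSDP(trainer.model, **kwargs)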