huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

ValueError: FlatParameter requires uniform dtype but got torch.float16 and torch.float32 #1620

Open JamesDConley opened 1 year ago

JamesDConley commented 1 year ago

System Info

Please see
https://github.com/huggingface/peft/issues/484#issue-1718704717

Information

Tasks

Reproduction

See https://github.com/huggingface/peft/issues/484

Expected behavior

The training code should be able to handle the FP16 weights selected via `accelerate config`. Apologies for linking everything, but it has all been provided already by the other OP and I've been up too late debugging.

sgugger commented 1 year ago

Please fill the issue templates for the specific bug in Accelerate or close this. There is no point opening an issue in different repos if it's all the same.

JamesDConley commented 1 year ago

> Please fill the issue templates for the specific bug in Accelerate or close this. There is no point opening an issue in different repos if it's all the same.

I linked this because someone else opened an issue in PEFT, but the problem appears to actually be in Accelerate, so I wanted to make sure the right eyes see it.

JamesDConley commented 1 year ago

Here's the full original issue in PEFT

The examples/conditional_generation/peft_lora_seq2seq_accelerate_fsdp.py example appears to be incompatible with FSDP.
Others appear to have noticed the issue as well: https://github.com/h2oai/h2o-llmstudio/issues/98

I've created a stripped down version that I run with the accelerate launcher

accelerate launch train.py

My environment and launcher config:

- `Accelerate` version: 0.18.0
- Platform: Linux-6.1.24-x86_64-with-glibc2.37
- Python version: 3.10.10
- Numpy version: 1.24.3
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: FSDP
        - mixed_precision: fp16
        - use_cpu: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - main_process_ip: 0.0.0.0
        - main_process_port: 8080
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'FULL_STATE_DICT', 'fsdp_transformer_layer_cls_to_wrap': 'T5Block'}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

import torch
from accelerate import Accelerator
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
)

from peft import LoraConfig, TaskType, get_peft_model
from peft.utils.other import fsdp_auto_wrap_policy

def main():
    accelerator = Accelerator()
    model_name_or_path = "t5-base"
    lr = 1e-3
    num_epochs = 1

    peft_config = LoraConfig(
        task_type=TaskType.SEQ_2_SEQ_LM,
        inference_mode=False,
        r=8,
        lora_alpha=32,
        lora_dropout=0.1,
    )
    model = AutoModelForSeq2SeqLM.from_pretrained(
        model_name_or_path,
        torch_dtype=torch.float16,
    )
    model = get_peft_model(model, peft_config)
    accelerator.print(model.print_trainable_parameters())

    AutoTokenizer.from_pretrained(model_name_or_path)

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=(8 * num_epochs),
    )

    if getattr(accelerator.state, "fsdp_plugin", None) is not None:
        accelerator.state.fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(model)

    (
        model,
        optimizer,
        lr_scheduler,
    ) = accelerator.prepare(model, optimizer, lr_scheduler)
    accelerator.print(model)

if __name__ == "__main__":
    main()

The full stacktrace is as follows:

accelerate launch --config_file ./finetune/launcher_configs/accelerate_fsdp_no_offload_config.yaml ./finetune/peft_lora_seq2seq_accelerate_fsdp.py t5-base

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /nix/store/0781hi5c3vb0v7h0s701adqgg4531qib-cuda-home/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /nix/store/0781hi5c3vb0v7h0s701adqgg4531qib-cuda-home/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
trainable params: 884736 || all params: 223788288 || trainable%: 0.3953450861557152
trainable params: 884736 || all params: 223788288 || trainable%: 0.3953450861557152
None
/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/transformers/models/t5/tokenization_t5_fast.py:155: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
 warnings.warn(
/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/transformers/models/t5/tokenization_t5_fast.py:155: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
 warnings.warn(
FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer
Traceback (most recent call last):
 File "/home/markh/text-fine-tuning-experiments/./finetune/peft_lora_seq2seq_accelerate_fsdp.py", line 54, in <module>
   main()
 File "/home/markh/text-fine-tuning-experiments/./finetune/peft_lora_seq2seq_accelerate_fsdp.py", line 49, in main
   ) = accelerator.prepare(model, optimizer, lr_scheduler)
 File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1122, in prepare
   result = tuple(
 File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1123, in <genexpr>
   self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
 File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 977, in _prepare_one
   return self.prepare_model(obj, device_placement=device_placement)
 File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1227, in prepare_model
   model = FSDP(model, **kwargs)
 File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1036, in __init__
   self._auto_wrap(auto_wrap_kwargs, fsdp_kwargs)
 File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1291, in _auto_wrap
   _recursive_wrap(**auto_wrap_kwargs, **fsdp_kwargs)
 File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 403, in _recursive_wrap
   wrapped_child, num_wrapped_params = _recursive_wrap(
 File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 403, in _recursive_wrap
   wrapped_child, num_wrapped_params = _recursive_wrap(
 File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 403, in _recursive_wrap
   wrapped_child, num_wrapped_params = _recursive_wrap(
 [Previous line repeated 2 more times]
 File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 421, in _recursive_wrap
   return _wrap(module, wrapper_cls, **kwargs), num_params
 File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 350, in _wrap
   return wrapper_cls(module, **kwargs)
 File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1079, in __init__
   self._fsdp_wrapped_module = FlattenParamsWrapper(
 File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/flatten_params_wrapper.py", line 103, in __init__
   self._flat_param_handle = FlatParamHandle(params, module, device, config)
 File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/flat_param.py", line 270, in __init__
   self._init_flat_param(params, module)
 File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/flat_param.py", line 330, in _init_flat_param
   raise ValueError(
ValueError: `FlatParameter` requires uniform dtype but got torch.float16 and torch.float32
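
For context on what FSDP is objecting to here: the base model is loaded with torch_dtype=torch.float16, while the LoRA weights that get_peft_model adds are initialized in float32, so the modules FSDP tries to flatten hold a mix of dtypes. A minimal diagnostic sketch (assuming the PEFT-wrapped model is named model) that makes the mix visible before accelerator.prepare() is called:

from collections import Counter

# Count parameters per dtype; with an fp16 base model plus freshly
# initialized fp32 LoRA layers this typically prints two entries, which is
# exactly the mix `FlatParameter` refuses to flatten.
print(Counter(p.dtype for p in model.parameters()))
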
JamesDConley commented 1 year ago

My own example/details

System Info

Relevant Package Versions

torch==2.0.1
torchaudio==2.0.2+cu118
torchvision==0.15.2+cu118
peft==0.3.0
accelerate==0.20.3

Using the base container nvidia/cuda:11.8.0-devel-ubuntu22.04 in Docker on a Linux box with 2x A6000 GPUs running Ubuntu 22.04.

When does this occur?

When using custom scripts

Tasks

My own custom task/dataset, although it fails while preparing the model before training, so the dataset isn't relevant. I've stripped the dataset code out of the minimal example and just pass None for simplicity; it raises the same error either way.

Reproduction

Condensed Script

import torch
import logging

from accelerate import Accelerator
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoTokenizer, AutoModelForCausalLM, AdamW, get_constant_schedule_with_warmup

logging.basicConfig(level=logging.INFO, format="%(asctime)s.%(msecs)03d %(levelname)s: %(message)s", datefmt="%Y-%m-%d %H:%M:%S",)
logger = logging.getLogger(__name__)
accelerator = Accelerator()

BASE_MODEL = "tiiuae/falcon-7b"
USE_GRADIENT_CHECKPOINTING = True
TARGET_MODULES = ["query_key_value"]

# Setup Tokenizer
logger.info("Setting up Tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, cache_dir="/app/models/hface_cache", use_auth_token=None)
tokenizer.pad_token = tokenizer.eos_token

# Setup Model
logger.info("Setting up Model...")
model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL, cache_dir="/app/models/hface_cache", use_cache=False, torch_dtype=torch.float16, use_auth_token=None, trust_remote_code=True)

# Setup PEFT
logger.info("Setting up PEFT Config...")
peft_config = LoraConfig(
            task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1, target_modules=TARGET_MODULES
        )

model.enable_input_require_grads()
logger.info("Converting Model to PEFT...")
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
if USE_GRADIENT_CHECKPOINTING:
    logger.info("Enabling Gradient Checkpointing...")
    model.gradient_checkpointing_enable()

_ = model.train()

# Prepare for training
# Setup optimizer & learning rate scheduler
opt = AdamW(model.parameters(), lr=0.0001)
scheduler = get_constant_schedule_with_warmup(opt, num_warmup_steps=100)

# Accelerate Components
logger.info("Wrapping objects with accelerate...")
model, opt, _, scheduler = accelerator.prepare(
     model, opt, None, scheduler
)
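
The same dtype mix is present here: the Falcon base weights are loaded in float16 while the LoRA parameters PEFT adds are float32. A hedged workaround sketch (an inference from the error, not an official fix): give FSDP a uniform dtype before accelerator.prepare(), either by dropping torch_dtype=torch.float16 so everything stays float32 and letting fp16 mixed precision handle the compute dtype, or by casting the stragglers down:

import torch

# Cast any remaining float32 parameters (the LoRA adapters) to float16 so
# every FlatParameter FSDP builds has a single dtype. Training LoRA weights
# in pure fp16 can be numerically delicate, so treat this as a workaround.
for param in model.parameters():
    if param.dtype == torch.float32:
        param.data = param.data.to(torch.float16)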

Full terminal output with script

root@805a1946b2f3:/app# accelerate config
In which compute environment are you running?
This machine
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]:
Do you wish to optimize your script with torch dynamo?[yes/NO]:
Do you want to use DeepSpeed? [yes/NO]:
Do you want to use FullyShardedDataParallel? [yes/NO]: yes
What should be your sharding strategy?
FULL_SHARD
Do you want to offload parameters and gradients to CPU? [yes/NO]: yes
What should be your auto wrap policy?
TRANSFORMER_BASED_WRAP
Specify the comma-separated list of transformer layer class names (case-sensitive) to wrap, e.g. `BertLayer`, `GPTJBlock`, `T5Block`, `BertLayer,BertEmbeddings,BertSelfOutput` ...? : DecoderLayer
What should be your FSDP's backward prefetch policy?
BACKWARD_PRE
What should be your FSDP's state dict type?
FULL_STATE_DICT
How many GPU(s) should be used for distributed training? [1]:2
Do you wish to use FP16 or BF16 (mixed precision)?
fp16
accelerate configuration saved at /root/.cache/huggingface/accelerate/default_config.yaml
root@805a1946b2f3:/app# accelerate launch src/minimal_example.py                                                                         
2023-06-22 03:50:07.968 INFO: Created a temporary directory at /tmp/tmpe4jhd31z                                                          
2023-06-22 03:50:07.968 INFO: Created a temporary directory at /tmp/tmpcrhlvm3j                                                          
2023-06-22 03:50:07.968 INFO: Writing /tmp/tmpe4jhd31z/_remote_module_non_scriptable.py
2023-06-22 03:50:07.968 INFO: Writing /tmp/tmpcrhlvm3j/_remote_module_non_scriptable.py
2023-06-22 03:50:08.031 INFO: Added key: store_based_barrier_key:1 to store for rank: 1
2023-06-22 03:50:08.031 INFO: Added key: store_based_barrier_key:1 to store for rank: 0
2023-06-22 03:50:08.031 INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-06-22 03:50:08.031 INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-06-22 03:50:08.048 INFO: Setting up Tokenizer...
2023-06-22 03:50:08.048 INFO: Setting up Tokenizer...
2023-06-22 03:50:08.289 INFO: Setting up Model...
2023-06-22 03:50:08.290 INFO: Setting up Model...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00,  6.07s/it]
2023-06-22 03:52:19.813 INFO: Setting up PEFT Config...
2023-06-22 03:52:19.813 INFO: Converting Model to PEFT...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00,  6.18s/it]
2023-06-22 03:52:22.155 INFO: Setting up PEFT Config...
2023-06-22 03:52:22.156 INFO: Converting Model to PEFT...
trainable params: 2359296 || all params: 6924080000 || trainable%: 0.03407378308742822
2023-06-22 03:52:25.845 INFO: Enabling Gradient Checkpointing...
/usr/local/lib/python3.9/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
2023-06-22 03:52:25.849 INFO: Wrapping objects with accelerate...
Traceback (most recent call last):
  File "/app/src/minimal_example.py", line 87, in <module>
    model, opt, _, scheduler = accelerator.prepare(
  File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1182, in prepare
    result = tuple(
  File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1183, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1022, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1300, in prepare_model
    model = FSDP(model, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 391, in __init__
    _auto_wrap(auto_wrap_kwargs, fsdp_kwargs, FullyShardedDataParallel)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 73, in _auto_wrap
    _recursive_wrap(**auto_wrap_kwargs, **fsdp_kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  [Previous line repeated 2 more times]
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 388, in _recursive_wrap
    return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 317, in _wrap
    return wrapper_cls(module, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 408, in __init__
    _init_param_handle_from_module(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/_init_utils.py", line 429, in _init_param_handle_from_module
    _init_param_handle_from_params(state, managed_params, fully_sharded_module)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/_init_utils.py", line 525, in _init_param_handle_from_params
    handle = FlatParamHandle(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/flat_param.py", line 366, in __init__
    self._init_flat_param(params, fully_sharded_module, use_orig_params)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/flat_param.py", line 430, in _init_flat_param
    raise ValueError(
ValueError: `FlatParameter` requires uniform dtype but got torch.float16 and torch.float32
trainable params: 2359296 || all params: 6924080000 || trainable%: 0.03407378308742822
2023-06-22 03:52:28.238 INFO: Enabling Gradient Checkpointing...
/usr/local/lib/python3.9/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
2023-06-22 03:52:28.242 INFO: Wrapping objects with accelerate...
2023-06-22 03:52:28.242 WARNING: FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer
Traceback (most recent call last):
  File "/app/src/minimal_example.py", line 87, in <module>
    model, opt, _, scheduler = accelerator.prepare(
  File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1182, in prepare
    result = tuple(
  File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1183, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1022, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1300, in prepare_model
    model = FSDP(model, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 391, in __init__
    _auto_wrap(auto_wrap_kwargs, fsdp_kwargs, FullyShardedDataParallel)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 73, in _auto_wrap
    _recursive_wrap(**auto_wrap_kwargs, **fsdp_kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  [Previous line repeated 2 more times]
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 388, in _recursive_wrap
    return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 317, in _wrap
    return wrapper_cls(module, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 408, in __init__
    _init_param_handle_from_module(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/_init_utils.py", line 429, in _init_param_handle_from_module
    _init_param_handle_from_params(state, managed_params, fully_sharded_module)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/_init_utils.py", line 525, in _init_param_handle_from_params
    handle = FlatParamHandle(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/flat_param.py", line 366, in __init__
    self._init_flat_param(params, fully_sharded_module, use_orig_params)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/flat_param.py", line 430, in _init_flat_param
    raise ValueError(
ValueError: `FlatParameter` requires uniform dtype but got torch.float16 and torch.float32
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2139) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 928, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 627, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
src/minimal_example.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-06-22_03:52:31
  host      : 805a1946b2f3
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2140)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-22_03:52:31
  host      : 805a1946b2f3
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2139)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
root@805a1946b2f3:/app# 

Expected behavior

The model is loaded with FSDP across the 2 GPUs without crashing

pacman100 commented 1 year ago

Hello, FSDP with PEFT isn't leading to any memory savings when compared to plain PyTorch; see https://github.com/pytorch/pytorch/issues/91165#issuecomment-160080533. It also shows how to use FSDP with PEFT nonetheless.

JamesDConley commented 1 year ago

> Hello, FSDP with PEFT isn't leading to any memory savings when compared to plain PyTorch; see pytorch/pytorch#91165 (comment). It also shows how to use FSDP with PEFT nonetheless.

Thanks for the heads up. I tested with DeepSpeed ZeRO-3 last night and managed to get Falcon-40B training working, so I'll continue with that instead 👍
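
For anyone taking the same route, a minimal sketch of that DeepSpeed ZeRO-3 setup via Accelerate's plugin API (the values are placeholders, not the commenter's actual config; the same thing can be configured through `accelerate config` instead of in code):

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# ZeRO stage 3 shards parameters, gradients and optimizer states across
# ranks, which is what lets a Falcon-40B fine-tune fit in GPU memory.
ds_plugin = DeepSpeedPlugin(zero_stage=3, gradient_accumulation_steps=1)
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=ds_plugin)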

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

rfan-debug commented 9 months ago

I got the same error even after pre-casting all modules' parameters to be torch.float16. Any update on this issue?

g-h-chen commented 9 months ago

> Hello, FSDP with PEFT isn't leading to any memory savings when compared to plain PyTorch; see pytorch/pytorch#91165 (comment). It also shows how to use FSDP with PEFT nonetheless.
>
> Thanks for the heads up. I tested with DeepSpeed ZeRO-3 last night and managed to get Falcon-40B training working, so I'll continue with that instead 👍

Hi bro, how did you fix that? I'm still stuck with the error `ValueError: FlatParameter requires uniform dtype but got torch.float16 and torch.float32`.

SicariusSicariiStuff commented 6 months ago

Same issue here; it seems FSDP isn't playing nicely with PEFT.

vikram71198 commented 5 months ago

The error message I see is slightly different:

ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32

But I think it's the same issue other folks here are facing. This happens when I use the `fsdp` & `fsdp_config` params in `TrainingArguments`, so I'm not explicitly using Accelerate, but it is being used under the hood nevertheless.

jacquesqiao commented 4 months ago

Setting FSDP_CPU_RAM_EFFICIENT_LOADING=1 solves the problem...
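
For reference, this is an environment variable read when Accelerate sets up its FSDP plugin, so it generally has to be visible before the model is loaded, whether exported in the shell, prefixed to the launch command, or set in-process at the very top of the script. A minimal in-process sketch:

import os

# Must run before Accelerator() is created and before from_pretrained(),
# otherwise the flag is read too late to change how the weights are loaded.
os.environ["FSDP_CPU_RAM_EFFICIENT_LOADING"] = "1"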

aasthavar commented 4 months ago

I tried launching the script with FSDP_CPU_RAM_EFFICIENT_LOADING=1 but it didn't work; I'm still having the same issue.

This is the blog I am following.

My command: FSDP_CPU_RAM_EFFICIENT_LOADING=1 torchrun --nproc_per_node=4 run_fsdp_qlora.py --config config.yaml

These are the libraries:

%pip install --quiet \
    "torch==2.2.2" tensorboard 

# Install Hugging Face libraries
%pip install  --upgrade --quiet \
    "transformers==4.40.0" "datasets==2.18.0" "accelerate==0.29.3" "evaluate==0.4.1" "bitsandbytes==0.43.1" "huggingface_hub==0.22.2" "trl==0.8.6" "peft==0.10.0"

Any suggestions on how to solve or further investigate this issue? Is there a specific library version I'm missing?

muellerzr commented 3 months ago

Reopening, as I came across this myself. Correct me if I'm wrong: have we enabled any max_grad_norm?

muellerzr commented 3 months ago

Setting it to 0 manually "fixes" this, as the issue comes from doing FSDP + grad norm. I'll check with the PyTorch team to see what we can do to fix this.
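
If you hit this through transformers' Trainer, the knob in question is TrainingArguments.max_grad_norm, which defaults to 1.0 and therefore enables gradient clipping. A hedged sketch of the workaround described above (the other values are placeholders):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",             # placeholder
    bf16=True,
    fsdp="full_shard auto_wrap",  # FSDP via the Trainer, as in the reports above
    max_grad_norm=0.0,            # 0 disables clip_grad_norm_, sidestepping the FSDP + grad-norm path
)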

HeenaRajan commented 3 months ago

Hi @muellerzr, did you find any solution? I am also facing the same issue. I am using accelerate==0.30.1 and no max_grad_norm.

tle211212 commented 3 months ago

> Hi @muellerzr, did you find any solution? I am also facing the same issue. I am using accelerate==0.30.1 and no max_grad_norm.

Not sure if it is the same issue. In my case, I used the sample code created by Schmid (https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/fsdp-qlora-distributed-llama3.ipynb)

When I used a newer transformers lib (>= 4.41.0), I encountered the error. Looking at the changes between 4.40.2 and 4.41.0, I found this changeset: https://github.com/huggingface/transformers/commit/f16caf44bb1606652ac6c7c4ad4bf44973d4e545.

Then I was able to make the code work again by adding "cpu_ram_efficient_loading" to the fsdp_config, i.e.:

fsdp_config:
  backward_prefetch: "backward_pre"
  forward_prefetch: "false"
  use_orig_params: "false"
  cpu_ram_efficient_loading: "true" ## NEWLY ADDED
  sync_module_states: "true"
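
The same switch exists on the Accelerate side. A hedged sketch of the programmatic equivalent (field names follow recent accelerate releases, where cpu_ram_efficient_loading, sync_module_states and use_orig_params are plugin options):

from accelerate import Accelerator
from accelerate.utils import FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    cpu_ram_efficient_loading=True,  # only rank 0 loads real weights; other ranks start on meta
    sync_module_states=True,         # rank 0 broadcasts its weights before FSDP flattens them
    use_orig_params=False,
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)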

HeenaRajan commented 2 months ago

Hi @tle211212, thanks for your suggestion. I tried setting "cpu_ram_efficient_loading" to true in fsdp_config and no longer get the tensor dtype mismatch error.

I am using transformers==4.42.4 and torch==2.3.1.

fsdp_config:
  backward_prefetch: "backward_pre"
  forward_prefetch: "false"
  use_orig_params: "false"
  limit_all_gathers: "true"
  sync_module_states: "true"
  cpu_ram_efficient_loading: "true"

However, I am getting another error: output tensor size must be equal to world_size times input tensor size

Command: ACCELERATE_USE_FSDP=1 CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=1 --nproc_per_node=4 model.py --config mistral_qlora_fsdp.yaml

 File "/home/jupyter/model_phil.py", line 189, in <module>
[rank2]:     training_function(script_args, training_args)
[rank2]:   File "/home/jupyter/model_phil.py", line 169, in training_function
[rank2]:     trainer.train()
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 361, in train
[rank2]:     output = super().train(*args, **kwargs)
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
[rank2]:     return inner_training_loop(
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2345, in _inner_training_loop
[rank2]:     self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2796, in _maybe_log_save_evaluate
[rank2]:     self._save_checkpoint(model, trial, metrics=metrics)
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2879, in _save_checkpoint
[rank2]:     self._save_optimizer_and_scheduler(output_dir)
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2990, in _save_optimizer_and_scheduler
[rank2]:     save_fsdp_optimizer(
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/fsdp_utils.py", line 157, in save_fsdp_optimizer
[rank2]:     optim_state = FSDP.optim_state_dict(model, optimizer)
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1840, in optim_state_dict
[rank2]:     return FullyShardedDataParallel._optim_state_dict_impl(
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1263, in _optim_state_dict_impl
[rank2]:     return _optim_state_dict(
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank2]:     return func(*args, **kwargs)
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1971, in _optim_state_dict
[rank2]:     fsdp_osd_state = convert_fn(
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1834, in _convert_state_with_flat_params
[rank2]:     unflat_state = _unflatten_optim_state(
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 160, in _unflatten_optim_state
[rank2]:     consolidated_state = _communicate_optim_state(
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 239, in _communicate_optim_state
[rank2]:     dist.all_gather_into_tensor(
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank2]:     return func(*args, **kwargs)
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2948, in all_gather_into_tensor
[rank2]:     work = group._allgather_base(output_tensor, input_tensor, opts)
[rank2]: ValueError: output tensor size must be equal to world_size times input tensor size

Any solution/suggestion to fix this? Thanks.

tengerye commented 2 months ago

Sad to see this bug has not been fixed after more than a year.

github-actions[bot] commented 4 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

thusinh1969 commented 1 week ago

Not stale. The bug is still there. FSDP is a great tool for large and long-context training; please fix it. All the latest libs are installed.

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

ERROR

File "/root/yarn/finetune.py", line 525, in <module>
[rank0]:     main(args.parse_args())
[rank0]:   File "/root/yarn/finetune.py", line 367, in main
[rank0]:     model = accelerator.prepare(model)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1326, in prepare
[rank0]:     result = tuple(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1327, in <genexpr>
[rank0]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1200, in _prepare_one
[rank0]:     return self.prepare_model(obj, device_placement=device_placement)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1484, in prepare_model
[rank0]:     model = FSDP(model, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 483, in __init__
[rank0]:     _auto_wrap(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 102, in _auto_wrap
[rank0]:     _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
[rank0]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
[rank0]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
[rank0]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]:   [Previous line repeated 2 more times]
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 562, in _recursive_wrap
[rank0]:     return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 491, in _wrap
[rank0]:     return wrapper_cls(module, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 509, in __init__
[rank0]:     _init_param_handle_from_module(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 603, in _init_param_handle_from_module
[rank0]:     _init_param_handle_from_params(state, managed_params, fully_sharded_module)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 615, in _init_param_handle_from_params
[rank0]:     handle = FlatParamHandle(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 583, in __init__
[rank0]:     self._init_flat_param_and_metadata(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 633, in _init_flat_param_and_metadata
[rank0]:     ) = self._validate_tensors_to_flatten(params)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 771, in _validate_tensors_to_flatten
[rank0]:     raise ValueError(
[rank0]: ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32

Thanks, Steve
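
A per-parameter variant of the earlier dtype check can help narrow down configs like the one above (bf16 mixed precision): it lists which modules are still in float32 right before FSDP wraps the model. A sketch, assuming the model object is named model:

import torch

# Print every parameter that is not bf16; with a PEFT model the offenders
# are usually the adapter weights or any layers deliberately kept in full
# precision at load time.
for name, param in model.named_parameters():
    if param.dtype != torch.bfloat16:
        print(name, param.dtype)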

nivibilla commented 2 days ago

+1

Tested FSDP with QLoRA on Qwen 7B using the accelerate launcher.

Launching training on 8 GPUs.
2024-10-09 14:21:18.291308: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-09 14:21:18.351577: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-09 14:21:18.353145: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-09 14:21:18.353147: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-09 14:21:18.385038: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-09 14:21:18.387633: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-09 14:21:18.387653: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-09 14:21:18.428540: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
When using FSDP full shard, instead of using `gradient_checkpointing` in TrainingArguments, please use `activation_checkpointing` in `fsdp_config`. The former introduces a redundant AllGather operation in backward pass. Reference: https://github.com/huggingface/transformers/issues/30404
When using FSDP full shard, instead of using `gradient_checkpointing` in TrainingArguments, please use `activation_checkpointing` in `fsdp_config`. The former introduces a redundant AllGather operation in backward pass. Reference: https://github.com/huggingface/transformers/issues/30404
When using FSDP full shard, instead of using `gradient_checkpointing` in TrainingArguments, please use `activation_checkpointing` in `fsdp_config`. The former introduces a redundant AllGather operation in backward pass. Reference: https://github.com/huggingface/transformers/issues/30404
When using FSDP full shard, instead of using `gradient_checkpointing` in TrainingArguments, please use `activation_checkpointing` in `fsdp_config`. The former introduces a redundant AllGather operation in backward pass. Reference: https://github.com/huggingface/transformers/issues/30404
When using FSDP full shard, instead of using `gradient_checkpointing` in TrainingArguments, please use `activation_checkpointing` in `fsdp_config`. The former introduces a redundant AllGather operation in backward pass. Reference: https://github.com/huggingface/transformers/issues/30404
When using FSDP full shard, instead of using `gradient_checkpointing` in TrainingArguments, please use `activation_checkpointing` in `fsdp_config`. The former introduces a redundant AllGather operation in backward pass. Reference: https://github.com/huggingface/transformers/issues/30404
When using FSDP full shard, instead of using `gradient_checkpointing` in TrainingArguments, please use `activation_checkpointing` in `fsdp_config`. The former introduces a redundant AllGather operation in backward pass. Reference: https://github.com/huggingface/transformers/issues/30404
When using FSDP full shard, instead of using `gradient_checkpointing` in TrainingArguments, please use `activation_checkpointing` in `fsdp_config`. The former introduces a redundant AllGather operation in backward pass. Reference: https://github.com/huggingface/transformers/issues/30404

WARNING:root:No handler for 1e2b71a0a97c482db4ccfc57a77d2fcc
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
[2024-10-09 14:22:01,370] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-09 14:22:01,390] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-09 14:22:01,396] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-09 14:22:01,397] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-09 14:22:01,397] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-09 14:22:01,422] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
[2024-10-09 14:22:01,520] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH

 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[2024-10-09 14:22:01,636] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/usr/bin/ld: cannot find -laio: No such file or directory
/usr/bin/ld: cannot find -laio: No such file or directory
/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status
/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible

 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
W1009 14:22:02.871075 140470744776704 torch/multiprocessing/spawn.py:145] Terminating process 63148 via signal SIGTERM
W1009 14:22:02.872952 140470744776704 torch/multiprocessing/spawn.py:145] Terminating process 63150 via signal SIGTERM
W1009 14:22:02.873525 140470744776704 torch/multiprocessing/spawn.py:145] Terminating process 63151 via signal SIGTERM
W1009 14:22:02.873933 140470744776704 torch/multiprocessing/spawn.py:145] Terminating process 63152 via signal SIGTERM
W1009 14:22:02.874345 140470744776704 torch/multiprocessing/spawn.py:145] Terminating process 63153 via signal SIGTERM
W1009 14:22:02.874789 140470744776704 torch/multiprocessing/spawn.py:145] Terminating process 63154 via signal SIGTERM
W1009 14:22:02.875156 140470744776704 torch/multiprocessing/spawn.py:145] Terminating process 63155 via signal SIGTERM
W1009 14:22:32.908281 140470744776704 torch/multiprocessing/spawn.py:153] Unable to shutdown process 63148 via SIGTERM , forcefully exiting via SIGKILL
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] failed (exitcode: 1) local_rank: 1 (pid: 63149) of fn: fsdp_train (start_method: fork)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] Traceback (most recent call last):
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 656, in _poll
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     self._pc.join(-1)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 188, in join
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     raise ProcessRaisedException(msg, error_index, failed_process.pid)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] torch.multiprocessing.spawn.ProcessRaisedException: 
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] 
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] -- Process 1 terminated with the following error:
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] Traceback (most recent call last):
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     fn(i, *args)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 580, in _wrap
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     ret = record(fn)(*args_)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]           ^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     return f(*args, **kwargs)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]            ^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/root/.ipykernel/62919/command-1189292846020885-496091015", line 108, in fsdp_train
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     trainer.train()
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-526a0c7c-9fb3-498c-be7d-39bbf80f2668/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 434, in train
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     output = super().train(*args, **kwargs)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/mlflow/utils/autologging_utils/safety.py", line 460, in safe_patch_function
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     return original(*args, **kwargs)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]            ^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python_shell/dbruntime/huggingface_patches/transformers.py", line 54, in patched_fit_function
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     model = original_method(self, *args, **kwargs)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/transformers/trainer.py", line 2052, in train
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     return inner_training_loop(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]            ^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/transformers/trainer.py", line 2194, in _inner_training_loop
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     self.model = self.accelerator.prepare(self.model)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/accelerate/accelerator.py", line 1326, in prepare
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     result = tuple(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]              ^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/accelerate/accelerator.py", line 1327, in <genexpr>
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/accelerate/accelerator.py", line 1200, in _prepare_one
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     return self.prepare_model(obj, device_placement=device_placement)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/accelerate/accelerator.py", line 1484, in prepare_model
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     model = FSDP(model, **kwargs)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]             ^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 485, in __init__
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     _auto_wrap(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     wrapped_child, num_wrapped_params = _recursive_wrap(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]                                         ^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     wrapped_child, num_wrapped_params = _recursive_wrap(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]                                         ^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     wrapped_child, num_wrapped_params = _recursive_wrap(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]                                         ^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   [Previous line repeated 2 more times]
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     return wrapper_cls(module, **kwargs)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 511, in __init__
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     _init_param_handle_from_module(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_init_utils.py", line 598, in _init_param_handle_from_module
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     _init_param_handle_from_params(state, managed_params, fully_sharded_module)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_init_utils.py", line 610, in _init_param_handle_from_params
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     handle = FlatParamHandle(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]              ^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 582, in __init__
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     self._init_flat_param_and_metadata(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 632, in _init_flat_param_and_metadata
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     ) = self._validate_tensors_to_flatten(params)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 770, in _validate_tensors_to_flatten
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]     raise ValueError(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] 
ChildFailedError: 
============================================================
fsdp_train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-09_14:22:02
  host      : 0823-062625-tzq5t3e1-10-168-70-23
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 63149)
  error_file: /tmp/torchelastic_db2i3dhw/none_gw9dnnic/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/databricks/python/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
      return f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
    File "/root/.ipykernel/62919/command-1189292846020885-496091015", line 108, in fsdp_train
      trainer.train()
    File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-526a0c7c-9fb3-498c-be7d-39bbf80f2668/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 434, in train
      output = super().train(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/databricks/python/lib/python3.11/site-packages/mlflow/utils/autologging_utils/safety.py", line 460, in safe_patch_function
      return original(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/databricks/python_shell/dbruntime/huggingface_patches/transformers.py", line 54, in patched_fit_function
      model = original_method(self, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/transformers/trainer.py", line 2052, in train
      return inner_training_loop(
             ^^^^^^^^^^^^^^^^^^^^
    File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/transformers/trainer.py", line 2194, in _inner_training_loop
      self.model = self.accelerator.prepare(self.model)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/accelerate/accelerator.py", line 1326, in prepare
      result = tuple(
               ^^^^^^
    File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/accelerate/accelerator.py", line 1327, in <genexpr>
      self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/accelerate/accelerator.py", line 1200, in _prepare_one
      return self.prepare_model(obj, device_placement=device_placement)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/accelerate/accelerator.py", line 1484, in prepare_model
      model = FSDP(model, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
    File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 485, in __init__
      _auto_wrap(
    File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
      _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
      wrapped_child, num_wrapped_params = _recursive_wrap(
                                          ^^^^^^^^^^^^^^^^
    File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
      wrapped_child, num_wrapped_params = _recursive_wrap(
                                          ^^^^^^^^^^^^^^^^
    File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
      wrapped_child, num_wrapped_params = _recursive_wrap(
                                          ^^^^^^^^^^^^^^^^
    [Previous line repeated 2 more times]
    File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
      return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
      return wrapper_cls(module, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 511, in __init__
      _init_param_handle_from_module(
    File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_init_utils.py", line 598, in _init_param_handle_from_module
      _init_param_handle_from_params(state, managed_params, fully_sharded_module)
    File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_init_utils.py", line 610, in _init_param_handle_from_params
      handle = FlatParamHandle(
               ^^^^^^^^^^^^^^^^
    File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 582, in __init__
      self._init_flat_param_and_metadata(
    File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 632, in _init_flat_param_and_metadata
      ) = self._validate_tensors_to_flatten(params)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 770, in _validate_tensors_to_flatten
      raise ValueError(
  ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32

============================================================
File <command-1189292846020892>, line 6
      3 os.environ["ACCELERATE_USE_FSDP"] = '1'
      4 os.environ["FSDP_CPU_RAM_EFFICIENT_LOADING"] = '1'
----> 6 notebook_launcher(fsdp_train, num_processes=8, mixed_precision='bf16', use_port='12345')
File /databricks/python/lib/python3.11/site-packages/torch/distributed/launcher/api.py:263, in launch_agent(config, entrypoint, args)
    256     events.record(agent.get_event_succeeded())
    258     if result.is_failed():
    259         # ChildFailedError is treated specially by @record
    260         # if the error files for the failed children exist
    261         # @record will copy the first error (root cause)
    262         # to the error file of the launcher process.
--> 263         raise ChildFailedError(
    264             name=entrypoint_name,
    265             failures=result.failures,
    266         )
    268     return result.return_values
    269 except ChildFailedError:
BenjaminBossan commented 1 day ago

@thusinh1969 are you also using LoRA/QLoRA or normal fine-tuning?

@nivibilla Could you please show your train script, or at the very least how the base model and PEFT model are initialized?

nivibilla commented 1 day ago

@BenjaminBossan sure

def fsdp_train():
    from dataclasses import dataclass

    import datasets
    import torch
    import transformers
    from trl import SFTConfig, SFTTrainer
    from peft import LoraConfig, TaskType, get_peft_model

    import json
    import os
    os.environ["ACCELERATE_USE_FSDP"] = '1'
    os.environ["FSDP_CPU_RAM_EFFICIENT_LOADING"] = '1'

    with open('/local_disk0/training_config.json') as f:
        training_config = json.load(f)

    # # testing memory usage for batch size
    training_config['max_steps'] = 50
    # training_config['per_device_train_batch_size'] = 32
    # print(json.dumps(training_config, indent=4))

    tokenizer = transformers.AutoTokenizer.from_pretrained(
        training_config['model_name_or_path'],
        padding_side="left",
        truncation_side="left",
    )
    tokenizer.pad_token = tokenizer.eos_token

    train_dataset = datasets.load_from_disk('/local_disk0/train')

    # Model    
    torch_dtype = torch.bfloat16
    quant_storage_dtype = torch.bfloat16

    quantization_config = transformers.BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch_dtype,
            bnb_4bit_quant_storage=quant_storage_dtype,
        )

    model = transformers.AutoModelForCausalLM.from_pretrained(
        training_config['model_name_or_path'],
        quantization_config=quantization_config,
        attn_implementation="flash_attention_2", # use sdpa, alternatively use "flash_attention_2"
        torch_dtype=quant_storage_dtype,
        use_cache=False if training_config['gradient_checkpointing'] else True,  # this is needed for gradient checkpointing
    )

    if training_config['gradient_checkpointing']:
        model.gradient_checkpointing_enable()

    lora_config = LoraConfig(
        r=training_config['lora_r'],
        target_modules="all-linear",
        task_type=TaskType.CAUSAL_LM,
        lora_alpha=training_config['lora_alpha'],
        lora_dropout=0.05
    )

    training_arguments = SFTConfig(
        save_strategy='epoch',
        # save_steps=training_config['save_steps'],
        ddp_find_unused_parameters=False,
        gradient_checkpointing=training_config['gradient_checkpointing'],
        per_device_train_batch_size=training_config['per_device_train_batch_size'],
        gradient_accumulation_steps=training_config['gradient_accumulation_steps'],
        num_train_epochs=training_config['num_train_epochs'],
        learning_rate=training_config['learning_rate'],
        warmup_ratio=training_config['warmup_ratio'],
        lr_scheduler_type="cosine",
        bf16=True,
        tf32=True,
        max_steps=training_config['max_steps'],
        logging_steps=training_config['logging_steps'],
        output_dir=training_config['output_dir'],
        gradient_checkpointing_kwargs={'use_reentrant':False},
        max_seq_length=training_config['max_seq_len'],
        use_liger=training_config['use_liger'],
        dataset_text_field='text',
        packing=False,
        fsdp="full_shard auto_wrap offload",
        fsdp_config={
            "backward_prefetch" : "backward_pre",
            "forward_prefetch" : "false",
            "use_orig_params" : "false",
            "activation_checkpointing" : "true",
        }
    )

    trainer = SFTTrainer(
        model=model,
        args=training_arguments,
        train_dataset=train_dataset,
        peft_config=lora_config,
    )
    if training_config['resume']:
        trainer.train(resume_from_checkpoint=True)
    else:
        trainer.train()

from accelerate import notebook_launcher

notebook_launcher(fsdp_train, num_processes=8, mixed_precision='bf16', use_port='12345')
BenjaminBossan commented 1 day ago

Thanks @nivibilla. I assume you're on the latest versions of the relevant libraries (PEFT, accelerate, transformers)?

With your setup, I'm not sure whether we'll get fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP, which I believe is necessary for QLoRA + FSDP training to work correctly. Could you please verify that?
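
If it turns out it isn't, one thing that might force it in your setup (a sketch, not something I have tested here) is to name the decoder layer class in the fsdp_config you already pass to SFTConfig, via transformer_layer_cls_to_wrap, which normally implies transformer-based wrapping. The class name below is an assumption for a Llama-style model and has to match your architecture:

        fsdp_config={
            "backward_prefetch" : "backward_pre",
            "forward_prefetch" : "false",
            "use_orig_params" : "false",
            "activation_checkpointing" : "true",
            # assumption: replace with the decoder layer class of your model
            "transformer_layer_cls_to_wrap" : "LlamaDecoderLayer",
        }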

Another thing you could try is to coerce all LoRA modules to bfloat16. For this, after initializing the trainer, you'd have to call something like:

for name, module in model.named_modules():
    if "lora_" in name:
        module.to(torch.bfloat16)

Normally this shouldn't be necessary, but if it does help, we learn more about the source of the issue.

nivibilla commented 1 day ago

Thanks @BenjaminBossan

For fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP, can I just add it like this?

        fsdp_config={
            "backward_prefetch" : "backward_pre",
            "forward_prefetch" : "false",
            "use_orig_params" : "false",
            "activation_checkpointing" : "true",
           "fsdp_auto_wrap_policy" : "TRANSFORMER_BASED_WRAP"
        }

I'm using Databricks, so I'd prefer to use the notebook launcher if possible.

wizeng23 commented 11 hours ago

I have the same error trying to do QLoRA FSDP for meta-llama/Llama-3.2-3B-Instruct. I'm using the latest package versions: pip install accelerate==1.0.0 transformers==4.45.2 trl==0.11.3 peft==0.13.1 bitsandbytes==0.44.1.

I tried the solution proposed by @BenjaminBossan, but it didn't resolve the error. Coercing all modules to bf16, however, does seem to bypass it:

for name, module in model.named_modules():
    try:
        module.to(torch.bfloat16)
    except Exception:
        # some modules refuse the cast; skip them
        pass

Even though this no longer triggers the error, something else seems to be broken: the training stalls until it eventually times out. Specifically, it prints the start of the WandB logs but never shows the tqdm training progress bar, and during this time GPU memory consumption doesn't change according to nvidia-smi.

wandb: Currently logged in as: ... (...). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.18.3 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.17.7
wandb: Run data is saved locally in ...
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run ...
wandb: ⭐️ View project at ...
wandb: 🚀 View run at ...
# No tqdm bar :(

Strangely, meta-llama/Meta-Llama-3.1-8B-Instruct and meta-llama/Meta-Llama-3.1-70B-Instruct train fine with the module dtype coercion; only Llama 3.2 3B stalls.


I also saw the second issue, a stalled training run, when trying to run run_peft_qlora_fsdp.sh, which is referenced in Hugging Face's documentation page on QLoRA FSDP. Note that with this script the stall occurs for other models as well, e.g. Llama 2 7B/70B. However, it goes away if I use the minimum required package versions mentioned in the docs, i.e. pip install accelerate==0.28.0 transformers==4.39.0 trl==0.8.0 peft==0.10.0 bitsandbytes==0.43.0.

After a lot of PyPI version hopscotch, the offending change seems to be in transformers between versions 4.44.2 and 4.45.0: the former runs QLoRA FSDP correctly, while the latter results in a stalled training job. Applying this to my own code, using transformers==4.44.2 together with the code snippet earlier in my comment lets me QLoRA-FSDP-tune Llama 3.2 3B.

However, the first issue seems to be present even with the minimum required package versions. Take this with a grain of salt, as I was only able to run haphazard tests; my codebase has several incompatibilities with older HF package versions.


In summary, this suggests to me that there are two issues here, which might not be related:

  1. Non-uniform dtypes. The temporary patch is to coerce all modules to the desired dtype. Given that this issue doesn't occur in the reference script, perhaps it's an issue with my code or FSDP config?
  2. A stalled training job, for Llama 3.2 3B in my codebase and for most models in the reference script. The fix is to pin transformers<=4.44.2.

Any insights into either of these issues? Please LMK if I need to file issues in other repos as well. Thanks!

BenjaminBossan commented 5 hours ago

@nivibilla: Yes, I think it should be possible like that.

@wizeng23 Thanks for your detailed report. Based on that, I ran my own experiments. What I found:

When using transformers 4.44.2, I can train a Llama model (tested with meta-llama/Llama-2-7b-hf), but the next transformers version, 4.45.0, fails with

ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32.

(Note that the tokenizers version also needs to be changed, but that's probably not the cause)

When I check where the float32 params come from, they are indeed the LoRA weights, but only on rank 1; rank 0 is all bfloat16. Going back to 4.44.2, the dtype is bfloat16 on all ranks. This explains why your coercion code fixes the issue.
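
For anyone who wants to reproduce this kind of check, here is a minimal sketch (assuming it runs on each rank after the trainer is initialized, so that model, or trainer.model, is the not-yet-wrapped PEFT model):

import torch
import torch.distributed as dist

# Sketch: print every parameter that is not bfloat16, together with the rank it lives on.
rank = dist.get_rank() if dist.is_initialized() else 0
for name, param in model.named_parameters():
    if param.dtype != torch.bfloat16:
        print(f"rank {rank}: {name} -> {param.dtype}")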

Normally, SFTTrainer should take care of ensuring that the LoRA weights are initialized in bfloat16. This depends on a variable called is_sharded_qlora, which is determined here:

https://github.com/huggingface/trl/blob/70036bf87f1036ef43b92cc421f7a3049debb1ec/trl/trainer/sft_trainer.py#L242-L250

When I check this variable, it comes out as False. This should not happen: it needs to always be True when using QLoRA + FSDP. I don't know if the issue lies with transformers or with trl. I'll ping some colleagues to hopefully figure this out.
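
If you want to see what that check keys on in your own run, here is a rough sketch (a paraphrase for debugging, not the exact trl code): with FSDP_CPU_RAM_EFFICIENT_LOADING, the bitsandbytes 4-bit weights are expected to still sit on cpu/meta at trainer-init time, which is roughly what is_sharded_qlora detects.

# Sketch: report where the 4-bit (Params4bit) weights live right before the trainer is created.
if getattr(model, "is_loaded_in_4bit", False):
    devices = {
        param.data.device.type
        for param in model.parameters()
        if param.__class__.__name__ == "Params4bit"
    }
    print("4-bit param device types:", devices)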

Regarding the issue with Llama 3.2 3B, I didn't have time to look into that yet, but let's first try to resolve this fundamental issue.

Edit

I also tried the latest transformers version (commit 144852fb) and there I got a really weird error:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/name/work/forks/peft/examples/sft/train.py", line 184, in <module>
[rank1]:     main(model_args, data_args, training_args)
[rank1]:   File "/home/name/work/forks/peft/examples/sft/train.py", line 107, in main
[rank1]:     model, peft_config, tokenizer = create_and_prepare_model(model_args, data_args, training_args)
[rank1]:                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/work/forks/peft/examples/sft/utils.py", line 173, in create_and_prepare_model
[rank1]:     model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
[rank1]:   File "/home/name/work/clones/transformers/src/transformers/modeling_utils.py", line 2087, in resize_token_embeddings
[rank1]:     model_embeds = self._resize_token_embeddings(new_num_tokens, pad_to_multiple_of, mean_resizing)
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/work/clones/transformers/src/transformers/modeling_utils.py", line 2112, in _resize_token_embeddings
[rank1]:     new_embeddings = self._get_resized_embeddings(
[rank1]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/work/clones/transformers/src/transformers/modeling_utils.py", line 2266, in _get_resized_embeddings
[rank1]:     self._init_added_embeddings_weights_with_mean(
[rank1]:   File "/home/name/work/clones/transformers/src/transformers/modeling_utils.py", line 2446, in _init_added_embeddings_weights_with_mean
[rank1]:     (covariance == covariance.T).all() and not torch.is_complex(eigenvalues) and (eigenvalues > 0).all()
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/_meta_registrations.py", line 6054, in meta_local_scalar_dense
[rank1]:     raise RuntimeError("Tensor.item() cannot be called on meta tensors")
[rank1]: RuntimeError: Tensor.item() cannot be called on meta tensors

I'm hopeful that this will be resolved with the same fix, so I'd say we can ignore it for now :crossed_fingers:

wizeng23 commented 1 hour ago

Thanks for the analysis @BenjaminBossan! I'll stick with the dtype coercion as a temporary fix while waiting for the root fix. If only the LoRA weights are float32, then your coercion code should also have worked, right? Since it didn't work for me, I'm wondering whether something else in the model is also float32.
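
A quick check like the following (a sketch; model here meaning the PEFT model right before trainer.train(), e.g. trainer.model) should show whether anything besides the LoRA layers is still float32:

import torch

# Sketch: group float32 parameters into LoRA weights vs. everything else (norms, embeddings, ...).
fp32_params = [n for n, p in model.named_parameters() if p.dtype == torch.float32]
lora_fp32 = [n for n in fp32_params if "lora_" in n]
other_fp32 = [n for n in fp32_params if "lora_" not in n]
print("float32 LoRA params:", lora_fp32)
print("float32 non-LoRA params:", other_fp32)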

wizeng23 commented 10 minutes ago

Also, in my codebase, reverting to transformers==4.44.2 doesn't resolve the non-uniform dtype issue. I tested it with Llama 2 7B, Llama 3.1 8B, and Llama 3.2 3B.