Open JamesDConley opened 1 year ago
Please fill the issue templates for the specific bug in Accelerate or close this. There is no point opening an issue in different repos if it's all the same.
I linked this because someone else opened an issue in PEFT, but the issue appears to actually be in Accelerate, so I wanted to make sure the right eyes see it.
Here's the full original issue in PEFT
The examples/conditional_generation/peft_lora_seq2seq_accelerate_fsdp.py appears to be incompatible with FSDP.
It appears others have also noticed the issue. https://github.com/h2oai/h2o-llmstudio/issues/98
I've created a stripped down version that I run with the accelerate launcher
accelerate launch train.py
My launcher config
- `Accelerate` version: 0.18.0
- Platform: Linux-6.1.24-x86_64-with-glibc2.37
- Python version: 3.10.10
- Numpy version: 1.24.3
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: FSDP
- mixed_precision: fp16
- use_cpu: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- main_process_ip: 0.0.0.0
- main_process_port: 8080
- rdzv_backend: static
- same_network: True
- main_training_function: main
- fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'FULL_STATE_DICT', 'fsdp_transformer_layer_cls_to_wrap': 'T5Block'}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
import torch
from accelerate import Accelerator
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
)
from peft import LoraConfig, TaskType, get_peft_model
from peft.utils.other import fsdp_auto_wrap_policy


def main():
    accelerator = Accelerator()
    model_name_or_path = "t5-base"
    lr = 1e-3
    num_epochs = 1
    peft_config = LoraConfig(
        task_type=TaskType.SEQ_2_SEQ_LM,
        inference_mode=False,
        r=8,
        lora_alpha=32,
        lora_dropout=0.1,
    )
    model = AutoModelForSeq2SeqLM.from_pretrained(
        model_name_or_path,
        torch_dtype=torch.float16,
    )
    model = get_peft_model(model, peft_config)
    accelerator.print(model.print_trainable_parameters())
    AutoTokenizer.from_pretrained(model_name_or_path)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=(8 * num_epochs),
    )
    if getattr(accelerator.state, "fsdp_plugin", None) is not None:
        accelerator.state.fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(model)
    (
        model,
        optimizer,
        lr_scheduler,
    ) = accelerator.prepare(model, optimizer, lr_scheduler)
    accelerator.print(model)


if __name__ == "__main__":
    main()
The full stacktrace is as follows:
accelerate launch --config_file ./finetune/launcher_configs/accelerate_fsdp_no_offload_config.yaml ./finetune/peft_lora_seq2seq_accelerate_fsdp.py t5-base
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /nix/store/0781hi5c3vb0v7h0s701adqgg4531qib-cuda-home/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /nix/store/0781hi5c3vb0v7h0s701adqgg4531qib-cuda-home/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
trainable params: 884736 || all params: 223788288 || trainable%: 0.3953450861557152
trainable params: 884736 || all params: 223788288 || trainable%: 0.3953450861557152
None
/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/transformers/models/t5/tokenization_t5_fast.py:155: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
warnings.warn(
/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/transformers/models/t5/tokenization_t5_fast.py:155: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
warnings.warn(
FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer
Traceback (most recent call last):
File "/home/markh/text-fine-tuning-experiments/./finetune/peft_lora_seq2seq_accelerate_fsdp.py", line 54, in <module>
main()
File "/home/markh/text-fine-tuning-experiments/./finetune/peft_lora_seq2seq_accelerate_fsdp.py", line 49, in main
) = accelerator.prepare(model, optimizer, lr_scheduler)
File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1122, in prepare
result = tuple(
File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1123, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 977, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1227, in prepare_model
model = FSDP(model, **kwargs)
File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1036, in __init__
self._auto_wrap(auto_wrap_kwargs, fsdp_kwargs)
File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1291, in _auto_wrap
_recursive_wrap(**auto_wrap_kwargs, **fsdp_kwargs)
File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 403, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 403, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 403, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
[Previous line repeated 2 more times]
File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 421, in _recursive_wrap
return _wrap(module, wrapper_cls, **kwargs), num_params
File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 350, in _wrap
return wrapper_cls(module, **kwargs)
File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1079, in __init__
self._fsdp_wrapped_module = FlattenParamsWrapper(
File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/flatten_params_wrapper.py", line 103, in __init__
self._flat_param_handle = FlatParamHandle(params, module, device, config)
File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/flat_param.py", line 270, in __init__
self._init_flat_param(params, module)
File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/flat_param.py", line 330, in _init_flat_param
raise ValueError(
ValueError: `FlatParameter` requires uniform dtype but got torch.float16 and torch.float32
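One way to narrow this error down is to list which submodules hold parameters of more than one dtype, since the traceback shows FSDP building a single `FlatParameter` per wrapped module. The helper below is only a minimal sketch and not part of Accelerate, PEFT, or PyTorch: `find_mixed_dtype_groups` and its `depth` argument are hypothetical names. It assumes each item looks like what `model.named_parameters()` yields (a dotted name plus an object with a `dtype` attribute), and grouping by name prefix only approximates the real FSDP wrap units.

```python
from collections import defaultdict


def find_mixed_dtype_groups(named_params, depth=4):
    """Group parameters by the first `depth` components of their dotted name
    and return only the groups that contain more than one dtype.

    `named_params` is any iterable of (name, param) pairs where `param`
    has a `.dtype` attribute, e.g. `model.named_parameters()`.
    """
    groups = defaultdict(set)
    for name, param in named_params:
        prefix = ".".join(name.split(".")[:depth])
        groups[prefix].add(str(param.dtype))
    # Keep only prefixes whose parameters disagree on dtype.
    return {
        prefix: sorted(dtypes)
        for prefix, dtypes in groups.items()
        if len(dtypes) > 1
    }


# Hypothetical usage, run just before accelerator.prepare(...):
# for prefix, dtypes in find_mixed_dtype_groups(model.named_parameters(), depth=6).items():
#     print(prefix, dtypes)
```

If the base weights are float16 and the LoRA adapter weights are float32, as the error message suggests, the affected blocks should show up with both dtypes listed; the right `depth` depends on how deeply the PEFT wrapper nests the model.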
My own example/details
Relevant Package Versions
torch==2.0.1
torchaudio==2.0.2+cu118
torchvision==0.15.2+cu118
peft==0.3.0
accelerate==0.20.3
Using base container nvidia/cuda:11.8.0-devel-ubuntu22.04 in docker on a linux box with 2x A6000 GPUs running ubuntu 22.04
When using custom scripts
My own custom task/dataset, although it fails preparing the model before training so that's not relevant. I've stripped out the dataset code for the minimal example and just passed None for simplicity. It raises the same error in either case.
import torch
import logging
from accelerate import Accelerator
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoTokenizer, AutoModelForCausalLM, AdamW, get_constant_schedule_with_warmup
logging.basicConfig(level=logging.INFO, format="%(asctime)s.%(msecs)03d %(levelname)s: %(message)s", datefmt="%Y-%m-%d %H:%M:%S",)
logger = logging.getLogger(__name__)
accelerator = Accelerator()
BASE_MODEL = "tiiuae/falcon-7b"
USE_GRADIENT_CHECKPOINTING = True
TARGET_MODULES = ["query_key_value"]
# Setup Tokenizer
logger.info("Setting up Tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, cache_dir="/app/models/hface_cache", use_auth_token=None)
tokenizer.pad_token = tokenizer.eos_token
# Setup Model
logger.info("Setting up Model...")
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, cache_dir="/app/models/hface_cache", use_cache=False, torch_dtype=torch.float16, use_auth_token=None, trust_remote_code=True
)
# Setup PEFT
logger.info("Setting up PEFT Config...")
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1, target_modules=TARGET_MODULES
)
model.enable_input_require_grads()
logger.info("Converting Model to PEFT...")
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
if USE_GRADIENT_CHECKPOINTING:
    logger.info("Enabling Gradient Checkpointing...")
    model.gradient_checkpointing_enable()
_ = model.train()
# Prepare for training
# Setup optimizer & learning rate scheduler
opt = AdamW(model.parameters(), lr=0.0001)
scheduler = get_constant_schedule_with_warmup(opt, num_warmup_steps=100)
# Accelerate Components
logger.info("Wrapping objects with accelerate...")
model, opt, _, scheduler = accelerator.prepare(
    model, opt, None, scheduler
)
root@805a1946b2f3:/app# accelerate config
In which compute environment are you running?
This machine
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]:
Do you wish to optimize your script with torch dynamo?[yes/NO]:
Do you want to use DeepSpeed? [yes/NO]:
Do you want to use FullyShardedDataParallel? [yes/NO]: yes
What should be your sharding strategy?
FULL_SHARD
Do you want to offload parameters and gradients to CPU? [yes/NO]: yes
What should be your auto wrap policy?
TRANSFORMER_BASED_WRAP
Specify the comma-separated list of transformer layer class names (case-sensitive) to wrap ,e.g, :`BertLayer`, `GPTJBlock`, `T5Block`, `BertLayer,BertEmbeddings,BertSelfOutput` ...? : DecoderLayer
What should be your FSDP's backward prefetch policy?
BACKWARD_PRE
What should be your FSDP's state dict type?
FULL_STATE_DICT
How many GPU(s) should be used for distributed training? [1]:2
Do you wish to use FP16 or BF16 (mixed precision)?
fp16
accelerate configuration saved at /root/.cache/huggingface/accelerate/default_config.yaml
root@805a1946b2f3:/app# accelerate launch src/minimal_example.py
2023-06-22 03:50:07.968 INFO: Created a temporary directory at /tmp/tmpe4jhd31z
2023-06-22 03:50:07.968 INFO: Created a temporary directory at /tmp/tmpcrhlvm3j
2023-06-22 03:50:07.968 INFO: Writing /tmp/tmpe4jhd31z/_remote_module_non_scriptable.py
2023-06-22 03:50:07.968 INFO: Writing /tmp/tmpcrhlvm3j/_remote_module_non_scriptable.py
2023-06-22 03:50:08.031 INFO: Added key: store_based_barrier_key:1 to store for rank: 1
2023-06-22 03:50:08.031 INFO: Added key: store_based_barrier_key:1 to store for rank: 0
2023-06-22 03:50:08.031 INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-06-22 03:50:08.031 INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-06-22 03:50:08.048 INFO: Setting up Tokenizer...
2023-06-22 03:50:08.048 INFO: Setting up Tokenizer...
2023-06-22 03:50:08.289 INFO: Setting up Model...
2023-06-22 03:50:08.290 INFO: Setting up Model...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00, 6.07s/it]
2023-06-22 03:52:19.813 INFO: Setting up PEFT Config...
2023-06-22 03:52:19.813 INFO: Converting Model to PEFT...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00, 6.18s/it]
2023-06-22 03:52:22.155 INFO: Setting up PEFT Config...
2023-06-22 03:52:22.156 INFO: Converting Model to PEFT...
trainable params: 2359296 || all params: 6924080000 || trainable%: 0.03407378308742822
2023-06-22 03:52:25.845 INFO: Enabling Gradient Checkpointing...
/usr/local/lib/python3.9/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
2023-06-22 03:52:25.849 INFO: Wrapping objects with accelerate...
Traceback (most recent call last):
File "/app/src/minimal_example.py", line 87, in <module>
model, opt, _, scheduler = accelerator.prepare(
File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1182, in prepare
result = tuple(
File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1183, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1022, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1300, in prepare_model
model = FSDP(model, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 391, in __init__
_auto_wrap(auto_wrap_kwargs, fsdp_kwargs, FullyShardedDataParallel)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 73, in _auto_wrap
_recursive_wrap(**auto_wrap_kwargs, **fsdp_kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
[Previous line repeated 2 more times]
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 388, in _recursive_wrap
return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 317, in _wrap
return wrapper_cls(module, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 408, in __init__
_init_param_handle_from_module(
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/_init_utils.py", line 429, in _init_param_handle_from_module
_init_param_handle_from_params(state, managed_params, fully_sharded_module)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/_init_utils.py", line 525, in _init_param_handle_from_params
handle = FlatParamHandle(
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/flat_param.py", line 366, in __init__
self._init_flat_param(params, fully_sharded_module, use_orig_params)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/flat_param.py", line 430, in _init_flat_param
raise ValueError(
ValueError: `FlatParameter` requires uniform dtype but got torch.float16 and torch.float32
trainable params: 2359296 || all params: 6924080000 || trainable%: 0.03407378308742822
2023-06-22 03:52:28.238 INFO: Enabling Gradient Checkpointing...
/usr/local/lib/python3.9/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
2023-06-22 03:52:28.242 INFO: Wrapping objects with accelerate...
2023-06-22 03:52:28.242 WARNING: FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer
Traceback (most recent call last):
File "/app/src/minimal_example.py", line 87, in <module>
model, opt, _, scheduler = accelerator.prepare(
File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1182, in prepare
result = tuple(
File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1183, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1022, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1300, in prepare_model
model = FSDP(model, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 391, in __init__
_auto_wrap(auto_wrap_kwargs, fsdp_kwargs, FullyShardedDataParallel)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 73, in _auto_wrap
_recursive_wrap(**auto_wrap_kwargs, **fsdp_kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
[Previous line repeated 2 more times]
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 388, in _recursive_wrap
return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/wrap.py", line 317, in _wrap
return wrapper_cls(module, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 408, in __init__
_init_param_handle_from_module(
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/_init_utils.py", line 429, in _init_param_handle_from_module
_init_param_handle_from_params(state, managed_params, fully_sharded_module)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/_init_utils.py", line 525, in _init_param_handle_from_params
handle = FlatParamHandle(
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/flat_param.py", line 366, in __init__
self._init_flat_param(params, fully_sharded_module, use_orig_params)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/fsdp/flat_param.py", line 430, in _init_flat_param
raise ValueError(
ValueError: `FlatParameter` requires uniform dtype but got torch.float16 and torch.float32
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2139) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 928, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 627, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/minimal_example.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-06-22_03:52:31
host : 805a1946b2f3
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2140)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-06-22_03:52:31
host : 805a1946b2f3
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2139)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
root@805a1946b2f3:/app#
Expected behavior: the model is loaded with FSDP across the 2 GPUs without crashing.
Hello, FSDP with PEFT doesn't lead to any memory savings compared to plain PyTorch. See https://github.com/pytorch/pytorch/issues/91165#issuecomment-160080533, which also shows how to use FSDP with PEFT nonetheless.
Thanks for the heads up. I tested with DeepSpeed ZeRO (stage 3) last night and managed to get Falcon-40B training working, so I'll continue on with that instead 👍
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I got the same error even after pre-casting all modules' parameters to torch.float16. Any update on this issue?
> Hello, FSDP with PEFT doesn't lead to any memory savings compared to plain PyTorch. See pytorch/pytorch#91165 (comment), which also shows how to use FSDP with PEFT nonetheless.
> Thanks for the heads up. I tested with DeepSpeed ZeRO (stage 3) last night and managed to get Falcon-40B training working, so I'll continue on with that instead 👍
Hi bro, how did you fix that? Still stuck with the error ValueError: FlatParameter requires uniform dtype but got torch.float16 and torch.float32
Same issue here, seems FSDP isn't playing nice with PEFT.
The error message I see is slightly different:
ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32
But I think it's the same issue other folks on here seem to be facing. This happens when I use the `fsdp` & `fsdp_config` params in `TrainingArguments`, so I'm not explicitly using Accelerate, but it is being used under the hood nevertheless.
Setting FSDP_CPU_RAM_EFFICIENT_LOADING=1 solves the problem...
I tried launching the script with FSDP_CPU_RAM_EFFICIENT_LOADING=1, but it didn't work. I'm having the same issue.
This is the blog I am following.
My command:
FSDP_CPU_RAM_EFFICIENT_LOADING=1 torchrun --nproc_per_node=4 run_fsdp_qlora.py --config config.yaml
These are the libraries:
%pip install --quiet \
"torch==2.2.2" tensorboard
# Install Hugging Face libraries
%pip install --upgrade --quiet \
"transformers==4.40.0" "datasets==2.18.0" "accelerate==0.29.3" "evaluate==0.4.1" "bitsandbytes==0.43.1" "huggingface_hub==0.22.2" "trl==0.8.6" "peft==0.10.0"
Any suggestions on how to solve or further investigate the issue? Is there any specific library version I am missing?
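One thing that may be worth ruling out is whether the variable actually reaches every worker process. The snippet below is a minimal sketch for that check: `fsdp_env_report` is a hypothetical helper, and the `FSDP_` / `ACCELERATE_` prefixes are an assumption about which variables are relevant here. It does not confirm how any library interprets the values, only that they are set.

```python
import os


def fsdp_env_report(environ=None, prefixes=("FSDP_", "ACCELERATE_")):
    """Return the environment variables whose names start with one of `prefixes`."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in sorted(environ.items()) if k.startswith(prefixes)}


if __name__ == "__main__":
    # RANK is set by torchrun for each worker; it is absent in a plain `python` run.
    print("rank", os.environ.get("RANK", "?"), fsdp_env_report())
```

Calling `fsdp_env_report()` at the top of the training script, before the model is loaded, prints one line per rank, so a missing variable on any worker is visible immediately.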
Reopening, as I came across this myself. Correct me if I'm wrong, but have we enabled any max_grad_norm?
Setting it to 0 manually "fixes" this, as the issue comes from doing FSDP + grad norm. I'll check with the PyTorch team to see what we can do to fix this.
Hi @muellerzr Did you find any solution? I am also facing the same issue. I am using accelerate==0.30.1 and no max_grad_norm.
Not sure if it is the same issue. In my case, I used the sample code created by Schmid (https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/fsdp-qlora-distributed-llama3.ipynb)
When I used a newer transformers lib (>= 4.41.0), I encountered the error. Looking at the changes between 4.40.2 and 4.41.0, I found this changeset: https://github.com/huggingface/transformers/commit/f16caf44bb1606652ac6c7c4ad4bf44973d4e545.
Then I was able to make the code work again by adding "cpu_ram_efficient_loading" to the fsdp_config, i.e.
fsdp_config:
  backward_prefetch: "backward_pre"
  forward_prefetch: "false"
  use_orig_params: "false"
  cpu_ram_efficient_loading: "true"  ## NEWLY ADDED
  sync_module_states: "true"
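For anyone who builds their arguments in Python rather than YAML, the same `fsdp_config` can be written as a dict and serialized to JSON. This is only a sketch: the keys and string values are copied from the comment above, `dump_fsdp_config` is a hypothetical helper, and whether each key is honored depends on the installed transformers and accelerate versions.

```python
import json

# Keys and values copied from the fsdp_config in the comment above;
# the string "true"/"false" values are kept exactly as written there.
fsdp_config = {
    "backward_prefetch": "backward_pre",
    "forward_prefetch": "false",
    "use_orig_params": "false",
    "cpu_ram_efficient_loading": "true",  # the newly added key
    "sync_module_states": "true",
}


def dump_fsdp_config(config=fsdp_config):
    """Serialize the config to a JSON string that can be saved to a file."""
    return json.dumps(config, indent=2)


# Hypothetical usage:
# with open("fsdp_config.json", "w", encoding="utf-8") as fh:
#     fh.write(dump_fsdp_config())
```

As far as I know, `TrainingArguments` accepts `fsdp_config` either as a dict or as a path to a JSON file, so either form should be usable; please check this against the version you have installed.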
Hi @tle211212 Thanks for your suggestion. I did try setting "cpu_ram_efficient_loading" to true in fsdp_config and don't get the tensor type mismatch error now.
I am using transformer lib ==4.42.4 and torch==2.3.1
fsdp_config:
  backward_prefetch: "backward_pre"
  forward_prefetch: "false"
  use_orig_params: "false"
  limit_all_gathers: "true"
  sync_module_states: "true"
  cpu_ram_efficient_loading: "true"
However, I am getting another error: output tensor size must be equal to world_size times input tensor size
Command: ACCELERATE_USE_FSDP=1 CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=1 --nproc_per_node=4 model.py --config mistral_qlora_fsdp.yaml
File "/home/jupyter/model_phil.py", line 189, in <module>
[rank2]: training_function(script_args, training_args)
[rank2]: File "/home/jupyter/model_phil.py", line 169, in training_function
[rank2]: trainer.train()
[rank2]: File "/opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 361, in train
[rank2]: output = super().train(*args, **kwargs)
[rank2]: File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
[rank2]: return inner_training_loop(
[rank2]: File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2345, in _inner_training_loop
[rank2]: self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
[rank2]: File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2796, in _maybe_log_save_evaluate
[rank2]: self._save_checkpoint(model, trial, metrics=metrics)
[rank2]: File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2879, in _save_checkpoint
[rank2]: self._save_optimizer_and_scheduler(output_dir)
[rank2]: File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2990, in _save_optimizer_and_scheduler
[rank2]: save_fsdp_optimizer(
[rank2]: File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/fsdp_utils.py", line 157, in save_fsdp_optimizer
[rank2]: optim_state = FSDP.optim_state_dict(model, optimizer)
[rank2]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1840, in optim_state_dict
[rank2]: return FullyShardedDataParallel._optim_state_dict_impl(
[rank2]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1263, in _optim_state_dict_impl
[rank2]: return _optim_state_dict(
[rank2]: File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank2]: return func(*args, **kwargs)
[rank2]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1971, in _optim_state_dict
[rank2]: fsdp_osd_state = convert_fn(
[rank2]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1834, in _convert_state_with_flat_params
[rank2]: unflat_state = _unflatten_optim_state(
[rank2]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 160, in _unflatten_optim_state
[rank2]: consolidated_state = _communicate_optim_state(
[rank2]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 239, in _communicate_optim_state
[rank2]: dist.all_gather_into_tensor(
[rank2]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank2]: return func(*args, **kwargs)
[rank2]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2948, in all_gather_into_tensor
[rank2]: work = group._allgather_base(output_tensor, input_tensor, opts)
[rank2]: ValueError: output tensor size must be equal to world_size times input tensor size
Is there any solution or suggestion to fix this? Thanks.
Sadly, this bug still has not been fixed after more than a year.
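For readers following the traceback above: the final `ValueError` states a size invariant on the collective, namely that the output tensor must hold exactly `world_size` times as many elements as each rank's input. The sketch below only illustrates that arithmetic in plain Python. The helper names are hypothetical and are not part of PyTorch or Accelerate, and it does not claim to identify why the optimizer-state shards ended up with mismatched sizes in this run.

```python
def all_gather_sizes_ok(output_numel: int, input_numel: int, world_size: int) -> bool:
    """Return True when the gathered output can hold one equal-sized input per rank."""
    return output_numel == world_size * input_numel


def explain_mismatch(output_numel: int, input_numel: int, world_size: int) -> str:
    """Describe how far the sizes are from the required invariant (hypothetical helper)."""
    expected = world_size * input_numel
    if output_numel == expected:
        return "ok"
    return (
        f"output has {output_numel} elements, "
        f"expected {world_size} * {input_numel} = {expected}"
    )


# Illustrative numbers only: 4 ranks, each contributing 3 elements.
print(explain_mismatch(12, 3, 4))
print(explain_mismatch(10, 3, 4))
```

Logging the per-rank input size and the output size just before the failing call is one way to see which side of this equality is off.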
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Not stale. The bug is still there. FSDP is a great tool for large-model and long-context training, so please fix it. I have the latest versions of all libraries installed.
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
ERROR
File "/root/yarn/finetune.py", line 525, in <module>
[rank0]: main(args.parse_args())
[rank0]: File "/root/yarn/finetune.py", line 367, in main
[rank0]: model = accelerator.prepare(model)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1326, in prepare
[rank0]: result = tuple(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1327, in <genexpr>
[rank0]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1200, in _prepare_one
[rank0]: return self.prepare_model(obj, device_placement=device_placement)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1484, in prepare_model
[rank0]: model = FSDP(model, **kwargs)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 483, in __init__
[rank0]: _auto_wrap(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 102, in _auto_wrap
[rank0]: _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs) # type: ignore[arg-type]
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
[rank0]: wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
[rank0]: wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
[rank0]: wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]: [Previous line repeated 2 more times]
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 562, in _recursive_wrap
[rank0]: return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 491, in _wrap
[rank0]: return wrapper_cls(module, **kwargs)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 509, in __init__
[rank0]: _init_param_handle_from_module(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 603, in _init_param_handle_from_module
[rank0]: _init_param_handle_from_params(state, managed_params, fully_sharded_module)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 615, in _init_param_handle_from_params
[rank0]: handle = FlatParamHandle(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 583, in __init__
[rank0]: self._init_flat_param_and_metadata(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 633, in _init_flat_param_and_metadata
[rank0]: ) = self._validate_tensors_to_flatten(params)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 771, in _validate_tensors_to_flatten
[rank0]: raise ValueError(
[rank0]: ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32
Thanks, Steve
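The `ValueError: Must flatten tensors with uniform dtype` above means that the parameters FSDP tried to flatten together did not all share one dtype. A minimal diagnostic sketch, assuming you can enumerate `(name, dtype)` pairs before calling `accelerator.prepare`; the helper names and the example parameter names are hypothetical, and the grouping is over whatever pairs you pass in, which is coarser than FSDP's own per-wrapped-module check:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_names_by_dtype(named_dtypes: Iterable[Tuple[str, object]]) -> Dict[str, List[str]]:
    """Group parameter names by the string form of their dtype."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, dtype in named_dtypes:
        groups[str(dtype)].append(name)
    return dict(groups)


def has_uniform_dtype(named_dtypes: Iterable[Tuple[str, object]]) -> bool:
    """Return True when at most one dtype appears among the given pairs."""
    return len(group_names_by_dtype(named_dtypes)) <= 1


# Stand-in data; the names below are made up for illustration.
example = [
    ("block.0.linear.weight", "torch.bfloat16"),
    ("block.0.linear.lora_A.weight", "torch.float32"),
]
for dtype, names in group_names_by_dtype(example).items():
    print(dtype, names)
```

With a real PyTorch model, the pairs could come from `((n, p.dtype) for n, p in model.named_parameters())`, which makes it easier to see which parameters are still in `torch.float32` while the rest are in `torch.bfloat16`.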
+1
Tested with FSDP and QLoRA on Qwen 7B using the Accelerate launcher.
Launching training on 8 GPUs.
2024-10-09 14:21:18.291308: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
When using FSDP full shard, instead of using `gradient_checkpointing` in TrainingArguments, please use `activation_checkpointing` in `fsdp_config`. The former introduces a redundant AllGather operation in backward pass. Reference: https://github.com/huggingface/transformers/issues/30404
WARNING:root:No handler for 1e2b71a0a97c482db4ccfc57a77d2fcc
max_steps is given, it will override any value given in num_train_epochs
[2024-10-09 14:22:01,370] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
W1009 14:22:02.871075 140470744776704 torch/multiprocessing/spawn.py:145] Terminating process 63148 via signal SIGTERM
W1009 14:22:02.872952 140470744776704 torch/multiprocessing/spawn.py:145] Terminating process 63150 via signal SIGTERM
W1009 14:22:02.873525 140470744776704 torch/multiprocessing/spawn.py:145] Terminating process 63151 via signal SIGTERM
W1009 14:22:02.873933 140470744776704 torch/multiprocessing/spawn.py:145] Terminating process 63152 via signal SIGTERM
W1009 14:22:02.874345 140470744776704 torch/multiprocessing/spawn.py:145] Terminating process 63153 via signal SIGTERM
W1009 14:22:02.874789 140470744776704 torch/multiprocessing/spawn.py:145] Terminating process 63154 via signal SIGTERM
W1009 14:22:02.875156 140470744776704 torch/multiprocessing/spawn.py:145] Terminating process 63155 via signal SIGTERM
W1009 14:22:32.908281 140470744776704 torch/multiprocessing/spawn.py:153] Unable to shutdown process 63148 via SIGTERM , forcefully exiting via SIGKILL
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] failed (exitcode: 1) local_rank: 1 (pid: 63149) of fn: fsdp_train (start_method: fork)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] Traceback (most recent call last):
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 656, in _poll
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] self._pc.join(-1)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 188, in join
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] raise ProcessRaisedException(msg, error_index, failed_process.pid)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] torch.multiprocessing.spawn.ProcessRaisedException:
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] -- Process 1 terminated with the following error:
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] Traceback (most recent call last):
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] fn(i, *args)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 580, in _wrap
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ret = record(fn)(*args_)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] return f(*args, **kwargs)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/root/.ipykernel/62919/command-1189292846020885-496091015", line 108, in fsdp_train
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] trainer.train()
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-526a0c7c-9fb3-498c-be7d-39bbf80f2668/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 434, in train
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] output = super().train(*args, **kwargs)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/mlflow/utils/autologging_utils/safety.py", line 460, in safe_patch_function
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] return original(*args, **kwargs)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python_shell/dbruntime/huggingface_patches/transformers.py", line 54, in patched_fit_function
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] model = original_method(self, *args, **kwargs)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/transformers/trainer.py", line 2052, in train
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] return inner_training_loop(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/transformers/trainer.py", line 2194, in _inner_training_loop
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] self.model = self.accelerator.prepare(self.model)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/accelerate/accelerator.py", line 1326, in prepare
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] result = tuple(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/accelerate/accelerator.py", line 1327, in <genexpr>
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/accelerate/accelerator.py", line 1200, in _prepare_one
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] return self.prepare_model(obj, device_placement=device_placement)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/accelerate/accelerator.py", line 1484, in prepare_model
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] model = FSDP(model, **kwargs)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 485, in __init__
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] _auto_wrap(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs) # type: ignore[arg-type]
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] wrapped_child, num_wrapped_params = _recursive_wrap(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] wrapped_child, num_wrapped_params = _recursive_wrap(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] wrapped_child, num_wrapped_params = _recursive_wrap(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] [Previous line repeated 2 more times]
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] return wrapper_cls(module, **kwargs)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 511, in __init__
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] _init_param_handle_from_module(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_init_utils.py", line 598, in _init_param_handle_from_module
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] _init_param_handle_from_params(state, managed_params, fully_sharded_module)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_init_utils.py", line 610, in _init_param_handle_from_params
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] handle = FlatParamHandle(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 582, in __init__
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] self._init_flat_param_and_metadata(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 632, in _init_flat_param_and_metadata
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ) = self._validate_tensors_to_flatten(params)
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 770, in _validate_tensors_to_flatten
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] raise ValueError(
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695] ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32
E1009 14:22:34.369736 140470744776704 torch/distributed/elastic/multiprocessing/api.py:695]
ChildFailedError:
============================================================
fsdp_train FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-10-09_14:22:02
host : 0823-062625-tzq5t3e1-10-168-70-23
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 63149)
error_file: /tmp/torchelastic_db2i3dhw/none_gw9dnnic/attempt_0/1/error.json
traceback : Traceback (most recent call last):
File "/databricks/python/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/.ipykernel/62919/command-1189292846020885-496091015", line 108, in fsdp_train
trainer.train()
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-526a0c7c-9fb3-498c-be7d-39bbf80f2668/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 434, in train
output = super().train(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.11/site-packages/mlflow/utils/autologging_utils/safety.py", line 460, in safe_patch_function
return original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/databricks/python_shell/dbruntime/huggingface_patches/transformers.py", line 54, in patched_fit_function
model = original_method(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/transformers/trainer.py", line 2052, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/transformers/trainer.py", line 2194, in _inner_training_loop
self.model = self.accelerator.prepare(self.model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/accelerate/accelerator.py", line 1326, in prepare
result = tuple(
^^^^^^
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/accelerate/accelerator.py", line 1327, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/accelerate/accelerator.py", line 1200, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/accelerate/accelerator.py", line 1484, in prepare_model
model = FSDP(model, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 485, in __init__
_auto_wrap(
File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
_recursive_wrap(**recursive_wrap_kwargs, **root_kwargs) # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
^^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
^^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
^^^^^^^^^^^^^^^^
[Previous line repeated 2 more times]
File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
return wrapper_cls(module, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 511, in __init__
_init_param_handle_from_module(
File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_init_utils.py", line 598, in _init_param_handle_from_module
_init_param_handle_from_params(state, managed_params, fully_sharded_module)
File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_init_utils.py", line 610, in _init_param_handle_from_params
handle = FlatParamHandle(
^^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 582, in __init__
self._init_flat_param_and_metadata(
File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 632, in _init_flat_param_and_metadata
) = self._validate_tensors_to_flatten(params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 770, in _validate_tensors_to_flatten
raise ValueError(
ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32
============================================================
File <command-1189292846020892>, line 6
3 os.environ["ACCELERATE_USE_FSDP"] = '1'
4 os.environ["FSDP_CPU_RAM_EFFICIENT_LOADING"] = '1'
----> 6 notebook_launcher(fsdp_train, num_processes=8, mixed_precision='bf16', use_port='12345')
File /databricks/python/lib/python3.11/site-packages/torch/distributed/launcher/api.py:263, in launch_agent(config, entrypoint, args)
256 events.record(agent.get_event_succeeded())
258 if result.is_failed():
259 # ChildFailedError is treated specially by @record
260 # if the error files for the failed children exist
261 # @record will copy the first error (root cause)
262 # to the error file of the launcher process.
--> 263 raise ChildFailedError(
264 name=entrypoint_name,
265 failures=result.failures,
266 )
268 return result.return_values
269 except ChildFailedError:
@thusinh1969 are you also using LoRA/QLoRA or normal fine-tuning?
@nivibilla Could you please show your train script, or at the very least how the base model and PEFT model are initialized?
@BenjaminBossan sure
def fsdp_train():
    from dataclasses import dataclass
    import datasets
    import torch
    import transformers
    from trl import SFTConfig, SFTTrainer
    from peft import LoraConfig, TaskType, get_peft_model
    import json
    import os
    os.environ["ACCELERATE_USE_FSDP"] = '1'
    os.environ["FSDP_CPU_RAM_EFFICIENT_LOADING"] = '1'
    with open('/local_disk0/training_config.json') as f:
        training_config = json.load(f)
    # # testing memory usage for batch size
    training_config['max_steps'] = 50
    # training_config['per_device_train_batch_size'] = 32
    # print(json.dumps(training_config, indent=4))
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        training_config['model_name_or_path'],
        padding_side="left",
        truncation_side="left",
    )
    tokenizer.pad_token = tokenizer.eos_token
    train_dataset = datasets.load_from_disk('/local_disk0/train')
    # Model
    torch_dtype = torch.bfloat16
    quant_storage_dtype = torch.bfloat16
    quantization_config = transformers.BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch_dtype,
        bnb_4bit_quant_storage=quant_storage_dtype,
    )
    model = transformers.AutoModelForCausalLM.from_pretrained(
        training_config['model_name_or_path'],
        quantization_config=quantization_config,
        attn_implementation="flash_attention_2", # use sdpa, alternatively use "flash_attention_2"
        torch_dtype=quant_storage_dtype,
        use_cache=False if training_config['gradient_checkpointing'] else True, # this is needed for gradient checkpointing
    )
    if training_config['gradient_checkpointing']:
        model.gradient_checkpointing_enable()
    lora_config = LoraConfig(
        r=training_config['lora_r'],
        target_modules="all-linear",
        task_type=TaskType.CAUSAL_LM,
        lora_alpha=training_config['lora_alpha'],
        lora_dropout=0.05
    )
    training_arguments = SFTConfig(
        save_strategy='epoch',
        # save_steps=training_config['save_steps'],
        ddp_find_unused_parameters=False,
        gradient_checkpointing=training_config['gradient_checkpointing'],
        per_device_train_batch_size=training_config['per_device_train_batch_size'],
        gradient_accumulation_steps=training_config['gradient_accumulation_steps'],
        num_train_epochs=training_config['num_train_epochs'],
        learning_rate=training_config['learning_rate'],
        warmup_ratio=training_config['warmup_ratio'],
        lr_scheduler_type="cosine",
        bf16=True,
        tf32=True,
        max_steps=training_config['max_steps'],
        logging_steps=training_config['logging_steps'],
        output_dir=training_config['output_dir'],
        gradient_checkpointing_kwargs={'use_reentrant':False},
        max_seq_length=training_config['max_seq_len'],
        use_liger=training_config['use_liger'],
        dataset_text_field='text',
        packing=False,
        fsdp="full_shard auto_wrap offload",
        fsdp_config={
            "backward_prefetch" : "backward_pre",
            "forward_prefetch" : "false",
            "use_orig_params" : "false",
            "activation_checkpointing" : "true",
        }
    )
    trainer = SFTTrainer(
        model=model,
        args=training_arguments,
        train_dataset=train_dataset,
        peft_config=lora_config,
    )
    if training_config['resume']:
        trainer.train(resume_from_checkpoint=True)
    else:
        trainer.train()
from accelerate import notebook_launcher
notebook_launcher(fsdp_train, num_processes=8, mixed_precision='bf16', use_port='12345')
Thanks @nivibilla. I assume you're on the latest versions of the relevant libraries (PEFT, accelerate, transformers)?
With your setting, I'm not sure if we'll get fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP, which I believe is necessary for QLoRA FSDP training to work correctly. Could you please verify that?
Another thing you could try is to coerce all LoRA modules to bfloat16. For this, after initializing the trainer, you'd have to call something like:
for name, module in model.named_modules():
    if "lora_" in name:
        module.to(torch.bfloat16)
Normally this shouldn't be necessary but if it helps, we learn more about the source of the issue.
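Before coercing anything, it can help to see which parameters actually carry which dtype. The following is only a minimal sketch, not part of PEFT or accelerate: summarize_dtypes is a hypothetical helper name, and it assumes nothing more than each parameter exposing a .dtype attribute, so the demo uses stand-in objects with made-up names rather than a real model.

```python
from collections import defaultdict
from types import SimpleNamespace


def summarize_dtypes(named_params):
    """Group parameter names by dtype.

    `named_params` is any iterable of (name, param) pairs where `param`
    has a `.dtype` attribute, for example `model.named_parameters()`.
    """
    by_dtype = defaultdict(list)
    for name, param in named_params:
        by_dtype[str(param.dtype)].append(name)
    return dict(by_dtype)


# Stand-in parameters with hypothetical names (assumption: real code would
# pass model.named_parameters() instead of this list).
fake_params = [
    ("layers.0.q_proj.base_layer.weight", SimpleNamespace(dtype="torch.bfloat16")),
    ("layers.0.q_proj.lora_A.default.weight", SimpleNamespace(dtype="torch.float32")),
    ("layers.0.q_proj.lora_B.default.weight", SimpleNamespace(dtype="torch.float32")),
]

summary = summarize_dtypes(fake_params)
for dtype, names in summary.items():
    # Print the dtype, how many parameters use it, and a few example names.
    print(dtype, len(names), names[:3])
```

If the float32 group contains only names with "lora_" in them, the targeted coercion above should be enough; anything else in that group would point to a different source.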
Thanks @BenjaminBossan
For the fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP, can I just add it like this?
fsdp_config={
    "backward_prefetch" : "backward_pre",
    "forward_prefetch" : "false",
    "use_orig_params" : "false",
    "activation_checkpointing" : "true",
    "fsdp_auto_wrap_policy" : "TRANSFORMER_BASED_WRAP"
}
I'm using Databricks, so I prefer to use the notebook launcher if possible.
I have the same error trying to do QLoRA FSDP for meta-llama/Llama-3.2-3B-Instruct. I'm using the latest package versions: pip install accelerate==1.0.0 transformers==4.45.2 trl==0.11.3 peft==0.13.1 bitsandbytes==0.44.1.
I tried the solution proposed by @BenjaminBossan, but it didn't resolve the issue. However, trying to coerce all modules to bf16 seems to bypass the issue:
for name, module in model.named_modules():
    try:
        module.to(torch.bfloat16)
    except Exception as e:
        pass
Even though it doesn't trigger the error, something else seems to be broken, as the training stalls until it eventually times out. Specifically, it will print out the start of the WandB logs but not print out the tqdm training progress bar. During this time, GPU memory consumption doesn't change according to nvidia-smi.
wandb: Currently logged in as: ... (...). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.18.3 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.17.7
wandb: Run data is saved locally in ...
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run ...
wandb: ⭐️ View project at ...
wandb: 🚀 View run at ...
# No tqdm bar :(
Strangely, meta-llama/Meta-Llama-3.1-8B-Instruct and meta-llama/Meta-Llama-3.1-70B-Instruct have no issue training with the module dtype coercion, only 3.2 3B.
I saw the second issue of a stalled training run when trying to run run_peft_qlora_fsdp.sh, which is referenced in HuggingFace's documentation page on QLoRA FSDP. Note that this issue seems to occur in this script with other models like Llama 2 7B/70B. However, the issue is resolved here if I use the minimum required package versions mentioned in the docs, i.e. pip install accelerate==0.28.0 transformers==4.39.0 trl==0.8.0 peft==0.10.0 bitsandbytes==0.43.0.
Thanks to playing a lot of PyPI version hopscotch, the offending change seems to be in transformers between versions 4.44.2 and 4.45.0. That is, using the former runs QLoRA FSDP correctly, and using the latter results in a stalled training job. Applying this to my own code, using transformers==4.44.2 and the code snippet earlier in my comment seems to allow me to QLoRA FSDP tune Llama 3.2 3B.
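A guard like the following could make that version boundary explicit in a training script. It is only a sketch under the assumption, taken from the observations above, that 4.44.2 is the last version that trained without stalling; version_tuple, LAST_KNOWN_GOOD and newer_than_last_known_good are hypothetical names, and the parsing deliberately ignores suffixes such as ".dev0".

```python
import re


def version_tuple(version):
    """Parse the leading numeric components, e.g. '4.45.0.dev0' -> (4, 45, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


# Assumption from the report above, not an official compatibility statement.
LAST_KNOWN_GOOD = (4, 44, 2)


def newer_than_last_known_good(version):
    """True if `version` is past the last version reported to train correctly."""
    return version_tuple(version) > LAST_KNOWN_GOOD


for v in ("4.44.2", "4.45.0", "4.45.2"):
    print(v, newer_than_last_known_good(v))
```

In a real script the installed version string could come from importlib.metadata.version("transformers") rather than a hard-coded list.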
However, the first issue seems present even when I used the minimum required package versions. Take this with a grain of salt, as I was only able to run haphazard tests; my codebase had several incompatibilities with older HF package versions.
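In summary, this seems to suggest to me that there are two issues here which might not be related:
- The non-uniform dtype error, which I can bypass by coercing all modules to bfloat16.
- The stalled training run, which is resolved by using transformers<=4.44.2.
Any insights into either of these issues? Please LMK if I need to file issues in other repos as well. Thanks!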
@nivibilla: Yes, I think it should be possible like that.
@wizeng23 Thanks for your detailed report. Based on that, I ran my own experiments. What I found:
When using transformers 4.44.2, I can train a Llama model (tested meta-llama/Llama-2-7b-hf) but the next transformers version, 4.45.0, fails with ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32.
(Note that the tokenizers version also needs to be changed, but that's probably not the cause)
When checking where the float32 params come from, those are indeed the LoRA weights, but only on rank 1, while rank 0 is all bfloat16. When going to 4.44.2, the dtype is bfloat16 on all ranks. This explains why your coercion code fixes the issue.
Normally, SFTTrainer should take care of ensuring that LoRA weights are initialized with bfloat16. This depends on a variable called is_sharded_qlora, that is determined here:
When I check this variable:
- With transformers 4.44.2, it is True for both ranks.
- With transformers 4.45.0, it is True for rank 0 and False for rank 1.
This should not happen, it needs to always be True when using QLoRA + FSDP. I don't know if the issue lies with transformers or with trl. I'll ping some colleagues to hopefully figure this out.
Regarding the issue with Llama 3.2 3B, I didn't have time to look into that yet, but let's first try to resolve this fundamental issue.
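To make this kind of rank-by-rank comparison repeatable, something like the sketch below could be used. It is not code from transformers, trl or accelerate: find_rank_mismatches is a hypothetical helper, and the per-rank data is illustrative, shaped like the observation above rather than copied from a real run.

```python
def find_rank_mismatches(per_rank):
    """Return {param_name: {rank: dtype}} for parameters whose dtype differs across ranks.

    `per_rank` maps rank -> {param_name: dtype_string}. In a real job these
    mappings could be collected with torch.distributed.all_gather_object
    (assumption: a process group is already initialized at that point).
    """
    all_names = set()
    for mapping in per_rank.values():
        all_names.update(mapping)

    mismatches = {}
    for name in sorted(all_names):
        # A parameter missing on some rank shows up as None for that rank.
        dtypes = {rank: mapping.get(name) for rank, mapping in per_rank.items()}
        if len(set(dtypes.values())) > 1:
            mismatches[name] = dtypes
    return mismatches


# Illustrative data only, with made-up parameter names.
per_rank = {
    0: {"base.weight": "torch.bfloat16", "lora_A.weight": "torch.bfloat16"},
    1: {"base.weight": "torch.bfloat16", "lora_A.weight": "torch.float32"},
}

mismatches = find_rank_mismatches(per_rank)
print(mismatches)
```

An empty result would mean all ranks agree, which is the state observed with transformers 4.44.2.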
Edit
I also tried the latest transformers version (144852fb) and there I got a really really weird error:
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/name/work/forks/peft/examples/sft/train.py", line 184, in <module>
[rank1]: main(model_args, data_args, training_args)
[rank1]: File "/home/name/work/forks/peft/examples/sft/train.py", line 107, in main
[rank1]: model, peft_config, tokenizer = create_and_prepare_model(model_args, data_args, training_args)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/work/forks/peft/examples/sft/utils.py", line 173, in create_and_prepare_model
[rank1]: model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
[rank1]: File "/home/name/work/clones/transformers/src/transformers/modeling_utils.py", line 2087, in resize_token_embeddings
[rank1]: model_embeds = self._resize_token_embeddings(new_num_tokens, pad_to_multiple_of, mean_resizing)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/work/clones/transformers/src/transformers/modeling_utils.py", line 2112, in _resize_token_embeddings
[rank1]: new_embeddings = self._get_resized_embeddings(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/work/clones/transformers/src/transformers/modeling_utils.py", line 2266, in _get_resized_embeddings
[rank1]: self._init_added_embeddings_weights_with_mean(
[rank1]: File "/home/name/work/clones/transformers/src/transformers/modeling_utils.py", line 2446, in _init_added_embeddings_weights_with_mean
[rank1]: (covariance == covariance.T).all() and not torch.is_complex(eigenvalues) and (eigenvalues > 0).all()
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/_meta_registrations.py", line 6054, in meta_local_scalar_dense
[rank1]: raise RuntimeError("Tensor.item() cannot be called on meta tensors")
[rank1]: RuntimeError: Tensor.item() cannot be called on meta tensors
I have hopes that this will be resolved with the same fix, so I'd say we can ignore it for now :crossed_fingers:
Thanks for the analysis @BenjaminBossan! I'll just use the dtype coercion as a temporary fix for now while waiting for the root fix. If only the LoRA weights are float32, then your coercion code should also work, right? Since that didn't work for me, I'm wondering if something else in the model is also float32.
Also, in my codebase, reverting to transformers==4.44.2 doesn't resolve the non-uniform dtype issue. I tested it with Llama 2 7B, Llama 3.1 8B, and Llama 3.2 3B.
System Info
Information
Tasks
- no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
See https://github.com/huggingface/peft/issues/484
Expected behavior
The training code is able to handle the FP16 weights selected via accelerate config. Apologies for linking everything but it's all been provided already by another OP and I am up too late already debugging.