Relevant: https://github.com/huggingface/transformers/pull/26631 @pacman100
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: MistralDecoderLayer
  fsdp_use_orig_params: true
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
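For context, my understanding is that TRANSFORMER_BASED_WRAP together with fsdp_transformer_layer_cls_to_wrap: MistralDecoderLayer corresponds roughly to the raw FSDP auto-wrap policy below; this is only a sketch of how I read the config, not accelerate's exact code path.

import functools
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.mistral.modeling_mistral import MistralDecoderLayer

# Wrap each MistralDecoderLayer in its own FSDP unit, which is what the
# TRANSFORMER_BASED_WRAP setting above asks accelerate to do.
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={MistralDecoderLayer},
)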
Hello @ari9dam,
The PR you tagged above should resolve this issue. Please recreate the FSDP config via the accelerate config command and answer False for RAM-efficient loading of the pretrained model.
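If it helps, one quick way to sanity-check the regenerated config is to dump its fsdp_config section; a rough sketch (the default path below is accelerate's usual location, adjust it if you saved the config elsewhere):

import os
import yaml

# Print the FSDP section of the regenerated accelerate config so you can
# confirm the RAM-efficient-loading answer was recorded as false.
path = os.path.expanduser("~/.cache/huggingface/accelerate/default_config.yaml")
with open(path) as f:
    cfg = yaml.safe_load(f)
print(yaml.dump(cfg.get("fsdp_config", {}), sort_keys=True))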
Thank you, that solved it. I have one more question, @pacman100. I load the model with:

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=training_args.cache_dir,
    use_flash_attention_2=True,
)

Should I pass a torch dtype here while loading the model? I'm using bf16 in the accelerate config, and I get these warnings:

You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model initialized on CPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda')
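For reference, what I mean by passing the dtype is something like the sketch below (reusing model_args and training_args from the call above); I'm not sure whether this is needed, or even correct, when accelerate already handles bf16 mixed precision:

import torch
import transformers

# Same call as above, but with an explicit dtype; torch.bfloat16 is my guess
# based on mixed_precision: bf16 in the accelerate config.
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=training_args.cache_dir,
    use_flash_attention_2=True,
    torch_dtype=torch.bfloat16,
)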
I also had this issue and fixed it by changing
if (
is_deepspeed_zero3_enabled() and torch.distributed.is_initialized() and torch.distributed.get_rank() > 0
) or (is_fsdp_enabled() and not is_local_dist_rank_0()):
map_location = "meta"
to
if (
(is_deepspeed_zero3_enabled() or is_fsdp_enabled())
and torch.distributed.is_initialized()
and (torch.distributed.get_rank() % 8 != 0)
):
map_location = "meta"
here: https://github.com/huggingface/transformers/blob/29e7a1e1834f331a4916853ecd58549ed78235d6/src/transformers/modeling_utils.py#L512 (this is for 8 GPUs per node; for nodes with 4 GPUs the modulus should be 4, and so on).
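If you would rather not hard-code the number of GPUs per node, the same effect should follow from reusing the is_local_dist_rank_0 helper that the surrounding code already imports; a sketch, untested beyond my own setup:

if (
    (is_deepspeed_zero3_enabled() or is_fsdp_enabled())
    and torch.distributed.is_initialized()
    and not is_local_dist_rank_0()
):
    # Only local rank 0 on each node loads real weights; every other rank gets
    # meta tensors. This generalizes the rank % 8 check above to any node size.
    map_location = "meta"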
System Info
Who can help?
@muellerz @pacman100
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
The training job works on A100 with 1 node and 8 GPUs. It fails when the job uses more than 1 node with the error:
Expected behavior
No error