huggingface / transformers


Weird text encoder NaNs specifically for FSDP + multi GPU #33376

Open christopher-beckham opened 1 week ago

christopher-beckham commented 1 week ago

System Info

Both accelerate and transformers are recent versions, installed fresh from GitHub.

Who can help?

@ArthurZucker @muellerz, since the issue seems to involve the combination of FSDP and the instantiation of the tokenizer classes.

Information

Tasks

Reproduction

I am hitting a strange issue when loading certain models from transformers under multi-GPU. In the toy example below I simply load some tokenizers and text encoders from a pretrained model, and yet, when the script runs under multi-GPU + FSDP, the text encoders end up with NaN weights.

For instance, with this script:

from accelerate import Accelerator
from transformers import CLIPTokenizer, T5EncoderModel, T5TokenizerFast, CLIPTextModel
import torch

# Return True if a tensor contains any NaNs (returns a message string if the
# input is not a tensor).
def has_nan(tensor):
    if not isinstance(tensor, torch.Tensor):
        return f"not a tensor, but a {type(tensor)}"
    return torch.isnan(tensor).any().item()

# Scan every parameter of `model` and print, per rank, the names of any
# parameters containing NaNs; returns True if any were found.
def check_nan_weights(model, mod_name):
    nan_params = []
    for name, param in model.named_parameters():
        if torch.isnan(param.data).any():
            nan_params.append(name)

    if nan_params:
        print(f"[{torch.cuda.current_device()}, {mod_name}]: NaN weights detected in the following parameters:")
        for param_name in nan_params:
            print(f"  - {param_name}")
        return True
    return False

from logging import getLogger
logger = getLogger(__name__)

def load_pipeline(accelerator,
                  pretrained_model_name_or_path: str,
                  load_tokenizers: bool = True,
                  revision: str = None,
                  variant: str = None):

    #with accelerator.main_process_first():

    if load_tokenizers:

        # Load the tokenizers
        tokenizer_one = CLIPTokenizer.from_pretrained(
            pretrained_model_name_or_path,
            subfolder="tokenizer",
            revision=revision,
        )
        tokenizer_two = T5TokenizerFast.from_pretrained(
            pretrained_model_name_or_path,
            subfolder="tokenizer_2",
            revision=revision,
        )

    #accelerator.wait_for_everyone()

    text_encoder_one = CLIPTextModel.from_pretrained(
        pretrained_model_name_or_path, subfolder="text_encoder", 
        revision=revision, variant=variant
    )

    text_encoder_two = T5EncoderModel.from_pretrained(
        pretrained_model_name_or_path, subfolder="text_encoder_2", 
        revision=revision, variant=variant,
    )

    logger.info("check nan weights...")
    check_nan_weights(text_encoder_one, 'te')
    check_nan_weights(text_encoder_two, 'te2')

def main():

    accelerator = Accelerator()

    pipeline = load_pipeline(
        accelerator,
        "black-forest-labs/FLUX.1-dev",
        load_tokenizers=True
    )

if __name__ == "__main__":
    #from torch.multiprocessing import Pool, Process, set_start_method
    #set_start_method('spawn')
    main()

If we run this with 1 GPU via accelerate launch --config_file 1gpu.yml test.py we get no errors. However, with 2 GPUs via accelerate launch --config_file 2gpu.yml test.py we get:

[1, te]: NaN weights detected in the following parameters:
  - text_model.encoder.layers.0.self_attn.k_proj.weight
  - text_model.encoder.layers.0.self_attn.k_proj.bias
  - text_model.encoder.layers.0.self_attn.v_proj.weight
  - text_model.encoder.layers.2.self_attn.out_proj.bias
  - text_model.encoder.layers.2.layer_norm1.weight
  - text_model.encoder.layers.2.layer_norm1.bias
 ...
 ...

Note that if we set load_tokenizers=False in load_pipeline, there are no issues, so it seems to be something to do with the tokenizers. I thought this might be a race-condition-related issue, but when I tried to isolate that behaviour with e.g. accelerator.wait_for_everyone() I still got the same NaNs (the attempt is sketched below).
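
For reference, this is roughly what that synchronization attempt looked like (a minimal sketch reconstructed from the commented-out accelerator.main_process_first() / accelerator.wait_for_everyone() calls in the script above; it produced the same NaN report on rank 1):

from accelerate import Accelerator
from transformers import CLIPTokenizer, T5EncoderModel, T5TokenizerFast, CLIPTextModel

def load_pipeline(accelerator, pretrained_model_name_or_path, revision=None, variant=None):
    # Let the main process hit the hub/cache first; the other ranks wait and
    # then load from the local cache.
    with accelerator.main_process_first():
        tokenizer_one = CLIPTokenizer.from_pretrained(
            pretrained_model_name_or_path, subfolder="tokenizer", revision=revision,
        )
        tokenizer_two = T5TokenizerFast.from_pretrained(
            pretrained_model_name_or_path, subfolder="tokenizer_2", revision=revision,
        )

    # Explicit barrier before the text encoders are loaded, in case the ranks
    # were racing each other during the tokenizer loads above.
    accelerator.wait_for_everyone()

    text_encoder_one = CLIPTextModel.from_pretrained(
        pretrained_model_name_or_path, subfolder="text_encoder",
        revision=revision, variant=variant,
    )
    text_encoder_two = T5EncoderModel.from_pretrained(
        pretrained_model_name_or_path, subfolder="text_encoder_2",
        revision=revision, variant=variant,
    )

    check_nan_weights(text_encoder_one, "te")   # still reports NaNs on rank 1
    check_nan_weights(text_encoder_two, "te2")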

Furthermore, if I just run the script with accelerate launch test.py using a default config (as vanilla as can be: no FSDP, just multi-GPU enabled), there are no errors at all. So this seems to be an issue specifically at the intersection of FSDP and the tokenizer classes.
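
As a quick way to confirm which mode each launch actually ended up in (a small sketch, not part of the repro above), the distributed type can be printed from inside the script:

from accelerate import Accelerator

accelerator = Accelerator()
# With the FSDP config below this prints DistributedType.FSDP on every rank;
# with the default multi-GPU config it prints DistributedType.MULTI_GPU.
print(f"[rank {accelerator.process_index}] {accelerator.distributed_type}")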

My accelerate config file is as follows for 1 GPU; for 2 GPUs, just set num_processes: 2.

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_min_num_params: 100000000
  fsdp_offload_params: false
  # SHARD_GRAD_OP was the previous strategy
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  # SHARDED_STATE_DICT was the old value for above
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Expected behavior

No NaNs.

Zars19 commented 1 week ago

I got the same problem