When Finetuning Llama3, Error occurs

Fine-tuning MPT with the same code works fine, but when I fine-tune Llama (Meta-Llama-3.1-8B-Instruct) I get the following error.

----------Begin global rank 2 STDERR----------
2024-09-02 20:12:15,331: rank2[3924][MainThread]: DEBUG: llmfoundry.command_utils.train: Initializing dist with device...
2024-09-02 20:12:15,606: rank2[3924][MainThread]: DEBUG: llmfoundry.command_utils.train: Testing barrier with device...
2024-09-02 20:12:18,695: rank2[3924][MainThread]: DEBUG: llmfoundry.command_utils.train: Barrier test passed with device.
/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/llmfoundry/utils/config_utils.py:527: UserWarning: Setting `sync_module_states = True` for FSDP. This is required when using mixed initialization.
  warnings.warn((
2024-09-02 20:12:18,698: rank2[3924][MainThread]: INFO: llmfoundry.command_utils.train: Building tokenizer...
2024-09-02 20:12:19,479: rank2[3924][MainThread]: INFO: llmfoundry.command_utils.train: Building train loader...
2024-09-02 20:12:19,480: rank2[3924][MainThread]: INFO: llmfoundry.data.finetuning.tasks: No preprocessor was supplied and no preprocessing function is registered for dataset name "json". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
2024-09-02 20:12:19,480: rank2[3924][MainThread]: DEBUG: llmfoundry.data.finetuning.tasks: Waiting for local_rank 0 to finish data prep
/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/llmfoundry/data/finetuning/tasks.py:998: UserWarning: Dropped 1662 examples where the prompt was longer than 2048, the prompt or response was empty, or the response was all padding tokens.
  warnings.warn(
2024-09-02 20:13:07,740: rank2[3924][MainThread]: DEBUG: llmfoundry.data.finetuning.tasks: All ranks finished data prep
2024-09-02 20:13:07,742: rank2[3924][MainThread]: INFO: llmfoundry.command_utils.train: Building eval loader...
2024-09-02 20:13:07,743: rank2[3924][MainThread]: INFO: llmfoundry.data.finetuning.tasks: No preprocessor was supplied and no preprocessing function is registered for dataset name "json". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
2024-09-02 20:13:07,743: rank2[3924][MainThread]: DEBUG: llmfoundry.data.finetuning.tasks: Waiting for local_rank 0 to finish data prep
2024-09-02 20:13:44,569: rank2[3924][MainThread]: DEBUG: llmfoundry.data.finetuning.tasks: All ranks finished data prep
2024-09-02 20:13:44,570: rank2[3924][MainThread]: INFO: llmfoundry.command_utils.train: Initializing model...
/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py:957: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:924: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
2024-09-02 20:14:38,212: rank2[3924][MainThread]: INFO: llmfoundry.command_utils.train: Building trainer...
2024-09-02 20:14:38,261: rank2[3924][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 19
2024-09-02 20:14:38,263: rank2[3924][MainThread]: INFO: composer.trainer.trainer: Run name: llm
2024-09-02 20:14:38,265: rank2[3924][MainThread]: INFO: composer.core.state: Automatically setting data_parallel_shard to have parallelization degree 4.
2024-09-02 20:14:38,507: rank2[3924][MainThread]: INFO: composer.trainer.trainer: Stepping schedulers every batch. To step schedulers every epoch, set `step_schedulers_every_batch=False`.
2024-09-02 20:14:38,537: rank2[3924][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 17
2024-09-02 20:14:43,115: rank2[3924][MainThread]: DEBUG: composer.utils.reproducibility: Restoring the RNG state
2024-09-02 20:14:43,115: rank2[3924][MainThread]: INFO: composer.trainer.trainer: Setting seed to 19
2024-09-02 20:14:43,115: rank2[3924][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 19
2024-09-02 20:14:43,117: rank2[3924][MainThread]: INFO: llmfoundry.command_utils.train: Logging config
[Eval batch=1/15] Eval on eval data
[Eval batch=2/15] Eval on eval data
[Eval batch=4/15] Eval on eval data
[Eval batch=5/15] Eval on eval data
[Eval batch=7/15] Eval on eval data
[Eval batch=8/15] Eval on eval data
[Eval batch=9/15] Eval on eval data
[Eval batch=11/15] Eval on eval data
[Eval batch=12/15] Eval on eval data
[Eval batch=14/15] Eval on eval data
[Eval batch=15/15] Eval on eval data:
         Eval metrics/eval/LanguageCrossEntropy: 1.7547
         Eval metrics/eval/LanguagePerplexity: 5.7815
         Eval metrics/eval/TokenAccuracy: 0.5610
2024-09-02 20:15:05,595: rank2[3924][MainThread]: INFO: llmfoundry.command_utils.train: Starting training...
2024-09-02 20:15:05,595: rank2[3924][MainThread]: INFO: composer.trainer.trainer: Using precision Precision.AMP_BF16
2024-09-02 20:15:05,596: rank2[3924][MainThread]: DEBUG: composer.trainer.trainer: Spinning the dataloaders
2024-09-02 20:15:05,640: rank2[3924][MainThread]: DEBUG: composer.trainer.trainer: Starting training loop
/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/llmfoundry/data/finetuning/collator.py:367: UserWarning: Truncating sequence of length=2059 to fit max_seq_len=2048. If truncation is a problem, consider increasing max_seq_len.
  warnings.warn(
/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/llmfoundry/data/finetuning/collator.py:367: UserWarning: Truncating sequence of length=2049 to fit max_seq_len=2048. If truncation is a problem, consider increasing max_seq_len.
  warnings.warn(
[rank2]: Traceback (most recent call last):
[rank2]:   File "/workspace/user_code/llm-foundry/scripts/train/train.py", line 9, in <module>
[rank2]:     train_from_yaml(yaml_path, args_list)
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/llmfoundry/command_utils/train.py", line 606, in train_from_yaml
[rank2]:     return train(yaml_cfg)
[rank2]:            ^^^^^^^^^^^^^^^
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/llmfoundry/command_utils/train.py", line 587, in train
[rank2]:     trainer.fit()
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/composer/trainer/trainer.py", line 2340, in fit
[rank2]:     self._train_loop()
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/composer/trainer/trainer.py", line 2550, in _train_loop
[rank2]:     total_loss_dict = self._train_batch(use_grad_scaling)
[rank2]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/composer/trainer/trainer.py", line 2759, in _train_batch
[rank2]:     optimizer.step(
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
[rank2]:     return wrapped(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank2]:     out = func(*args, **kwargs)
[rank2]:           ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank2]:     return func(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/composer/optim/decoupled_weight_decay.py", line 308, in step
[rank2]:     loss = closure()
[rank2]:            ^^^^^^^^^
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/composer/trainer/trainer.py", line 2761, in <lambda>
[rank2]:     **kwargs: self._train_microbatches(microbatches, loss_dict, **kwargs).item(),
[rank2]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/composer/trainer/trainer.py", line 2883, in _train_microbatches
[rank2]:     microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_batch_size, is_final_microbatch)
[rank2]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/composer/trainer/trainer.py", line 2959, in _train_microbatch
[rank2]:     self.state.outputs = self.state.model(self.state.batch)
[rank2]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank2]:     return forward_call(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/composer/models/huggingface.py", line 488, in forward
[rank2]:     output = self.model(**batch)  # type: ignore (thirdparty)
[rank2]:              ^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank2]:     return forward_call(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 857, in forward
[rank2]:     output = self._fsdp_wrapped_module(*args, **kwargs)
[rank2]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank2]:     return forward_call(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/miniconda3/envs/env-3.11.9/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1161, in forward
[rank2]:     logits = logits.float()
[rank2]:              ^^^^^^^^^^^^^^
[rank2]: RuntimeError: r.nvmlDeviceGetNvLinkRemoteDeviceType_ INTERNAL ASSERT FAILED at "../c10/cuda/driver_api.cpp":27, please report a bug to PyTorch. Can't find nvmlDeviceGetNvLinkRemoteDeviceType: /lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetNvLinkRemoteDeviceType
2024-09-02 20:15:07,521: rank2[3924][MainThread]: DEBUG: composer.core.engine: Closing the engine.
2024-09-02 20:15:07,522: rank2[3924][MainThread]: DEBUG: composer.core.engine: Closing callback ConsoleLogger
2024-09-02 20:15:07,522: rank2[3924][MainThread]: DEBUG: composer.core.engine: Closing callback SpeedMonitor
2024-09-02 20:15:07,522: rank2[3924][MainThread]: DEBUG: composer.core.engine: Closing callback LRMonitor
2024-09-02 20:15:07,522: rank2[3924][MainThread]: DEBUG: composer.core.engine: Closing callback MemoryMonitor
2024-09-02 20:15:07,522: rank2[3924][MainThread]: DEBUG: composer.core.engine: Closing callback RuntimeEstimator
2024-09-02 20:15:07,522: rank2[3924][MainThread]: DEBUG: composer.core.engine: Post-closing callback ConsoleLogger
2024-09-02 20:15:07,522: rank2[3924][MainThread]: DEBUG: composer.core.engine: Post-closing callback SpeedMonitor
2024-09-02 20:15:07,522: rank2[3924][MainThread]: DEBUG: composer.core.engine: Post-closing callback LRMonitor
2024-09-02 20:15:07,522: rank2[3924][MainThread]: DEBUG: composer.core.engine: Post-closing callback MemoryMonitor
2024-09-02 20:15:07,522: rank2[3924][MainThread]: DEBUG: composer.core.engine: Post-closing callback RuntimeEstimator
2024-09-02 20:15:07,574: rank2[3924][MainThread]: DEBUG: composer.core.engine: Engine closed.

----------End global rank 2 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 3922) exited with code 1
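
The RuntimeError above reports that `/lib64/libnvidia-ml.so.1` does not export `nvmlDeviceGetNvLinkRemoteDeviceType`. That may mean the installed NVIDIA driver library is older than what this PyTorch build expects, although the log alone does not confirm it. Below is a minimal sketch for checking the symbol directly with `ctypes`. The helper name `library_exports` is hypothetical, and the library name is the one from the error message.

```python
import ctypes


def library_exports(lib_name, symbol):
    """Return True/False if the library loads and does/does not export `symbol`.

    Returns None when the library itself cannot be loaded.
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return None
    # ctypes resolves the symbol on attribute access and raises
    # AttributeError when it is missing, so hasattr() is enough here.
    return hasattr(lib, symbol)


# Library and symbol names are taken from the error message in this issue.
result = library_exports("libnvidia-ml.so.1", "nvmlDeviceGetNvLinkRemoteDeviceType")
print("nvmlDeviceGetNvLinkRemoteDeviceType exported:", result)
```

If this prints `False` on the training nodes, comparing the driver version reported by `nvidia-smi` with the CUDA version PyTorch was built for would be a reasonable next step.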

Environment: CUDA 12.2, PyTorch 2.3.1

My fine-tuning YAML config is:

variables:
  global_seed: 17

  max_seq_len: 2048

  # Run Name
  run_name:  # If left blank, will be read from env var $RUN_NAME

max_seq_len: ${variables.max_seq_len}
run_name: ${variables.run_name}

model:
  name: hf_causal_lm
  init_device: mixed
  pretrained: true
  pretrained_model_name_or_path: /cfs/cfs-ehjtlivr/models/Meta-Llama-3.1-8B-Instruct
  use_flash_attention_2: true

# Tokenizer
tokenizer:
  name: /cfs/cfs-ehjtlivr/models/Meta-Llama-3.1-8B-Instruct
  kwargs:
    model_max_length: ${variables.max_seq_len}

# Dataloaders
train_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      # Note: absolute paths for data_dir are more reliable;
      # relative paths will be interpreted relative to whatever your
      # working directory is when you run `train.py`
      data_dir: ./data/
    split: train
    max_seq_len: ${variables.max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true
    # # Use packing_ratio: 'auto' to automatically profile and select the highest observed packing ratio with
    # # zero waste. In practice, this may result in > 0 waste because profiling is done on only a portion
    # # of the dataset.
    # # Or use `python llmfoundry/scripts/misc/profile_packing.py --yaml-path /path/to/this/yaml/ ...`
    # # to profile this run's optimal packing_ratio as it depends on GPU count,
    # # batch size, sequence length
    # packing_ratio: auto
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0

eval_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      # Note: absolute paths for data_dir are more reliable;
      # relative paths will be interpreted relative to whatever your
      # working directory is when you run `train.py`
      data_dir: ./data/
    split: test
    max_seq_len: ${variables.max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    # packing_ratio:
    shuffle: false
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0

# Optimization
scheduler:
  name: linear_decay_with_warmup  # linear no warmup is HF default which dolly used
  t_warmup: 50ba  # add some warmup though, seems to help with MPT
  alpha_f: 0

optimizer:
  # Based on Dolly
  name: decoupled_adamw
  lr: 5.0e-6
  betas:
  - 0.9
  - 0.999
  eps: 1.0e-8
  weight_decay: 0

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 2ep  # 2-3 epochs seems like the sweet spot
eval_interval: 1ep
# eval_subset_num_batches: -1
eval_first: true
global_train_batch_size: 48  # somewhere in the 6-8 * numgpus range seems good

# System
seed: ${variables.global_seed}
device_eval_batch_size: 8
device_train_microbatch_size: 8
# device_train_microbatch_size: auto
precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true

# Logging
progress_bar: false
log_to_console: true
console_log_interval: 1ba

callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}

# loggers:
#   wandb: {}

# Checkpoint to local filesystem or remote object store
# save_interval: 5000ba
# save_num_checkpoints_to_keep: 1  # Important, this cleans up checkpoints saved to DISK
# save_folder: ./{run_name}/checkpoints
# save_folder: s3://my-bucket/my-folder/{run_name}/checkpoints
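
The traceback ends at `logits = logits.float()`, which makes an fp32 copy of the logits tensor, so it may be worth checking whether memory is tight at that step. The sketch below is a rough size estimate using `device_train_microbatch_size` and `max_seq_len` from the config above. The vocabulary size of 128256 is an assumption for Llama 3.1 and should be confirmed against `vocab_size` in the model's `config.json`. The helper `logits_bytes` is hypothetical, and the estimate ignores activations, gradients, and optimizer state.

```python
# Values from the config in this issue.
device_train_microbatch_size = 8
max_seq_len = 2048

# Assumption: Llama 3.1 vocabulary size; confirm in the model's config.json.
vocab_size = 128_256

GIB = 1024 ** 3


def logits_bytes(batch, seq_len, vocab, bytes_per_element):
    """Size in bytes of a [batch, seq_len, vocab] tensor."""
    return batch * seq_len * vocab * bytes_per_element


bf16_bytes = logits_bytes(device_train_microbatch_size, max_seq_len, vocab_size, 2)
fp32_bytes = logits_bytes(device_train_microbatch_size, max_seq_len, vocab_size, 4)

print(f"bf16 logits: {bf16_bytes / GIB:.2f} GiB")  # 3.91 GiB
print(f"fp32 copy:   {fp32_bytes / GIB:.2f} GiB")  # 7.83 GiB
```

Under these assumptions the fp32 copy alone is close to 8 GiB per device, so rerunning with a smaller `device_train_microbatch_size` is a cheap way to see whether the failure is related to memory.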

mosaicml / llm-foundry

When Finetuning Llama3, Error occurs #1508