microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] RuntimeError encountered when generating tokens from a Meta-Llama-3-8B-Instruct model initialized with 4-bit or 8-bit quantization #5644

Open Atry opened 2 weeks ago

Atry commented 2 weeks ago

Describe the bug
I get the error RuntimeError: probability tensor contains either `inf`, `nan` or element < 0 when calling deepspeed_engine.generate on a Meta-Llama-3-8B-Instruct model initialized with either 4-bit or 8-bit quantization.
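
For context, this exception is raised by torch.multinomial when it is handed a probability tensor containing invalid values, which suggests the quantized forward pass is producing NaN/inf logits. A minimal illustration of the same failure mode, independent of DeepSpeed:

import torch

# Any NaN, inf, or negative entry triggers the identical RuntimeError:
torch.multinomial(torch.tensor([[0.5, float("nan")]]), num_samples=1)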

To Reproduce

Run the following code:

from typing import cast
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
from deepspeed.module_inject.containers.llama import LLAMALayerPolicy
from functools import wraps

if not getattr(LLAMALayerPolicy, "is_get_hidden_heads_patched", False):
    # Apply the monkey patch copied from https://github.com/microsoft/DeepSpeed/pull/5624

    @wraps(LLAMALayerPolicy.get_hidden_heads)
    def patched_get_hidden_heads(self: LLAMALayerPolicy) -> tuple[int, int, float, int]:
        client_module = cast(LlamaDecoderLayer, self.client_module)
        hidden_heads = (
            client_module.self_attn.q_proj.in_features,
            client_module.self_attn.num_heads,
            client_module.input_layernorm.variance_epsilon,
            client_module.mlp.gate_proj.out_features,
        )
        return hidden_heads

    LLAMALayerPolicy.get_hidden_heads = patched_get_hidden_heads
    setattr(LLAMALayerPolicy, "is_get_hidden_heads_patched", True)

from os import environ
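# Emulate a one-process distributed run (no deepspeed launcher): DeepSpeed and
# the transformers integration pick these values up from the environment.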

rank = 0
environ["RANK"] = str(rank)

local_rank = 0
environ["LOCAL_RANK"] = str(local_rank)

world_size = 1
environ["WORLD_SIZE"] = str(world_size)

deepspeed_config = {
    "zero_optimization": {
        "load_from_fp32_weights": False,
        "stage": 3,
        "zero_quantized_weights": True,
        "zero_quantized_nontrainable_weights": True,
    },
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "weight_quantization": {
        "quantized_initialization": {
            # The same error occurs with either 4 or 8 bit quantization.
            # "num_bits": 4,
            "num_bits": 8,
            "group_size": 64,
            "group_dim": 1,
            "symmetric": False,
        }
    },
}

from transformers.integrations.deepspeed import HfDeepSpeedConfig
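# With ZeRO stage 3, this object must be created before from_pretrained and kept
# alive; it signals transformers to load the model directly into partitioned
# (zero.Init) form instead of materializing the full weights first.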

hf_deepspeed_config = HfDeepSpeedConfig(deepspeed_config)

import deepspeed.comm

deepspeed.comm.init_distributed(
    dist_backend="nccl",
    rank=rank,
    world_size=world_size,
    auto_mpi_discovery=False,
    init_method=f"tcp://127.0.0.1:9999",
)

from transformers import AutoModelForCausalLM
import torch

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
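    # deprecated flag; newer transformers expect attn_implementation="flash_attention_2"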
    use_flash_attention_2=True,
)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)

from deepspeed.runtime.config import DeepSpeedConfig
from deepspeed import DeepSpeedEngine
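# Build the engine directly for inference (no optimizer or training loop);
# deepspeed.initialize is the more common entry point.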

deepspeed_engine = DeepSpeedEngine(
    args={},
    model=model,
    config=deepspeed_config,
    config_class=DeepSpeedConfig(deepspeed_config),
)

from transformers import GenerationConfig

generation_config = GenerationConfig.from_pretrained(model_path, max_new_tokens=20)

with torch.no_grad():
    deepspeed_engine.eval()
    print(
        tokenizer.batch_decode(
            deepspeed_engine.generate(
                torch.tensor(
                    [[tokenizer.bos_token_id]],
                    dtype=torch.int,
                    device=deepspeed_engine.device,
                ),
                synced_gpus=True,
                generation_config=generation_config,
            )
        )
    )

Then the output is:

[2024-06-12 06:39:01,111] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
[2024-06-12 06:39:01,813] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-12 06:39:01,814] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Using /home/bo/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/bo/.cache/torch_extensions/py311_cu121/quantizer/build.ninja...
/home/bo/peftai/.venv/lib/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module quantizer...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module quantizer...
Time to load quantizer op: 0.12922883033752441 seconds
Using quantizer for weights: CUDAQuantizer
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2024-06-12 06:39:01,975] [INFO] [partition_parameters.py:562:patch_init_and_builtins] Enable Zero3 engine with INT4 quantization.
[2024-06-12 06:39:02,874] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 291, num_elems = 8.03B
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:09<00:00,  2.45s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-06-12 06:39:13,036] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-06-12 06:39:13,038] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
[2024-06-12 06:39:13,142] [INFO] [utils.py:779:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-06-12 06:39:13,142] [INFO] [utils.py:780:see_memory_usage] MA 8.08 GB         Max_MA 10.05 GB         CA 13.26 GB         Max_CA 13 GB 
[2024-06-12 06:39:13,143] [INFO] [utils.py:787:see_memory_usage] CPU Virtual Memory:  used = 67.62 GB, percent = 13.4%
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2024-06-12 06:39:13,256] [INFO] [utils.py:779:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-06-12 06:39:13,257] [INFO] [utils.py:780:see_memory_usage] MA 8.08 GB         Max_MA 8.08 GB         CA 13.26 GB         Max_CA 13 GB 
[2024-06-12 06:39:13,257] [INFO] [utils.py:787:see_memory_usage] CPU Virtual Memory:  used = 67.62 GB, percent = 13.4%
[2024-06-12 06:39:13,258] [INFO] [config.py:996:print] DeepSpeedEngine configuration:
[2024-06-12 06:39:13,258] [INFO] [config.py:1000:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2024-06-12 06:39:13,258] [INFO] [config.py:1000:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-06-12 06:39:13,258] [INFO] [config.py:1000:print]   amp_enabled .................. False
[2024-06-12 06:39:13,258] [INFO] [config.py:1000:print]   amp_params ................... False
[2024-06-12 06:39:13,258] [INFO] [config.py:1000:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-06-12 06:39:13,258] [INFO] [config.py:1000:print]   bfloat16_enabled ............. False
[2024-06-12 06:39:13,258] [INFO] [config.py:1000:print]   bfloat16_immediate_grad_update  False
[2024-06-12 06:39:13,258] [INFO] [config.py:1000:print]   checkpoint_parallel_write_pipeline  False
[2024-06-12 06:39:13,258] [INFO] [config.py:1000:print]   checkpoint_tag_validation_enabled  True
[2024-06-12 06:39:13,258] [INFO] [config.py:1000:print]   checkpoint_tag_validation_fail  False
[2024-06-12 06:39:13,258] [INFO] [config.py:1000:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f5dda12c890>
[2024-06-12 06:39:13,258] [INFO] [config.py:1000:print]   communication_data_type ...... None
[2024-06-12 06:39:13,258] [INFO] [config.py:1000:print]   compile_config ............... enabled=False backend='inductor' kwargs={}
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   curriculum_enabled_legacy .... False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   curriculum_params_legacy ..... False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   data_efficiency_enabled ...... False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   dataloader_drop_last ......... False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   disable_allgather ............ False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   dump_state ................... False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   dynamic_loss_scale_args ...... None
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   eigenvalue_enabled ........... False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   eigenvalue_gas_boundary_resolution  1
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   eigenvalue_layer_num ......... 0
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   eigenvalue_max_iter .......... 100
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   eigenvalue_stability ......... 1e-06
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   eigenvalue_tol ............... 0.01
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   eigenvalue_verbose ........... False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   elasticity_enabled ........... False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   fp16_auto_cast ............... False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   fp16_enabled ................. True
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   fp16_master_weights_and_gradients  False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   global_rank .................. 0
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   grad_accum_dtype ............. None
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   gradient_accumulation_steps .. 1
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   gradient_clipping ............ 0.0
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   gradient_predivide_factor .... 1.0
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   graph_harvesting ............. False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   initial_dynamic_scale ........ 65536
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   load_universal_checkpoint .... False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   loss_scale ................... 0
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   memory_breakdown ............. False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   mics_hierarchial_params_gather  False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   mics_shard_size .............. -1
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   optimizer_legacy_fusion ...... False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   optimizer_name ............... None
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   optimizer_params ............. None
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   pld_enabled .................. False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   pld_params ................... False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   prescale_gradients ........... False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   scheduler_name ............... None
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   scheduler_params ............. None
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   seq_parallel_communication_data_type  torch.float32
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   sparse_attention ............. None
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   sparse_gradients_enabled ..... False
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   steps_per_print .............. 10
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   train_batch_size ............. 1
[2024-06-12 06:39:13,259] [INFO] [config.py:1000:print]   train_micro_batch_size_per_gpu  1
[2024-06-12 06:39:13,260] [INFO] [config.py:1000:print]   use_data_before_expert_parallel_  False
[2024-06-12 06:39:13,260] [INFO] [config.py:1000:print]   use_node_local_storage ....... False
[2024-06-12 06:39:13,260] [INFO] [config.py:1000:print]   wall_clock_breakdown ......... False
[2024-06-12 06:39:13,260] [INFO] [config.py:1000:print]   weight_quantization_config ... q_type='symmetric' q_groups=1 enabled=True num_bits=8 quantized_initialization={'num_bits': 8, 'group_size': 64, 'group_dim': 1, 'symmetric': False} post_init_quant={}
[2024-06-12 06:39:13,260] [INFO] [config.py:1000:print]   world_size ................... 1
[2024-06-12 06:39:13,260] [INFO] [config.py:1000:print]   zero_allow_untested_optimizer  False
[2024-06-12 06:39:13,260] [INFO] [config.py:1000:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=False elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=True zero_quantized_nontrainable_weights=True zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-06-12 06:39:13,260] [INFO] [config.py:1000:print]   zero_enabled ................. True
[2024-06-12 06:39:13,260] [INFO] [config.py:1000:print]   zero_force_ds_cpu_optimizer .. True
[2024-06-12 06:39:13,260] [INFO] [config.py:1000:print]   zero_optimization_stage ...... 3
[2024-06-12 06:39:13,260] [INFO] [config.py:986:print_user_config]   json = {
    "zero_optimization": {
        "load_from_fp32_weights": false, 
        "stage": 3, 
        "zero_quantized_weights": true, 
        "zero_quantized_nontrainable_weights": true
    }, 
    "train_micro_batch_size_per_gpu": 1, 
    "fp16": {
        "enabled": true
    }, 
    "weight_quantization": {
        "quantized_initialization": {
            "num_bits": 8, 
            "group_size": 64, 
            "group_dim": 1, 
            "symmetric": false
        }
    }
}
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/bo/peftai/.deepspeed_llama3_8b.py", line 103, in <module>
[rank0]:     deepspeed_engine.generate(
[rank0]:   File "/home/bo/peftai/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/bo/peftai/.venv/lib/python3.11/site-packages/transformers/generation/utils.py", line 1622, in generate
[rank0]:     result = self._sample(
[rank0]:              ^^^^^^^^^^^^^
[rank0]:   File "/home/bo/peftai/.venv/lib/python3.11/site-packages/transformers/generation/utils.py", line 2829, in _sample
[rank0]:     next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Expected behavior
No error.

ds_report output

[2024-06-12 06:42:07,584] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fp_quantizer ........... [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/bo/peftai/.venv/lib/python3.11/site-packages/torch']
torch version .................... 2.3.0+cu121
deepspeed install path ........... ['/home/bo/peftai/.venv/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.14.2, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.2
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
shared memory (/dev/shm) size .... 251.77 GB

Screenshots
Not applicable.

System info (please complete the following information):

Launcher context
Plain python CLI, not the deepspeed launcher.

Docker context
Not using Docker.

Additional context
Installed package versions (pip freeze):

accelerate==0.23.0
aiofiles==23.2.1
aiohttp==3.8.6
aiohttp-cors==0.7.0
aiosignal==1.3.13
annotated-types==0.6.0
anyio==4.3.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.0
async-lru==2.0.4
async-timeout==4.0.3
asyncstdlib==3.10.9
attrs==23.1.0
autoawq==0.2.5
autoawq_kernels==0.0.6
autoflake==2.2.1
azure-cli==2.60.0
Babel==2.14.0
backcall==0.2.0
beautifulsoup4==4.12.2
bitsandbytes==0.43.0
black==24.3.0
bleach==6.1.0
cached_classproperty==1.0.1
cachetools==5.3.1
certifi==2023.7.22
cffi==1.16.0
charset-normalizer==3.3.0
click==8.1.7
cloudpickle==3.0.0
cmake==3.29.2
colorful==0.5.6
comm==0.1.4
coverage==7.5.1
cryptography==41.0.4
datasets==2.18.0
debugpy==1.8.1
decorator==5.1.1
deepmerge==2.0b0
deepspeed==0.14.2
defusedxml==0.7.1
dill==0.3.8
diskcache==5.6.3
distlib==0.3.8
distro==1.9.0
ecdsa==0.18.0
einops==0.7.0
executing==2.0.0
fastapi==0.110.0
fastjsonschema==2.18.1
filelock==3.12.4
flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.8/flash_attn-2.5.8+cu122torch2.3cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
fqdn==1.5.1
frozenlist==1.4.0
fsspec==2023.9.2
google-api-core==2.8.0
google-auth==2.29.0
googleapis-common-protos==1.56.1
gptcache==0.1.42
grpcio==1.63.0
guidance==0.0.64
h11==0.14.0
hiredis==2.2.3
hjson==3.1.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.19.4
idna==3.4
immutables==0.20
iniconfig==2.0.0
interegular==0.3.3
ipykernel==6.25.2
ipython==8.16.1
ipywidgets==8.1.2
isoduration==20.11.0
isort==5.13.2
jaraco.functools==3.9.0
jedi==0.19.1
Jinja2==3.1.2
joblib==1.3.2
json5==0.9.24
jsonpointer==2.4
jsonschema==4.19.1
jsonschema-specifications==2023.7.1
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.10.0
jupyter-lsp==2.2.4
jupyter_client==8.4.0
jupyter_core==5.4.0
jupyter_server==2.13.0
jupyter_server_terminals==0.5.3
jupyterlab==4.1.5
jupyterlab-pygments==0.2.2
jupyterlab_server==2.25.4
jupyterlab_widgets==3.0.10
lark==1.1.9
lazy-object-proxy==1.10.0
linkify-it-py==2.0.3
llvmlite==0.42.0
lm-format-enforcer==0.9.8
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matplotlib-inline==0.1.6
mdit-py-plugins==0.4.1
mdurl==0.1.2
memray==1.12.0
mistune==3.0.2
more-itertools==9.1.0
mpmath==1.3.0
msal==1.24.1
msgpack==1.0.8
multidict==6.0.4
multiprocess==0.70.16
mypy-extensions==1.0.0
nbclient==0.8.0
nbconvert==7.9.2
nbformat==5.9.2
nbval==0.11.0
nest-asyncio==1.5.8
networkx==3.1
ninja==1.11.1.1
nodeenv==1.8.0
notebook==7.1.2
notebook_shim==0.2.4
numba==0.59.1
numpy==1.26.0
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.550.52
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu12==12.1.105
openai==1.25.2
opencensus==0.11.4
opencensus-context==0.1.3
outlines==0.0.34
overrides==7.7.0
packaging==23.2
pandas==2.2.1
pandocfilters==1.5.0
parso==0.8.3
pathspec==0.12.1
peft==0.5.0
pexpect==4.8.0
pickleshare==0.7.5
platformdirs==3.11.0
pluggy==1.5.0
poetry==1.8.3
pre_commit==3.7.1
prometheus-fastapi-instrumentator==7.0.0
prometheus_client==0.20.0
prompt-toolkit==3.0.39
protobuf==5.26.0
psutil==5.9.5
ptyprocess==0.7.0
pure-eval==0.2.2
py-cord==2.4.1
py-cpuinfo==9.0.0
py-spy==0.3.14
pyarrow==15.0.2
pyarrow-hotfix==0.6
pyasn1==0.5.0
pyasn1_modules==0.4.0
pycparser==2.21
pydantic==2.7.3
pydantic_core==2.18.4
pyflakes==3.1.0
pyflyby==1.9.2
Pygments==2.16.1
pygtrie==2.5.0
PyJWT==2.8.0
pynvml==11.5.0
pyparsing==3.1.1
pyright==1.1.359
PySide6==6.6.3
PySide6_Addons==6.6.3
PySide6_Essentials==6.6.3
pytest==8.2.0
python-dateutil==2.8.2
python-dotenv==1.0.1
python-jose==3.3.0
python-json-logger==2.0.7
python-ulid==1.1.0
pytz==2024.1
pyxll==5.8.0
pyxll_jupyter==0.5.2
PyYAML==6.0.1
pyzmq==25.1.1
qtconsole==5.5.1
QtPy==2.4.1
ray==2.23.0
redis==4.6.0
redis-om==0.3.1
referencing==0.30.2
regex==2023.10.3
requests==2.31.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.7.1
rpds-py==0.10.6
rsa==4.9
safetensors==0.4.2
scipy==1.11.3
Send2Trash==1.8.2
sentencepiece==0.2.0
shiboken6==6.6.3
six==1.16.0
smart-open==7.0.4
sniffio==1.3.1
soupsieve==2.5
stack-data==0.6.3
starlette==0.36.3
sympy==1.12
terminado==0.18.1
textual==0.65.2
tiktoken==0.6.0
tinycss2==1.2.1
tokenizers==0.19.1
toml==0.10.2
torch==2.3.0
tornado==6.3.3
tqdm==4.66.1
traitlets==5.11.2
transformers==4.40.1
triton==2.3.0
typeguard==4.1.5
types-pyOpenSSL==23.2.0.2
types-python-dateutil==2.9.0.20240316
types-redis==4.6.0.7
typing_extensions==4.8.0
tzdata==2024.1
uc-micro-py==1.0.3
uri-template==1.3.0
urllib3==2.0.6
uvicorn==0.29.0
uvloop==0.19.0
virtualenv==20.26.2
vllm==0.4.2
vllm_nccl_cu12==2.18.1.0.4.0
vulnix==1.10.2.dev0
watchfiles==0.21.0
wcwidth==0.2.8
webcolors==1.13
webencodings==0.5.1
websocket-client==1.7.0
websockets==12.0
widgetsnbextension==4.0.10
wrapt==1.16.0
xformers==0.0.26.post1
xxhash==3.4.1
yarl==1.9.2
zstandard==0.22.0
Atry commented 2 weeks ago

Note that this bug is specific to meta-llama/Meta-Llama-3-8B-Instruct: if I replace it with kevin009/babyllama-v0.6, no error is raised.

Atry commented 2 weeks ago

Also, if I comment out the weight_quantization configuration, no error is raised.
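
For reference, a sketch of that working variant of the config (everything else in the script unchanged):

deepspeed_config = {
    "zero_optimization": {
        "load_from_fp32_weights": False,
        "stage": 3,
        "zero_quantized_weights": True,
        "zero_quantized_nontrainable_weights": True,
    },
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    # "weight_quantization": {...},  # commented out: generation then succeeds
}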