microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Command '['ninja', '-v']' returned non-zero exit status 1 - Unsupported NVHPC compiler found #6654

Closed qmin2 closed 2 weeks ago

qmin2 commented 3 weeks ago

I encountered multiple issues while trying to do full fine-tuning of the LLaMA 3 8B model with DeepSpeed on two A100-80GB GPUs.

To isolate the problem, I decided to follow the DeepSpeed tutorial on Hugging Face.

Below is the command I used, which closely follows the example in the tutorial:

deepspeed examples/pytorch/translation/run_translation.py \
--deepspeed ds_config_zero3.json \
--model_name_or_path t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir --fp16 \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro
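Since (as the log further down shows) the crash happens while JIT-compiling DeepSpeed's fused_adam op, a small stdlib-only sketch can show which C++ compiler the extension build is likely to pick up. Assumption: `torch.utils.cpp_extension` honors the `CXX` environment variable and otherwise falls back to the platform default `c++` on `PATH` (the compiler warning in `torch/utils/cpp_extension.py` is triggered by exactly this choice).

```python
import os
import shutil

# Diagnostic sketch: guess which C++ compiler PyTorch's JIT extension
# builder will use. Assumption: it prefers the CXX environment variable
# and otherwise falls back to whatever `c++` (then `g++`) resolves to
# on PATH. On the cluster above this resolved to the NVHPC nvc++.
compiler = os.environ.get("CXX") or shutil.which("c++") or shutil.which("g++")
print("candidate extension compiler:", compiler)
```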

And this is ds_config_zero3.json:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
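One note on the config: the `"auto"` entries are literal JSON strings that the Hugging Face Trainer integration resolves (batch sizes, learning rate, bucket sizes, and so on) before calling `deepspeed.initialize`. A minimal sketch of what actually gets parsed from the file:

```python
import json

# Minimal sketch: "auto" placeholders are plain JSON strings; the
# Hugging Face Trainer replaces them with concrete values at runtime.
# Parsing an excerpt of the config above shows they survive parsing
# as the literal string "auto":
excerpt = json.loads(
    '{"train_batch_size": "auto", "zero_optimization": {"stage": 3}}'
)
print(excerpt["train_batch_size"], excerpt["zero_optimization"]["stage"])
```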

Then I got this error:

[2024-10-23 11:22:53,732] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1: setting --include=localhost:0,1
[2024-10-23 11:22:53,732] [INFO] [runner.py:568:main] cmd = /home/qmin2/anaconda3/envs/biicae/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_translation.py --deepspeed ds_config.json --model_name_or_path t5-small --per_device_train_batch_size 1 --output_dir output_dir --overwrite_output_dir --fp16 --do_train --max_train_samples 500 --num_train_epochs 1 --dataset_name wmt16 --dataset_config ro-en --source_lang en --target_lang ro
[2024-10-23 11:22:56,867] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-23 11:22:58,128] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-10-23 11:22:58,128] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-10-23 11:22:58,128] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-10-23 11:22:58,128] [INFO] [launch.py:163:main] dist_world_size=2
[2024-10-23 11:22:58,128] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-10-23 11:22:58,138] [INFO] [launch.py:253:main] process 3332850 spawned with command: ['/home/qmin2/anaconda3/envs/biicae/bin/python3.9', '-u', 'run_translation.py', '--local_rank=0', '--deepspeed', 'ds_config.json', '--model_name_or_path', 't5-small', '--per_device_train_batch_size', '1', '--output_dir', 'output_dir', '--overwrite_output_dir', '--fp16', '--do_train', '--max_train_samples', '500', '--num_train_epochs', '1', '--dataset_name', 'wmt16', '--dataset_config', 'ro-en', '--source_lang', 'en', '--target_lang', 'ro']
[2024-10-23 11:22:58,154] [INFO] [launch.py:253:main] process 3332851 spawned with command: ['/home/qmin2/anaconda3/envs/biicae/bin/python3.9', '-u', 'run_translation.py', '--local_rank=1', '--deepspeed', 'ds_config.json', '--model_name_or_path', 't5-small', '--per_device_train_batch_size', '1', '--output_dir', 'output_dir', '--overwrite_output_dir', '--fp16', '--do_train', '--max_train_samples', '500', '--num_train_epochs', '1', '--dataset_name', 'wmt16', '--dataset_config', 'ro-en', '--source_lang', 'en', '--target_lang', 'ro']
[2024-10-23 11:23:03,294] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-23 11:23:03,580] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-10-23 11:23:03,580] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-10-23 11:23:03,619] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-23 11:23:03,874] [INFO] [comm.py:637:init_distributed] cdb=None
10/23/2024 11:23:05 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
10/23/2024 11:23:05 - INFO - __main__ - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=ds_config.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=no,
eval_use_gather_object=False,
evaluation_strategy=None,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_for_metrics=[],
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=output_dir/runs/Oct23_11-23-02_n57.gasi-cluster,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=1.0,
optim=adamw_torch,
optim_args=None,
optim_target_modules=None,
output_dir=output_dir,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
predict_with_generate=False,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['wandb'],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=output_dir,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=500,
save_strategy=steps,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
sortish_sampler=False,
split_batches=None,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_liger_kernel=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
10/23/2024 11:23:06 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: True
Overwrite dataset info from restored data version if exists.
10/23/2024 11:23:16 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists.
Loading Dataset info from /home/qmin2/.cache/huggingface/datasets/wmt16/ro-en/0.0.0/41d8a4013aa1489f28fea60ec0932af246086482
10/23/2024 11:23:16 - INFO - datasets.info - Loading Dataset info from /home/qmin2/.cache/huggingface/datasets/wmt16/ro-en/0.0.0/41d8a4013aa1489f28fea60ec0932af246086482
Found cached dataset wmt16 (/home/qmin2/.cache/huggingface/datasets/wmt16/ro-en/0.0.0/41d8a4013aa1489f28fea60ec0932af246086482)
10/23/2024 11:23:16 - INFO - datasets.builder - Found cached dataset wmt16 (/home/qmin2/.cache/huggingface/datasets/wmt16/ro-en/0.0.0/41d8a4013aa1489f28fea60ec0932af246086482)
Loading Dataset info from /home/qmin2/.cache/huggingface/datasets/wmt16/ro-en/0.0.0/41d8a4013aa1489f28fea60ec0932af246086482
10/23/2024 11:23:16 - INFO - datasets.info - Loading Dataset info from /home/qmin2/.cache/huggingface/datasets/wmt16/ro-en/0.0.0/41d8a4013aa1489f28fea60ec0932af246086482
[INFO|configuration_utils.py:679] 2024-10-23 11:23:16,977 >> loading configuration file config.json from cache at /home/qmin2/.cache/huggingface/hub/models--t5-small/snapshots/df1b051c49625cf57a3d0d8d3863ed4d13564fe4/config.json
[INFO|configuration_utils.py:746] 2024-10-23 11:23:16,984 >> Model config T5Config {
  "_name_or_path": "t5-small",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "classifier_dropout": 0.0,
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to German: "
    },
    "translation_en_to_fr": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to French: "
    },
    "translation_en_to_ro": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to Romanian: "
    }
  },
  "transformers_version": "4.46.0.dev0",
  "use_cache": true,
  "vocab_size": 32128
}

[INFO|tokenization_utils_base.py:2211] 2024-10-23 11:23:17,206 >> loading file spiece.model from cache at /home/qmin2/.cache/huggingface/hub/models--t5-small/snapshots/df1b051c49625cf57a3d0d8d3863ed4d13564fe4/spiece.model
[INFO|tokenization_utils_base.py:2211] 2024-10-23 11:23:17,206 >> loading file tokenizer.json from cache at /home/qmin2/.cache/huggingface/hub/models--t5-small/snapshots/df1b051c49625cf57a3d0d8d3863ed4d13564fe4/tokenizer.json
[INFO|tokenization_utils_base.py:2211] 2024-10-23 11:23:17,206 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2211] 2024-10-23 11:23:17,206 >> loading file special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:2211] 2024-10-23 11:23:17,206 >> loading file tokenizer_config.json from cache at /home/qmin2/.cache/huggingface/hub/models--t5-small/snapshots/df1b051c49625cf57a3d0d8d3863ed4d13564fe4/tokenizer_config.json
[INFO|modeling_utils.py:3936] 2024-10-23 11:23:17,446 >> loading weights file model.safetensors from cache at /home/qmin2/.cache/huggingface/hub/models--t5-small/snapshots/df1b051c49625cf57a3d0d8d3863ed4d13564fe4/model.safetensors
[INFO|modeling_utils.py:4079] 2024-10-23 11:23:17,453 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[INFO|configuration_utils.py:1099] 2024-10-23 11:23:17,458 >> Generate config GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0
}

[2024-10-23 11:23:18,839] [INFO] [partition_parameters.py:343:__exit__] finished initializing model - num_params = 132, num_elems = 0.08B
[INFO|modeling_utils.py:4799] 2024-10-23 11:23:19,120 >> All model checkpoint weights were used when initializing T5ForConditionalGeneration.

[INFO|modeling_utils.py:4807] 2024-10-23 11:23:19,120 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
[INFO|configuration_utils.py:1054] 2024-10-23 11:23:19,341 >> loading configuration file generation_config.json from cache at /home/qmin2/.cache/huggingface/hub/models--t5-small/snapshots/df1b051c49625cf57a3d0d8d3863ed4d13564fe4/generation_config.json
[INFO|configuration_utils.py:1099] 2024-10-23 11:23:19,341 >> Generate config GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0
}

[INFO|modeling_utils.py:2230] 2024-10-23 11:23:19,361 >> You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 32100. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
Loading cached processed dataset at /home/qmin2/.cache/huggingface/datasets/wmt16/ro-en/0.0.0/41d8a4013aa1489f28fea60ec0932af246086482/cache-8862cd207eac132f.arrow
10/23/2024 11:23:19 - INFO - datasets.arrow_dataset - Loading cached processed dataset at /home/qmin2/.cache/huggingface/datasets/wmt16/ro-en/0.0.0/41d8a4013aa1489f28fea60ec0932af246086482/cache-8862cd207eac132f.arrow
10/23/2024 11:23:20 - WARNING - accelerate.utils.other - Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:688] 2024-10-23 11:23:21,317 >> Using auto half precision backend
[2024-10-23 11:23:21,487] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.2, git-hash=unknown, git-branch=unknown
[2024-10-23 11:23:21,493] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /home/qmin2/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/utils/cpp_extension.py:362: UserWarning: 

                               !! WARNING !!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (/opt/ohpc/pub/apps/nvidia/hpc_sdk/Linux_x86_64/22.2/compilers/bin/nvc++) is not compatible with the compiler Pytorch was
built with for this platform, which is g++ on linux. Please
use g++ to to compile your extension. Alternatively, you may
compile PyTorch from source using /opt/ohpc/pub/apps/nvidia/hpc_sdk/Linux_x86_64/22.2/compilers/bin/nvc++, and then you can also use
/opt/ohpc/pub/apps/nvidia/hpc_sdk/Linux_x86_64/22.2/compilers/bin/nvc++ to compile your extension.

See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
with compiling PyTorch from source.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                              !! WARNING !!

  warnings.warn(WRONG_COMPILER_WARNING.format(
Detected CUDA files, patching ldflags
Emitting ninja build file /home/qmin2/.cache/torch_extensions/py39_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /opt/ohpc/pub/apps/nvidia/hpc_sdk/Linux_x86_64/22.2/compilers/bin/nvc++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/include -isystem /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/include/TH -isystem /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/include/THC -isystem /opt/ohpc/pub/apps/cuda/11.8/include -isystem /home/qmin2/anaconda3/envs/biicae/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o 
FAILED: fused_adam_frontend.o 
/opt/ohpc/pub/apps/nvidia/hpc_sdk/Linux_x86_64/22.2/compilers/bin/nvc++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/include -isystem /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/include/TH -isystem /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/include/THC -isystem /opt/ohpc/pub/apps/cuda/11.8/include -isystem /home/qmin2/anaconda3/envs/biicae/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o 
nvc++-Error-Unknown switch: -Wno-reorder
[2/3] /opt/ohpc/pub/apps/cuda/11.8/bin/nvcc  -ccbin /opt/ohpc/pub/apps/nvidia/hpc_sdk/Linux_x86_64/22.2/compilers/bin/nvc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/include -isystem /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/include/TH -isystem /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/include/THC -isystem /opt/ohpc/pub/apps/cuda/11.8/include -isystem /home/qmin2/anaconda3/envs/biicae/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -std=c++17 -c /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
FAILED: multi_tensor_adam.cuda.o 
/opt/ohpc/pub/apps/cuda/11.8/bin/nvcc  -ccbin /opt/ohpc/pub/apps/nvidia/hpc_sdk/Linux_x86_64/22.2/compilers/bin/nvc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/include -isystem /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/include/TH -isystem /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/include/THC -isystem /opt/ohpc/pub/apps/cuda/11.8/include -isystem /home/qmin2/anaconda3/envs/biicae/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -std=c++17 -c /home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
nvcc fatal   : Unsupported NVHPC compiler found. nvc++ is the only NVHPC compiler that is supported.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
    subprocess.run(
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/qmin2/3rd_semester_research/qmin2_infini_attention/run_translation.py", line 699, in <module>
    main()
  File "/home/qmin2/3rd_semester_research/qmin2_infini_attention/run_translation.py", line 614, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/transformers/trainer.py", line 2112, in train
    return inner_training_loop(
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/transformers/trainer.py", line 2267, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/accelerate/accelerator.py", line 1219, in prepare
    result = self._prepare_deepspeed(*args)
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/accelerate/accelerator.py", line 1604, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/__init__.py", line 176, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 307, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1231, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1308, in _configure_basic_optimizer
    optimizer = FusedAdam(
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 478, in load
    return self.jit_load(verbose)
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 522, in jit_load
    op_module = load(name=self.name,
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1308, in load
    return _jit_compile(
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1710, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1823, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
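The failure above happens because PyTorch's JIT extension builder picked up `nvc++` from the NVIDIA HPC SDK instead of `g++`. A possible workaround (untested sketch; assumes `g++` is available on the node) is to pin the compiler via the `CC`/`CXX` environment variables, which `torch.utils.cpp_extension` honors, and to clear the failed build cache before relaunching:

```shell
# Point the extension build at GCC instead of the NVHPC compilers
# (CC/CXX are read by torch.utils.cpp_extension when it writes build.ninja).
export CC=gcc
export CXX=g++

# Remove the stale, failed fused_adam build so it is recompiled from scratch
rm -rf ~/.cache/torch_extensions

# Then re-run the original deepspeed command, e.g.:
# deepspeed examples/pytorch/translation/run_translation.py --deepspeed ds_config_zero3.json ...
```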
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/qmin2/3rd_semester_research/qmin2_infini_attention/run_translation.py", line 699, in <module>
    main()
  File "/home/qmin2/3rd_semester_research/qmin2_infini_attention/run_translation.py", line 614, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/transformers/trainer.py", line 2112, in train
    return inner_training_loop(
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/transformers/trainer.py", line 2267, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/accelerate/accelerator.py", line 1219, in prepare
    result = self._prepare_deepspeed(*args)
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/accelerate/accelerator.py", line 1604, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/__init__.py", line 176, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 307, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1231, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1308, in _configure_basic_optimizer
    optimizer = FusedAdam(
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 478, in load
    return self.jit_load(verbose)
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 522, in jit_load
    op_module = load(name=self.name,
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1308, in load
    return _jit_compile(
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1710, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1823, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/qmin2/anaconda3/envs/biicae/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
[2024-10-23 11:23:24,181] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3332850
[2024-10-23 11:23:24,181] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3332851
[2024-10-23 11:23:24,525] [ERROR] [launch.py:322:sigkill_handler] ['/home/qmin2/anaconda3/envs/biicae/bin/python3.9', '-u', 'run_translation.py', '--local_rank=1', '--deepspeed', 'ds_config.json', '--model_name_or_path', 't5-small', '--per_device_train_batch_size', '1', '--output_dir', 'output_dir', '--overwrite_output_dir', '--fp16', '--do_train', '--max_train_samples', '500', '--num_train_epochs', '1', '--dataset_name', 'wmt16', '--dataset_config', 'ro-en', '--source_lang', 'en', '--target_lang', 'ro'] exits with return code = 1

For your information, I'm running this on a Slurm cluster in interactive mode.

GPU: A100-80GB x 2
gcc --version: 12.2.0
nvcc --version: 11.8
nvc++ --version: nvc++ 22.2-0 64-bit target on x86-64 Linux -tp zen3

nvidia-smi shows: NVIDIA-SMI 530.30.02, Driver Version: 530.30.02, CUDA Version: 12.1

This is the output of `pip list`:

Package                  Version
------------------------ ------------
accelerate               1.0.1
aiohappyeyeballs         2.4.3
aiohttp                  3.10.10
aiosignal                1.3.1
annotated-types          0.7.0
async-timeout            4.0.3
attrs                    24.2.0
certifi                  2024.8.30
charset-normalizer       3.4.0
colorama                 0.4.6
datasets                 3.0.2
deepspeed                0.15.3
dill                     0.3.8
evaluate                 0.4.3
filelock                 3.13.1
frozenlist               1.4.1
fsspec                   2024.2.0
hjson                    3.1.0
huggingface-hub          0.26.1
idna                     3.10
Jinja2                   3.1.3
lxml                     5.3.0
MarkupSafe               2.1.5
mpmath                   1.3.0
msgpack                  1.1.0
multidict                6.1.0
multiprocess             0.70.16
networkx                 3.2.1
ninja                    1.11.1.1
numpy                    1.26.3
nvidia-cublas-cu11       11.11.3.6
nvidia-cuda-cupti-cu11   11.8.87
nvidia-cuda-nvrtc-cu11   11.8.89
nvidia-cuda-runtime-cu11 11.8.89
nvidia-cudnn-cu11        9.1.0.70
nvidia-cufft-cu11        10.9.0.58
nvidia-curand-cu11       10.3.0.86
nvidia-cusolver-cu11     11.4.1.48
nvidia-cusparse-cu11     11.7.5.86
nvidia-nccl-cu11         2.21.5
nvidia-nvtx-cu11         11.8.86
packaging                24.1
pandas                   2.2.3
pillow                   10.2.0
pip                      24.2
portalocker              2.10.1
propcache                0.2.0
psutil                   6.1.0
py-cpuinfo               9.0.0
pyarrow                  17.0.0
pydantic                 2.9.2
pydantic_core            2.23.4
pynvml                   11.5.3
python-dateutil          2.9.0.post0
pytz                     2024.2
PyYAML                   6.0.2
regex                    2024.9.11
requests                 2.32.3
sacrebleu                2.4.3
safetensors              0.4.5
setuptools               75.1.0
six                      1.16.0
sympy                    1.13.1
tabulate                 0.9.0
tokenizers               0.20.1
torch                    2.5.0+cu118
torchaudio               2.5.0+cu118
torchvision              0.20.0+cu118
tqdm                     4.66.5
transformers             4.46.0.dev0
triton                   3.1.0
typing_extensions        4.9.0
tzdata                   2024.2
urllib3                  2.2.3
wheel                    0.44.0
xxhash                   3.5.0
yarl                     1.16.0

I've spent a lot of time trying to resolve this issue. Is there any solution for this?

loadams commented 2 weeks ago

Hi @qmin2 - the error doesn't look like it comes from DeepSpeed; the underlying failure is this `nvcc fatal` error:

nvcc fatal   : Unsupported NVHPC compiler found. nvc++ is the only NVHPC compiler that is supported.

I believe this means you have an outdated nvcc or nvc++, so you should try updating them and then run again. If needed, please share the versions of both.
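As a sketch of one possible workaround (not an official fix, and the exact paths are assumptions about your cluster): PyTorch's `torch.utils.cpp_extension` reads the `CC`/`CXX` environment variables when generating the ninja build, and `CC` is what gets passed to nvcc as `-ccbin`. Your log shows `-ccbin .../nvc`, i.e. `CC` currently points at the NVHPC C compiler, which nvcc rejects as a host compiler. Pointing both at GCC before launching may sidestep the JIT-build failure:

```shell
# Assumption: GCC 12.2 is on PATH on this cluster (e.g. via a module load).
# CXX becomes the C++ compiler for the extension sources; CC is forwarded to
# nvcc via -ccbin, replacing the unsupported "nvc" host compiler in the log.
export CC=gcc
export CXX=g++

# Clear the stale, partially built extension so ninja starts from scratch
# (this is the default TORCH_EXTENSIONS_DIR location; adjust if you set it):
rm -rf ~/.cache/torch_extensions/*/fused_adam

# Then re-run the same launch command, e.g.:
# deepspeed examples/pytorch/translation/run_translation.py --deepspeed ds_config_zero3.json ...
```

Alternatively, prebuilding the op at install time (`DS_BUILD_FUSED_ADAM=1 pip install deepspeed`) in an environment where GCC is the default compiler avoids the JIT path entirely.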

loadams commented 2 weeks ago

@qmin2 - closing for now, since this doesn't appear to be a DeepSpeed issue. If you are still having problems after resolving the NVHPC compiler issue, please comment here and we can re-open it. Thanks!