hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: FileNotFoundError: [Errno 2] No such file or directory: '/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/extensions/pybind/inference/inference.cpp' #5805

Closed: teis-e closed this issue 3 weeks ago

teis-e commented 3 weeks ago

Is there an existing issue for this bug?

🐛 Describe the bug

$ ls ~/llama8b/TensorRT-LLM/Meta-Llama-3-8B-Instruct
config.json             LICENSE                           model-00002-of-00004.safetensors  model-00004-of-00004.safetensors  original   special_tokens_map.json  tokenizer.json
generation_config.json  model-00001-of-00004.safetensors  model-00003-of-00004.safetensors  model.safetensors.index.json      README.md  tokenizer_config.json    USE_POLICY.md

~/ColossalAI/examples/inference/llama$ colossalai run --nproc_per_node 1 llama_generation.py -m "~/llama8b/TensorRT-LLM/Meta-Llama-3-8B-Instruct" --max_length 80

Output:

/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/utils/_pytree.py:300: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/extensions/utils.py:96: UserWarning: [extension] The CUDA version on the system (12.4) does not match with the version (12.1) torch was compiled with. The mismatch is found in the minor version. As the APIs are compatible, we will allow compilation to proceed. If you encounter any issue when using the built kernel, please try to build it again with fully matched CUDA versions
  warnings.warn(
[extension] Compiling the JIT inference_ops_cuda kernel during runtime now
Traceback (most recent call last):
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/extensions/cpp_extension.py", line 132, in load
    op_kernel = self.import_op()
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/extensions/cpp_extension.py", line 61, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'colossalai._C.inference_ops_cuda'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sw/ColossalAI/examples/inference/llama/llama_generation.py", line 8, in <module>
    from colossalai.inference.config import InferenceConfig
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/inference/__init__.py", line 2, in <module>
    from .core import InferenceEngine
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/inference/core/__init__.py", line 1, in <module>
    from .engine import InferenceEngine
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/inference/core/engine.py", line 23, in <module>
    from colossalai.inference.modeling.policy import model_policy_map
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/inference/modeling/policy/__init__.py", line 2, in <module>
    from .nopadding_baichuan import NoPaddingBaichuanModelInferPolicy
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/inference/modeling/policy/nopadding_baichuan.py", line 6, in <module>
    from colossalai.inference.modeling.models.nopadding_baichuan import (
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/inference/modeling/models/nopadding_baichuan.py", line 11, in <module>
    from colossalai.inference.modeling.models.nopadding_llama import NopadLlamaMLP
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/inference/modeling/models/nopadding_llama.py", line 35, in <module>
    inference_ops = InferenceOpsLoader().load()
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/kernel_loader.py", line 83, in load
    return usable_exts[0].load()
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/extensions/cpp_extension.py", line 136, in load
    op_kernel = self.build_jit()
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/extensions/cuda_extension.py", line 86, in build_jit
    op_kernel = load(
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1309, in load
    return _jit_compile(
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1678, in _jit_compile
    version = JIT_EXTENSION_VERSIONER.bump_version_if_changed(
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/utils/_cpp_extension_versioner.py", line 45, in bump_version_if_changed
    hash_value = hash_source_files(hash_value, source_files)
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/utils/_cpp_extension_versioner.py", line 15, in hash_source_files
    with open(filename) as file:
FileNotFoundError: [Errno 2] No such file or directory: '/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/extensions/pybind/inference/inference.cpp'
E0612 11:56:53.465000 140063917000512 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1114939) of binary: /home/sw/anaconda3/envs/colossalai/bin/python
Traceback (most recent call last):
  File "/home/sw/anaconda3/envs/colossalai/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
llama_generation.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-12_11:56:53
  host      : sw-black
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1114939)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Error: failed to run torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 llama_generation.py -m ~/llama8b/TensorRT-LLM/Meta-Llama-3-8B-Instruct --max_length 80 on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!

Command: 'cd /home/sw/ColossalAI/examples/inference/llama && export [environment variables elided] && torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 llama_generation.py -m ~/llama8b/TensorRT-LLM/Meta-Llama-3-8B-Instruct --max_length 80'

Exit code: 1

Stdout: already printed

Stderr: already printed

====== Training on All Nodes =====
127.0.0.1: failure

====== Stopping All Nodes =====
127.0.0.1: finish
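
The FileNotFoundError above suggests the PyPI wheel does not ship the C++/CUDA sources that the JIT fallback tries to compile. A quick diagnostic sketch (the site-packages path is taken from the traceback; adjust for your environment):

$ # locate the installed colossalai package
$ python -c "import colossalai, os; print(os.path.dirname(colossalai.__file__))"
$ # the JIT build expects inference.cpp here; if this listing fails, the wheel lacks the sources
$ ls /home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/extensions/pybind/inference/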

Then after doing this:

$ cd ColossalAI
$ pip install .

this is the error during the same run:

/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/utils/_pytree.py:254: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/extensions/utils.py:96: UserWarning: [extension] The CUDA version on the system (12.4) does not match with the version (12.1) torch was compiled with. The mismatch is found in the minor version. As the APIs are compatible, we will allow compilation to proceed. If you encounter any issue when using the built kernel, please try to build it again with fully matched CUDA versions
  warnings.warn(
[extension] Compiling the JIT inference_ops_cuda kernel during runtime now
Traceback (most recent call last):
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/extensions/cpp_extension.py", line 132, in load
    op_kernel = self.import_op()
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/extensions/cpp_extension.py", line 61, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'colossalai._C.inference_ops_cuda'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sw/ColossalAI/examples/inference/llama/llama_generation.py", line 8, in <module>
    from colossalai.inference.config import InferenceConfig
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/inference/__init__.py", line 2, in <module>
    from .core import InferenceEngine
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/inference/core/__init__.py", line 1, in <module>
    from .engine import InferenceEngine
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/inference/core/engine.py", line 23, in <module>
    from colossalai.inference.modeling.policy import model_policy_map
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/inference/modeling/policy/__init__.py", line 2, in <module>
    from .nopadding_baichuan import NoPaddingBaichuanModelInferPolicy
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/inference/modeling/policy/nopadding_baichuan.py", line 3, in <module>
    from colossalai.inference.modeling.models.nopadding_baichuan import (
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/inference/modeling/models/nopadding_baichuan.py", line 13, in <module>
    from colossalai.inference.modeling.models.nopadding_llama import NopadLlamaMLP
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/inference/modeling/models/nopadding_llama.py", line 30, in <module>
    inference_ops = InferenceOpsLoader().load()
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/kernel_loader.py", line 83, in load
    return usable_exts[0].load()
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/extensions/cpp_extension.py", line 136, in load
    op_kernel = self.build_jit()
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/colossalai/kernel/extensions/cuda_extension.py", line 88, in build_jit
    op_kernel = load(
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1306, in load
    return _jit_compile(
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1710, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1797, in _write_ninja_file_and_build_library
    get_compiler_abi_compatibility_and_version(compiler)
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 359, in get_compiler_abi_compatibility_and_version
    if not check_compiler_ok_for_platform(compiler):
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 312, in check_compiler_ok_for_platform
    which = subprocess.check_output(['which', compiler], stderr=subprocess.STDOUT)
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.
[2024-06-12 12:05:03,961] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1202233) of binary: /home/sw/anaconda3/envs/colossalai/bin/python
Traceback (most recent call last):
  File "/home/sw/anaconda3/envs/colossalai/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sw/anaconda3/envs/colossalai/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
llama_generation.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-12_12:05:03
  host      : sw-black
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1202233)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Error: failed to run torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 llama_generation.py -m ~/llama8b/TensorRT-LLM/Meta-Llama-3-8B-Instruct --max_length 80 on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!

Command: 'cd /home/sw/ColossalAI/examples/inference/llama && export [environment variables elided] && torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 llama_generation.py -m ~/llama8b/TensorRT-LLM/Meta-Llama-3-8B-Instruct --max_length 80'

Exit code: 1

Stdout: already printed

Stderr: already printed

====== Training on All Nodes =====
127.0.0.1: failure

====== Stopping All Nodes =====
127.0.0.1: finish
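
The root cause of this second run is different: Command '['which', 'c++']' returned non-zero exit status 1 means PyTorch's JIT extension builder cannot find a C++ compiler on the PATH. A sketch of a likely fix, assuming a Debian/Ubuntu host (package names differ on other distros):

$ # install gcc/g++ so torch.utils.cpp_extension can compile the kernel
$ sudo apt-get install build-essential
$ # verify the compiler that torch probes for is now on the PATH
$ which c++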

Environment

Python 3.10.14
Torch 2.3.1+cu121
CUDA Version: 12.4
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

gyt1145028706 commented 2 weeks ago

I have encountered this bug, too.

teis-e commented 2 weeks ago

I never found a fix.

What did help was installing it in a completely new virtual environment, specifically with Python 3.8. Then the pip install errors disappeared. But sadly I still ended up with some errors that I never got to fix.
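
For reference, a minimal sketch of that workaround (the environment name is illustrative):

$ conda create -n colossalai-py38 python=3.8 -y
$ conda activate colossalai-py38
$ pip install colossalai
$ # some versions also support prebuilding the kernels, which avoids the runtime JIT path:
$ # BUILD_EXT=1 pip install colossalai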