nv-tlabs / LION

Latent Point Diffusion Models for 3D Shape Generation

training problem? #58

Open Wanghc233 opened 9 months ago

Wanghc233 commented 9 months ago

When I train the VAE, the following problem occurred:

Using /home/whc/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/whc/.cache/torch_extensions/emd_ext/build.ninja...
Building extension module emd_ext...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=emd_ext -DTORCH_API_INCLUDE_EXTENSION_H -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /home/whc/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++14 -c /home/whc/LION/third_party/PyTorchEMD/cuda/emd_kernel.cu -o emd_kernel.cuda.o
FAILED: emd_kernel.cuda.o
/usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=emd_ext -DTORCH_API_INCLUDE_EXTENSION_H -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /home/whc/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++14 -c /home/whc/LION/third_party/PyTorchEMD/cuda/emd_kernel.cu -o emd_kernel.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_86'
ninja: build stopped: subcommand failed.
2023-11-20 16:21:35.117 | ERROR | utils.utils:init_processes:1158 - An error has been caught in function 'init_processes', process 'MainProcess' (1741727), thread 'MainThread' (140035078779840):
Traceback (most recent call last):

File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1516, in _run_ninja_build subprocess.run( │ └ <function run at 0x7f5c743be430> └ <module 'subprocess' from '/home/whc/miniconda3/envs/lion_env/lib/python3.8/subprocess.py'> File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/subprocess.py", line 516, in run raise CalledProcessError(retcode, process.args, │ │ │ └ ['ninja', '-v'] │ │ └ <subprocess.Popen object at 0x7f5ab08a1520> │ └ 1 └ <class 'subprocess.CalledProcessError'>

subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "train_dist.py", line 251, in utils.init_processes(0, size, main, args, config) │ │ │ │ │ └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ... │ │ │ │ └ Namespace(autocast_eval=True, autocast_train=True, config='none', data='/tmp/nvae-diff/data', dataset='cifar10', distributed=... │ │ │ └ <function main at 0x7f5bc982a160> │ │ └ 1 │ └ <function init_processes at 0x7f5bc98263a0> └ <module 'utils.utils' from '/home/whc/LION/utils/utils.py'>

File "/home/whc/LION/utils/utils.py", line 1158, in init_processes fn(args, config) │ │ └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ... │ └ Namespace(autocast_eval=True, autocast_train=True, config='none', data='/tmp/nvae-diff/data', dataset='cifar10', distributed=... └ <function main at 0x7f5bc982a160>

File "train_dist.py", line 31, in main trainer_lib = importlib.import_module(config.trainer.type) │ │ └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ... │ └ <function import_module at 0x7f5c7481ad30> └ <module 'importlib' from '/home/whc/miniconda3/envs/lion_env/lib/python3.8/importlib/init.py'>

File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/importlib/init.py", line 127, in import_module return _bootstrap._gcd_import(name[level:], package, level) │ │ │ │ │ └ 0 │ │ │ │ └ None │ │ │ └ 0 │ │ └ 'trainers.hvae_trainer' │ └ <function _gcd_import at 0x7f5c74943430> └ <module 'importlib._bootstrap' (frozen)> File "", line 1014, in _gcd_import File "", line 991, in _find_and_load File "", line 975, in _find_and_load_unlocked File "", line 671, in _load_unlocked File "", line 843, in exec_module File "", line 219, in _call_with_frames_removed

File "/home/whc/LION/trainers/hvae_trainer.py", line 18, in from trainers.base_trainer import BaseTrainer

File "/home/whc/LION/trainers/base_trainer.py", line 19, in from utils.evaluation_metrics_fast import print_results

File "/home/whc/LION/utils/evaluation_metrics_fast.py", line 24, in from third_party.PyTorchEMD.emd_nograd import earth_mover_distance_nograd

File "/home/whc/LION/third_party/PyTorchEMD/emd_nograd.py", line 4, in from third_party.PyTorchEMD.backend import emd_cuda_dynamic as emd_cuda

File "/home/whc/LION/third_party/PyTorchEMD/backend.py", line 10, in emd_cuda_dynamic = load(name='emd_ext', └ <function load at 0x7f5aafd78280>

File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 969, in load return _jit_compile( └ <function _jit_compile at 0x7f5aafd783a0> File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1176, in _jit_compile _write_ninja_file_and_build_library( └ <function _write_ninja_file_and_build_library at 0x7f5aafd784c0> File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1280, in _write_ninja_file_and_build_library _run_ninja_build( └ <function _run_ninja_build at 0x7f5aafd78940> File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1538, in _run_ninja_build raise RuntimeError(message) from e └ "Error building extension 'emd_ext'"

RuntimeError: Error building extension 'emd_ext'

If I run export TORCH_CUDA_ARCH_LIST="7.5", another problem occurs:

CUDA kernel failed : no kernel image is available for execution on the device
void avg_voxelize(int, int, int, int, int, int, const int, const float, int, int, float*) at L:118 in /home/whc/LION/third_party/pvcnn/functional/src/voxelization/vox.cu
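For reference, the RTX 3090 reports compute capability 8.6, which is why PyTorch asks nvcc for compute_86 by default; forcing TORCH_CUDA_ARCH_LIST="7.5" lets the build finish but only produces sm_75 kernels, which an 8.6 device cannot load, hence the "no kernel image is available" error. A minimal sketch to confirm what your own GPU reports (standard PyTorch calls, nothing LION-specific):

```python
import torch

# Compute capability of the GPU the extensions will actually run on.
# An RTX 3090 reports (8, 6); that is why the JIT build requests
# -gencode=arch=compute_86,code=sm_86 when TORCH_CUDA_ARCH_LIST is unset.
major, minor = torch.cuda.get_device_capability(0)
print("device:", torch.cuda.get_device_name(0))
print(f"compute capability: {major}.{minor}")
```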

ZENGXH commented 8 months ago

It seems the build of the EMD extension failed: nvcc fatal : Unsupported gpu architecture 'compute_86'
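The underlying cause looks like a toolkit/GPU mismatch: the log shows nvcc from /usr/local/cuda-11.0, and as far as I know sm_86 support was only added in CUDA 11.1, so that toolkit cannot compile for compute_86. A quick, hedged way to check which toolkit PyTorch will hand to the JIT build (assuming a standard install):

```python
import torch
from torch.utils.cpp_extension import CUDA_HOME

# Toolkit used when JIT-compiling extensions such as emd_ext; its nvcc
# must understand the requested architecture (compute_86 here).
print("CUDA_HOME:", CUDA_HOME)              # e.g. /usr/local/cuda-11.0

# CUDA version PyTorch itself was built against, for comparison.
print("torch.version.cuda:", torch.version.cuda)
```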

Wanghc233 commented 8 months ago

I use an RTX 3090. Running export TORCH_CUDA_ARCH_LIST="8.0" solved the "nvcc fatal : Unsupported gpu architecture 'compute_86'" problem.
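A possible explanation of why "8.0" works here: CUDA 11.0's nvcc can emit sm_80 code, and sm_80 binaries also run on compute capability 8.6 cards such as the 3090, so the kernels load fine at runtime. For anyone scripting the workaround, the sketch below is one way to apply it from Python before the extension gets built; the cache path comes from the log above, and placing this at the very top of train_dist.py (before the LION imports that trigger the JIT build) is my assumption, not the official procedure:

```python
import os
import shutil

# Request an architecture the CUDA 11.0 toolchain can actually compile.
# torch.utils.cpp_extension reads this variable when it builds emd_ext.
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"

# Drop leftovers of the earlier failed build so ninja does not reuse
# objects compiled with the old -gencode flags.
shutil.rmtree(os.path.expanduser("~/.cache/torch_extensions/emd_ext"),
              ignore_errors=True)
```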

ZENGXH commented 7 months ago

Thanks for the update!

Wanghc233 commented 7 months ago

Haha, wishing you an early graduation!

Philcalab commented 7 months ago

I use an RTX 3090. Running export TORCH_CUDA_ARCH_LIST="8.0" solved the "nvcc fatal : Unsupported gpu architecture 'compute_86'" problem.

Should export TORCH_CUDA_ARCH_LIST="8.0" be placed at the end of the .bashrc file?

Wanghc233 commented 7 months ago

Typing it once in the terminal is enough; there is no need to change the global environment, otherwise your other code might not run. First enter export TORCH_CUDA_ARCH_LIST="8.0" in the terminal, press Enter, and then run python train.py.
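In other words, the variable only needs to exist in the shell session (or process) that launches training. If you prefer to keep it out of your shell entirely, a small hypothetical launcher like the one below does the same thing; substitute your usual training command and arguments:

```python
# Hypothetical launcher: sets TORCH_CUDA_ARCH_LIST only for the child
# process that runs training, leaving the global environment untouched.
import os
import subprocess

env = dict(os.environ, TORCH_CUDA_ARCH_LIST="8.0")
# Replace with the training command you normally use.
subprocess.run(["python", "train_dist.py"], env=env, check=True)
```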