Closed arthur-morgan-712 closed 1 year ago
Hi @arthur-morgan-712
I am not sure exactly what is happening here: the trace log says it is trying to load the extension module cpu_adam, yet ds_report says it is not installed. Maybe the system is caching this module somehow. Can you remove the contents of the /tmp/torch_extensions folder (rm -rf /tmp/torch_extensions/*) and retry?
Thanks. Reza
Hi Reza,
Thanks for your reply! I tried removing it; it did have cpu_adam as a subfolder, however the issue still persists.
Funnily enough, I tried running this on Colab and it seemed to have loaded it:
Loading extension module cpu_adam...
Time to load cpu_adam op: 33.38163924217224 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000030, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
And Colab's ds_report also states that it hasn't been installed:
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
[WARNING] sparse_attn requires CUDA version 10.1+, does not currently support >=11 or <10.1
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.7/dist-packages/torch']
torch version .................... 1.7.1+cu110
torch cuda version ............... 11.0
nvcc version ..................... 11.0
deepspeed install path ........... ['/usr/local/lib/python3.7/dist-packages/deepspeed']
deepspeed info ................... 0.3.13+22d5a1f, 22d5a1f, master
deepspeed wheel compiled w. ...... torch 1.7, cuda 11.0
Albeit, I did get another error on Colab (tcmalloc: large alloc 1200709632 bytes == .... Killing subprocess 346) but I assume it's because the 11 GB video card was still rather small for this task. So maybe cpu_adam not showing up as installed isn't out of the ordinary?
Quick update: I ran the command from the notebook (running run_seq2seq.py) on the Amazon Linux instance and it seems to have provided a more detailed stack trace. I'm pasting it here:
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[1/2] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-10.2/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/TH -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda-10.2/include -isystem /home/ec2-user/anaconda3/envs/python3/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -L/usr/local/cuda-10.2/lib64 -lcudart -lcublas -g -Wno-reorder -march=native -fopenmp -D__AVX256__ -c /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
FAILED: cpu_adam.o
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-10.2/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/TH -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda-10.2/include -isystem /home/ec2-user/anaconda3/envs/python3/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -L/usr/local/cuda-10.2/lib64 -lcudart -lcublas -g -Wno-reorder -march=native -fopenmp -D__AVX256__ -c /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp:4:10: fatal error: omp.h: No such file or directory
#include <omp.h>
^~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1539, in _run_ninja_build
env=env)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "examples/seq2seq/run_seq2seq.py", line 650, in <module>
main()
File "examples/seq2seq/run_seq2seq.py", line 590, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/ec2-user/SageMaker/checkstep-research/t5/transformers/src/transformers/trainer.py", line 892, in train
model, optimizer, lr_scheduler = init_deepspeed(self, num_training_steps=max_steps)
File "/home/ec2-user/SageMaker/checkstep-research/t5/transformers/src/transformers/integrations.py", line 417, in init_deepspeed
config_params=config,
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/__init__.py", line 125, in initialize
config_params=config_params)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 183, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 598, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 667, in _configure_basic_optimizer
adamw_mode=effective_adam_w_mode)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/adam/cpu_adam.py", line 78, in __init__
self.ds_opt_adam = CPUAdamBuilder().load()
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 215, in load
return self.jit_load(verbose)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 252, in jit_load
verbose=verbose)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 997, in load
keep_intermediates=keep_intermediates)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1202, in _jit_compile
with_cuda=with_cuda)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1300, in _write_ninja_file_and_build_library
error_prefix="Error building extension '{}'".format(name))
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'
Killing subprocess 6428
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
main()
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/launcher/launch.py", line 161, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ec2-user/anaconda3/envs/python3/bin/python3.6', '-u', 'examples/seq2seq/run_seq2seq.py', '--local_rank=0', '--model_name_or_path', 'google/mt5-small', '--output_dir', 'output_dir', '--adam_eps', '1e-06', '--evaluation_strategy=steps', '--do_train', '--label_smoothing', '0.1', '--learning_rate', '3e-5', '--logging_first_step', '--logging_steps', '1000', '--max_source_length', '128', '--max_target_length', '128', '--num_train_epochs', '1', '--overwrite_output_dir', '--per_device_train_batch_size', '16', '--predict_with_generate', '--sortish_sampler', '--val_max_target_length', '128', '--warmup_steps', '500', '--max_train_samples', '2000', '--max_val_samples', '500', '--task', 'translation_en_to_ro', '--dataset_name', 'wmt16', '--dataset_config', 'ro-en', '--source_prefix', 'translate English to Romanian: ', '--deepspeed', 'ds_config.json', '--fp16']' returned non-zero exit status 1.
I was able to get the example from the notebook going after I downgraded the DeepSpeed version to 0.3.10. I do have a follow-up question though: correct me if I'm wrong, but the only way to use DeepSpeed would be to use the HuggingFace Trainer class? At least that's what I can find on HuggingFace (also, trying to implement a custom script without using Trainer resulted in an unrecognized arguments: --local_rank=0 error, even though I wasn't passing that argument). If that's the case, does this mean I cannot use DeepSpeed to train a classification model passing different class weights to each class, because there's no such parameter to pass to Trainer?
We already have examples for running some transformer networks. For this argument, I think you might just add local_rank to your parser arguments, the same as here.
I was able to get the example from the notebook going after I downgraded the DeepSpeed version to 0.3.10.
So is this the issue with 0.3.13 of DeepSpeed? Because I'm facing the same issue as well with 0.3.13.
Also, are you able to run HuggingFace-4.4.2 with DeepSpeed-0.3.10 ? I think you should've downgraded to HuggingFace-4.3.x.
Hi @arthur-morgan-712 ,
Could you try building deepspeed ops while installing as suggested here?
We already have examples for running for some transformer networks. For this argument, I think you might just add local_rank to your parser arguments the same as here.
This is no longer needed in deepspeed since https://github.com/microsoft/DeepSpeed/pull/825 and transformers master has been adjusted accordingly. You just need to have env LOCAL_RANK set.
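For a custom script, a minimal sketch of accepting the rank either way could look like the following. The --local_rank argument and the LOCAL_RANK variable are the ones discussed above; the function name parse_args and the fallback value of -1 are assumptions made for illustration only.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse script arguments, tolerating a launcher-injected --local_rank."""
    parser = argparse.ArgumentParser()
    # If the launcher passes --local_rank=N it is accepted here; otherwise
    # fall back to the LOCAL_RANK env var, or -1 when neither is present.
    parser.add_argument(
        "--local_rank",
        type=int,
        default=int(os.environ.get("LOCAL_RANK", -1)),
    )
    return parser.parse_args(argv)
```

With this in place a script no longer fails with "unrecognized arguments: --local_rank=0" when the flag is injected, and it still picks up the rank from the environment when the flag is absent.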
I do have a follow-up question though: Correct me if I'm wrong, but the only way to use DeepSpeed would be to use the HuggingFace Trainer class?
Not at all. You can do your own integration and not rely on the HF Trainer.
If you do use the transformers Trainer, for the time being, while this is all new, you must use the transformers master branch, as frequent deepspeed-related updates are made.
If you have build problems please make sure you read:
https://huggingface.co/transformers/main_classes/trainer.html#installation-notes
though looking at OP I think you have all the right components. Just check that PATH/LD_LIBRARY_PATH are good.
Perhaps try to pre-build deepspeed: https://github.com/microsoft/DeepSpeed/issues/885#issuecomment-808339237
Thanks @stas00 for clarifying this : )
And the error is right there in your report: https://github.com/microsoft/DeepSpeed/issues/889#issuecomment-806526657
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-10.2/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/TH -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda-10.2/include -isystem /home/ec2-user/anaconda3/envs/python3/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -L/usr/local/cuda-10.2/lib64 -lcudart -lcublas -g -Wno-reorder -march=native -fopenmp -D__AVX256__ -c /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp:4:10: fatal error: omp.h: No such file or directory
#include <omp.h>
You're missing the right build tools. omp.h is missing. It should be in your gcc dev package, e.g. on my machine it's under:
/usr/lib/gcc/x86_64-linux-gnu/6/include/omp.h
/usr/lib/gcc/x86_64-linux-gnu/7/include/omp.h
/usr/lib/gcc/x86_64-linux-gnu/9/include/omp.h
@RezaYazdaniAminabadi, perhaps ds_report could somehow check that all the build components are there? E.g. it could attempt to build some very basic dummy extension when run? Not sure on the details. Clearly here gcc-dev tools are either missing or misconfigured.
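As a rough illustration of that idea, the sketch below tries to compile a trivial program that includes a given header. It is not how ds_report works; the function name can_compile_with_header, its parameters, and the default compiler name c++ (taken from the build log above) are all assumptions.

```python
import os
import shutil
import subprocess
import tempfile


def can_compile_with_header(header, compiler="c++", extra_flags=()):
    """Return True if `compiler` can compile a trivial file including `header`."""
    if shutil.which(compiler) is None:
        return False  # no such compiler on PATH
    source = f"#include <{header}>\nint main() {{ return 0; }}\n"
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "probe.cpp")
        obj = os.path.join(tmp, "probe.o")
        with open(src, "w") as f:
            f.write(source)
        # -c compiles without linking, so only header lookup is exercised.
        result = subprocess.run(
            [compiler, *extra_flags, "-c", src, "-o", obj],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    return result.returncode == 0
```

Calling can_compile_with_header("omp.h", extra_flags=("-fopenmp",)) before training would surface a missing omp.h up front instead of in the middle of a JIT build.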
Still running into:
ImportError: No module named 'cpu_adam'
CUDA 10.2 deepspeed 0.4.3 pytorch 1.8.1
Tried downgrading deepspeed to 0.3.10 and ran into: "deepspeed>=0.4.3 is required for a normal functioning of this module, but found deepspeed==0.3.10."
Are there any other potential solutions to date, or is this still open?
I also ran into this. Before the AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam', the error shown was that it could not make a dir in /tmp/torch_extensions/build for cpu_adam. So I changed DEFAULT_TORCH_EXTENSION_PATH in the file /anaconda3/envs/XXXXX/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py, and then it works.
This sounds like a permission issue. Try to set TMPDIR to another dir that is writable by you?
e.g.:
mkdir ~/tmp
export TMPDIR=~/tmp
... do the build here ...
Try to set TMPDIR to another dir that is writable by you?
That won't work because deepspeed hardcodes the default extension path to be /tmp/torch_extensions
The default is not used if TORCH_EXTENSIONS_DIR is set in the environment, but it would certainly be an improvement for it to follow TMPDIR
(especially as TORCH is not in the whitelist of environment variable prefixes which the deepspeed distributed launcher automatically plumbs through to worker processes)
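Given that, one option is to set TORCH_EXTENSIONS_DIR from inside the training script itself, before the first JIT build is triggered. A minimal sketch follows; the helper name use_writable_extensions_dir and the example path are hypothetical, only the TORCH_EXTENSIONS_DIR variable comes from the discussion above.

```python
import os


def use_writable_extensions_dir(candidate):
    """Point TORCH_EXTENSIONS_DIR at `candidate` after checking it is writable."""
    os.makedirs(candidate, exist_ok=True)
    if not os.access(candidate, os.W_OK):
        raise PermissionError(f"{candidate} is not writable")
    # Set in-process so every worker sets it for itself, regardless of
    # which variables the launcher forwards.
    os.environ["TORCH_EXTENSIONS_DIR"] = candidate
    return candidate
```

For example, use_writable_extensions_dir(os.path.expanduser("~/tmp/torch_extensions")) at the very top of the script, before deepspeed builds any op.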
@stas00 @RezaYazdaniAminabadi huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
^~~~~
compilation terminated.
[2/3] /root/anaconda3/envs/bitten/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/bitten/include -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include/TH -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include/THC -isystem /root/anaconda3/envs/bitten/include -isystem /root/anaconda3/envs/bitten/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -DCUDA_NO_HALF_OPERATORS -D_CUDA_NO_HALF_CONVERSIONS -D_CUDA_NO_BFLOAT16_CONVERSIONS -D_CUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U_CUDA_NO_HALF_OPERATORS -U_CUDA_NO_HALFCONVERSIONS -U_CUDA_NO_HALF2OPERATORS -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
FAILED: custom_cuda_kernel.cuda.o
/root/anaconda3/envs/bitten/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/bitten/include -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include/TH -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include/THC -isystem /root/anaconda3/envs/bitten/include -isystem /root/anaconda3/envs/bitten/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -DCUDA_NO_HALF_OPERATORS -D_CUDA_NO_HALF_CONVERSIONS -D_CUDA_NO_BFLOAT16CONVERSIONS -D_CUDA_NO_HALF2OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U_CUDA_NO_HALFOPERATORS -U_CUDA_NO_HALFCONVERSIONS -U_CUDA_NO_HALF2OPERATORS -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
In file included from /root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/context.h:11:0,
from /root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:16,
from /root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu:1:
/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/gemm_test.h:6:10: fatal error: cuda_profiler_api.h: No such file or directory
#include <cuda_profiler_api.h>
^~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/root/anaconda3/envs/bitten/lib/python3.8/subprocess.py", line 512, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "run_clm.py", line 478, in <module>
main()
File "run_clm.py", line 441, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/transformers/trainer.py", line 1112, in train
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/transformers/deepspeed.py", line 355, in deepspeed_init
model, optimizer, _, lr_scheduler = deepspeed.initialize(
File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/__init__.py", line 125, in initialize
engine = DeepSpeedEngine(args=args,
File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 330, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1195, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1266, in _configure_basic_optimizer
optimizer = DeepSpeedCPUAdam(model_parameters,
File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
self.ds_opt_adam = CPUAdamBuilder().load()
File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 460, in load
return self.jit_load(verbose)
File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 495, in jit_load
op_module = load(
File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
_write_ninja_file_and_build_library(
File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7efc99bbd1f0>
Traceback (most recent call last):
File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 108, in __del__
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
[2023-01-09 19:07:12,824] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 7728
[2023-01-09 19:07:12,826] [ERROR] [launch.py:324:sigkill_handler] AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
@arthur-morgan-712, you have a problem with your cuda environment:
/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/gemm_test.h:6:10:
fatal error: cuda_profiler_api.h: No such file or directory
#include <cuda_profiler_api.h>
Properly install the cuda environment, including all dev header files, then do the run again and it'll work.
On ubuntu I usually recommend the nvidia cuda pre-packaged .deb files, but be careful: the latest cuda is already at 12.x, so make sure you're installing the same major version as pytorch uses, most likely cuda-11.x. I think 11.8 is the latest in that line.
e.g. on my system the missing file is here:
/usr/local/cuda-11.8/targets/x86_64-linux/include/cuda_profiler_api.h
In my case, this bug was due to a ninja compile error. You can change directory to ~/.cache/torch_extensions/cpu_adam, then run ninja -v to see the error details.
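The same step can be scripted. The sketch below re-runs ninja -v in a JIT build directory and captures its output; the function name rerun_ninja is hypothetical, and the directory is whatever your installation uses (later comments in this thread show a versioned subfolder such as py310_cu118).

```python
import os
import shutil
import subprocess


def rerun_ninja(build_dir):
    """Re-run `ninja -v` in `build_dir` and return its combined output.

    Returns None if the directory or the ninja executable is missing.
    """
    build_dir = os.path.expanduser(build_dir)
    if not os.path.isdir(build_dir) or shutil.which("ninja") is None:
        return None
    result = subprocess.run(
        ["ninja", "-v"],
        cwd=build_dir,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # compiler errors arrive on stderr
        universal_newlines=True,
    )
    return result.stdout
```

Printing rerun_ninja("~/.cache/torch_extensions/cpu_adam") shows the failing compiler or linker command without having to start the whole training run.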
Closing this issue as the original issue was resolved. If anyone is having issues with this, please open a new issue and link this one and we would be happy to take a look.
Hi @Chiang,
I followed your advice and ran ninja -v under /root/.cache/torch_extensions/py310_cu118/cpu_adam, and got the error below. Please check and tell me how to fix this problem:
[1/1] c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/root/miniconda3/envs/llmtest/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
FAILED: cpu_adam.so
c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/root/miniconda3/envs/llmtest/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
/usr/bin/ld: cannot find -lcurand
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
/usr/bin/ld: cannot find -lcurand
It can't find your cuda install. If you have it installed, search for where libcurand.so is on your fs.
e.g. on my machine it's: /usr/local/cuda-12.1/targets/x86_64-linux/lib/libcurand.so.10.3.2.106
so in that case adding export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.1/targets/x86_64-linux/lib
should help finding that shared object. You need to edit this path to where your cuda is installed if any.
But most likely you don't have cuda installed. You can install it via apt https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-installation or even inside a conda environment via https://anaconda.org/conda-forge/cudatoolkit if you don't have sudo access (but I see it's only 11.8; I don't know if you need cuda-12.x or not).
I haven't tried, but this is probably the right package if you want to install it into your conda and has all the latest versions as well https://anaconda.org/nvidia/cuda-libraries - use the same version as your pytorch, to check which pytorch cuda version is used run:
python -c 'import torch; print(f"pt={torch.__version__}, cuda={torch.version.cuda}")'
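To search for the library, something like the following sketch can be used. The function name find_shared_library and the root directories in the usage note are assumptions; adjust them to wherever cuda or conda might live on your machine.

```python
import glob
import os


def find_shared_library(name, roots):
    """Return sorted paths under `roots` whose file name starts with lib<name>.so."""
    hits = []
    for root in roots:
        # "**" with recursive=True descends into all subdirectories.
        pattern = os.path.join(root, "**", f"lib{name}.so*")
        hits.extend(glob.glob(pattern, recursive=True))
    return sorted(set(hits))
```

For example, find_shared_library("curand", ["/usr/local", os.path.expanduser("~/miniconda3")]) lists every libcurand.so variant found; an empty list suggests the cuda libraries are not installed in those locations.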
Hi @stas00, it's not working. I changed the env var LD_LIBRARY_PATH but the error is the same as before.
@stas00 please see the details below:
[1/1] c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/root/miniconda3/envs/llmtest/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
FAILED: cpu_adam.so
c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/root/miniconda3/envs/llmtest/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
/usr/bin/ld: cannot find -lcurand
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
I exported LD_LIBRARY_PATH first and then ran ninja -v in the folder concerned, but the linker error did not change. Why?
I see you have used /root/miniconda3/envs/llmtest/lib/python3.10/site-packages/torch/lib for LD_LIBRARY_PATH.
Did you check that you have libcurand.so under /root/miniconda3/envs/llmtest/lib/python3.10/site-packages/torch/lib?
And which conda package did you install?
I linked libcurand.so into that folder and there was still no change. It does work if I add a link dir -L/usr/local/cuda/lib64 to ldflags in the build.ninja file.
It seems deepspeed generates a wrong ninja config, which causes the build to fail. I think this is a bug and requires a fix.
@stas00 for details, please refer to the 5813 issue I mentioned above.
I link the libcurand.so to the folder and still no change. and i find it works good if i add a link dir -L/usr/local/cuda/lib64 in ldflags of build.ninja file.
In which case you need to set:
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64
To automate this see: https://askubuntu.com/questions/210884/setting-ld-library-path-for-cuda
The other solution that often helps is to set CUDA_HOME
export CUDA_HOME=/usr/local/cuda
Nope, LD_LIBRARY_PATH takes effect at runtime, so it is only used when the program is loaded. Right now cpu_adam fails at compile time, so LD_LIBRARY_PATH provides no help. If I manually modify build.ninja and run ninja -v under ~/.cache/torch_extensions/py310_cu118/cpu_adam, the build succeeds. However, if I run the whole training program from the start, deepspeed overwrites the build.ninja file with the wrong settings again and it still crashes at compile time.
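One thing that may be worth trying for the link step, as an untested suggestion: when GCC is the compiler driver, it also consults the LIBRARY_PATH environment variable when resolving -l options, whereas LD_LIBRARY_PATH is read by the dynamic loader. The sketch below builds an environment with both set; the helper name prepend_path_var is hypothetical and the cuda path is simply the -L directory mentioned above. Whether this is enough for the deepspeed JIT build has not been verified here.

```python
import os


def prepend_path_var(env, name, directory):
    """Return a copy of `env` with `directory` first in the colon-separated `name`."""
    updated = dict(env)
    current = updated.get(name, "")
    # Drop empty entries and any existing copy of `directory` to avoid duplicates.
    parts = [p for p in current.split(":") if p and p != directory]
    updated[name] = ":".join([directory] + parts)
    return updated


# Assumed cuda location, taken from the -L flag that worked in build.ninja.
cuda_lib = "/usr/local/cuda/lib64"
env = prepend_path_var(os.environ, "LIBRARY_PATH", cuda_lib)  # link time (GCC)
env = prepend_path_var(env, "LD_LIBRARY_PATH", cuda_lib)      # load time
```

The resulting env can be passed as env=env to subprocess.run when launching training, or applied with os.environ.update(env) before the op is built.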
What I shared works for me. That's what I use on my desktop to build deepspeed.
What I shared works for me. That's what I use on my desktop to build deepspeed.
But it doesn't work for me. To be more specific, I encountered this problem when using llama factory integrated with deepspeed to train a model. You can follow the link below to get more info (skip the Chinese and focus on the output and logs if you have difficulty reading the Chinese content).
Hey guys, I'm having a problem getting DeepSpeed working with XLM-Roberta. I'm trying to run it on an Amazon Linux machine, which is based on Red Hat. Here are some versions of the packages/dependencies I'm using:
cuda version: 10.2
transformers: 4.4.2
pytorch: 1.7.1
deepspeed: 0.3.13
gcc/c++/g++: (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)
I must admit I had some issues upgrading the CUDA version from the default 10.0 on the instance to 10.2 and GCC from 4.8.5 to 7.2.1 but since I don't get the error that the torch and installed CUDA versions are different and that GCC has a version lower than 5, I'd assume I'm in the clear.
Here's the essential part of the code I'm running (from a notebook):
Here's the content of my config file:
Here's the output of my ds_config:
And finally, here's the stack trace:
Thanks in advance for your help!