Closed YuxinxinChen closed 10 months ago
Hi @YuxinxinChen, there are multiple factors causing the installation to fail, namely that it is using CUDA 11.5 and PyTorch 1.13, both of which have long been deprecated. Could you try the instructions here for installation? We also recommend performing all the steps inside a Conda environment, so that they can be reproduced on our end if the issue persists.
Hi @q10, I am using the instructions from here. It also gives the same error when I use CUDA 12.1, PyTorch 2.1.0+cu121 and pip install fbgemm-gpu --index-url https://download.pytorch.org/whl/cu121, as stated in my issue above. I also checked my LD_LIBRARY_PATH:
root@79f6f5f69f54:/usr# find . -name libtorch.so
./local/lib/python3.10/dist-packages/torch/lib/libtorch.so
root@79f6f5f69f54:/usr# find . -name "libnvidia-ml.so"
./local/cuda-12.2/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
root@79f6f5f69f54:/usr# echo $LD_LIBRARY_PATH
/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda-12.2/targets/x86_64-linux/lib/stubs
So libtorch.so and libnvidia-ml.so can both be found, but the error still persists.
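The check above can also be scripted so that it distinguishes the unversioned name from the versioned one. The following is a minimal sketch, not part of the FBGEMM tooling: the helper name `find_on_library_path` is hypothetical, and it only inspects the directories listed in LD_LIBRARY_PATH, whereas the dynamic loader also consults its cache and default directories.

```python
import os
from pathlib import Path


def find_on_library_path(filename, library_path=None):
    """Return the full path of `filename` in every LD_LIBRARY_PATH entry that has it.

    Hypothetical helper for illustration; it does not emulate the full search
    order of the dynamic loader.
    """
    if library_path is None:
        library_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for entry in library_path.split(":"):
        if not entry:
            continue  # ignore empty entries such as a trailing ':'
        candidate = Path(entry) / filename
        if candidate.exists():
            hits.append(str(candidate))
    return hits


if __name__ == "__main__":
    # The versioned name (.so.1) is looked up separately from the unversioned one.
    for name in ("libtorch.so", "libnvidia-ml.so", "libnvidia-ml.so.1"):
        print(name, "->", find_on_library_path(name) or "not found")
```

A file named libnvidia-ml.so does not satisfy a lookup for libnvidia-ml.so.1, so it is worth checking both names.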
@q10 I am using docker, so I think you could also reproduce the error. The steps I used:
docker pull nvcr.io/nvidia/pytorch:23.10-py3
docker run -it --gpus all --rm nvcr.io/nvidia/pytorch:23.10-py3 /bin/bash
Inside the docker
pip install fbgemm-gpu --index-url https://download.pytorch.org/whl/cu121
Then I use find to locate libtorch.so and libnvidia-ml.so:
cd /usr
find . -name libtorch.so
find . -name "libnvidia-ml.so"
I make sure the library paths are added to LD_LIBRARY_PATH.
Then I run:
python -c "import torch; import fbgemm_gpu; print(torch.ops.fbgemm.merge_pooled_embeddings)"
and got:
/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 746, in __getattr__
op, overload_names = torch._C._jit_get_operation(qualified_op_name)
RuntimeError: No such operator fbgemm::jagged_2d_to_dense
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/__init__.py", line 22, in <module>
from . import _fbgemm_gpu_docs # noqa: F401, E402
File "/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/_fbgemm_gpu_docs.py", line 19, in <module>
torch.ops.fbgemm.jagged_2d_to_dense,
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 750, in __getattr__
raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense'
I was able to reproduce the error you ran into, and found this line:
libnvidia-ml.so.1: cannot open shared object file: No such file or directory
right before the AttributeError message. For some reason, your Docker setup doesn't have libnvidia-ml.so.1 available, even though libnvidia-ml.so is available. You can create a symlink named libnvidia-ml.so.1 in the directory where libnvidia-ml.so is located, and expose that directory through LD_LIBRARY_PATH.
Note that the fbgemm_gpu package does not automatically install pytorch, so you will need to install pytorch nightly prior to running.
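The symlink workaround described above could be sketched as follows. This is an illustration under assumptions, not an official procedure: both helper names are hypothetical, and the directory to pass in is whichever one holds libnvidia-ml.so on your system. The import runs in a child process because the dynamic loader reads LD_LIBRARY_PATH when a process starts, so changing it inside an already running interpreter does not affect library lookup.

```python
import os
import subprocess
import sys
from pathlib import Path


def ensure_versioned_symlink(lib_dir, unversioned="libnvidia-ml.so",
                             versioned="libnvidia-ml.so.1"):
    """Create lib_dir/versioned -> unversioned unless something is already there."""
    link = Path(lib_dir) / versioned
    if not link.exists() and not link.is_symlink():
        # Relative target, so the link stays valid if the directory is moved.
        link.symlink_to(unversioned)
    return link


def run_with_library_dir(lib_dir, code):
    """Run `code` in a fresh Python process with lib_dir prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ)
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = f"{lib_dir}:{existing}" if existing else str(lib_dir)
    return subprocess.run([sys.executable, "-c", code],
                          env=env, capture_output=True, text=True)


# Example usage (the path is the stubs directory reported earlier in this thread):
#   stubs = "/usr/local/cuda-12.2/targets/x86_64-linux/lib/stubs"
#   ensure_versioned_symlink(stubs)
#   result = run_with_library_dir(stubs, "import fbgemm_gpu")
#   print(result.returncode, result.stderr)
```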
@q10 Thanks for your reply. However, after creating a symbolic link for libnvidia-ml.so.1, I still get the same error:
root@ae0dd43f34c1:/workspace# python
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
2.1.0a0+32f93b1
>>> torch.cuda.is_available()
True
>>> import fbgemm_gpu
/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 746, in __getattr__
op, overload_names = torch._C._jit_get_operation(qualified_op_name)
RuntimeError: No such operator fbgemm::jagged_2d_to_dense
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/__init__.py", line 22, in <module>
from . import _fbgemm_gpu_docs # noqa: F401, E402
File "/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/_fbgemm_gpu_docs.py", line 19, in <module>
torch.ops.fbgemm.jagged_2d_to_dense,
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 750, in __getattr__
raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense'
>>> exit()
root@ae0dd43f34c1:/workspace# echo $LD_LIBRARY_PATH
/usr/local/cuda-12.2/targets/x86_64-linux/lib/stubs:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
root@ae0dd43f34c1:/workspace# ls -al /usr/local/cuda-12.2/targets/x86_64-linux/lib/stubs
total 943
drwxr-xr-x 2 root root 27 Nov 17 22:20 .
drwxr-xr-x 4 root root 107 Oct 4 02:02 ..
-rw-r--r-- 1 root root 79832 Aug 16 05:31 libcublas.so
-rw-r--r-- 1 root root 38872 Aug 16 05:31 libcublasLt.so
-rw-r--r-- 1 root root 66272 Aug 16 05:28 libcuda.so
-rw-r--r-- 1 root root 9400 Aug 16 06:06 libcufft.so
-rw-r--r-- 1 root root 13496 Aug 16 06:06 libcufftw.so
-rw-r--r-- 1 root root 9400 Aug 16 05:37 libcurand.so
-rw-r--r-- 1 root root 111800 Aug 16 05:57 libcusolver.so
-rw-r--r-- 1 root root 29880 Aug 16 05:57 libcusolverMg.so
-rw-r--r-- 1 root root 54456 Aug 16 05:30 libcusparse.so
-rw-r--r-- 1 root root 5304 Aug 16 05:53 libnppc.so
-rw-r--r-- 1 root root 259256 Aug 16 05:53 libnppial.so
-rw-r--r-- 1 root root 136376 Aug 16 05:53 libnppicc.so
-rw-r--r-- 1 root root 177336 Aug 16 05:53 libnppidei.so
-rw-r--r-- 1 root root 263352 Aug 16 05:53 libnppif.so
-rw-r--r-- 1 root root 87224 Aug 16 05:53 libnppig.so
-rw-r--r-- 1 root root 42168 Aug 16 05:53 libnppim.so
-rw-r--r-- 1 root root 427192 Aug 16 05:53 libnppist.so
-rw-r--r-- 1 root root 9400 Aug 16 05:53 libnppisu.so
-rw-r--r-- 1 root root 54456 Aug 16 05:53 libnppitc.so
-rw-r--r-- 1 root root 222392 Aug 16 05:53 libnpps.so
-rw-r--r-- 1 root root 9400 Aug 16 05:39 libnvJitLink.so
-rw-r--r-- 1 root root 55064 Aug 16 05:12 libnvidia-ml.so
lrwxrwxrwx 1 root root 15 Nov 17 22:20 libnvidia-ml.so.1 -> libnvidia-ml.so
-rw-r--r-- 1 root root 13496 Aug 16 05:29 libnvjpeg.so
-rw-r--r-- 1 root root 5304 Aug 16 05:28 libnvrtc.so
root@ae0dd43f34c1:/workspace# ls /usr/local/lib/python3.10/dist-packages/torch/lib
libbackend_with_compiler.so libc10d_cuda_test.so libnvfuser_codegen.so libtorch_cpu.so libtorch_global_deps.so
libc10.so libcaffe2_nvrtc.so libshm.so libtorch_cuda.so libtorch_python.so
libc10_cuda.so libjitbackend_test.so libtorch.so libtorch_cuda_linalg.so libtorchbind_test.so
@q10 Regarding "Note that the fbgemm_gpu package does not automatically install pytorch, so you will need to install pytorch nightly prior to running.": the Docker image comes with PyTorch already installed. Are you suggesting that fbgemm_gpu is not able to see the PyTorch in the Docker image?
Hmm, this appears to be a problem. I was looking at the nightly builds, and the symbol torch::autograd::Node::name() (_ZNK5torch8autograd4Node4nameEv in mangled form) does appear to be undefined. We will investigate this more next week.
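For readers unfamiliar with mangled names, the correspondence between the two forms can be reproduced with a small sketch. The function below is a deliberately limited illustration (the name `demangle_simple` is hypothetical): it handles only nested names with no parameters, which is enough for this symbol. Tools such as c++filt or nm -C cover the general case.

```python
def demangle_simple(symbol):
    """Demangle a simple Itanium C++ ABI nested name that takes no parameters.

    Handles only the pattern _ZN[K]<len><name>...Ev and returns None for
    anything else. Illustrative only; not a general demangler.
    """
    if not symbol.startswith("_ZN"):
        return None
    pos = 3
    # 'K' marks a const-qualified member function.
    is_const = symbol[pos:pos + 1] == "K"
    if is_const:
        pos += 1
    parts = []
    # Each component is a decimal length followed by that many characters.
    while pos < len(symbol) and symbol[pos].isdigit():
        start = pos
        while pos < len(symbol) and symbol[pos].isdigit():
            pos += 1
        length = int(symbol[start:pos])
        name = symbol[pos:pos + length]
        if len(name) != length:
            return None
        parts.append(name)
        pos += length
    # 'E' closes the nested name; 'v' means an empty parameter list.
    if not parts or symbol[pos:] != "Ev":
        return None
    return "::".join(parts) + "()" + (" const" if is_const else "")


if __name__ == "__main__":
    print(demangle_simple("_ZNK5torch8autograd4Node4nameEv"))
```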
We generally install PyTorch and fbgemm_gpu through pip inside a Conda environment. This is to enable environment reproducibility for cases where we run into build / install issues, and this might account for why we haven't run into the undefined symbol problem yet.
Hi @q10, I've explored various approaches to resolve the issue. Initially, I attempted to build fbgemm_gpu from source, which succeeded. However, when I imported fbgemm_gpu in Python, it threw an "illegal instruction" error, which forcibly terminated the whole interactive Python session. Interestingly, the same thing happened when I installed it through pip inside a Conda environment.
I thought sharing this additional information might be helpful in diagnosing the problem. Any insights or guidance you can provide would be much appreciated.
@YuxinxinChen Unfortunately the build instructions do not explicitly emphasize this, but use of the PyTorch installation that comes with the Docker image is discouraged, and one should always create the full environment (i.e. install PyTorch from scratch) inside a Conda environment for consistency and reproducibility.
You also mentioned earlier that the installed PyTorch version is 1.13, but that has been deprecated, and is likely the reason you are running into the undefined symbol error when loading the module.
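A quick way to spot a mismatched pair of packages is to compare the local version tags (the part after the `+`) of the two version strings. The sketch below is only a heuristic with hypothetical helper names: matching tags do not prove the two binaries are compatible, but differing tags, such as the 2.1.0a0+32f93b1 build reported earlier in this thread against a +cu121 fbgemm_gpu wheel, suggest they were not built as a pair.

```python
def local_tag(version):
    """Return the local version tag after '+', or '' if there is none."""
    return version.partition("+")[2]


def same_wheel_variant(torch_version, fbgemm_gpu_version):
    """True when both version strings carry the same non-empty local tag.

    Heuristic only: this compares strings and says nothing definitive
    about binary compatibility.
    """
    tag = local_tag(torch_version)
    return bool(tag) and tag == local_tag(fbgemm_gpu_version)


if __name__ == "__main__":
    # Version strings taken from this thread.
    print(same_wheel_variant("2.1.0a0+32f93b1", "2023.11.21+cu121"))
    print(same_wheel_variant("2.1.0+cu121", "2023.11.21+cu121"))
```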
This is what I'm running so far:
# Launch Docker instance (GPU not required)
docker run -it amazonlinux:2023 /bin/bash
# Install tools
yum update -y; yum install -y binutils findutils git pciutils sudo tar wget which
# Clone the repo
git clone --recurse-submodules https://github.com/pytorch/FBGEMM.git
# Load the build scripts
cd FBGEMM && . .github/scripts/setup_env.bash
# Set up Miniconda
setup_miniconda ~/local/miniconda3
# Set up a Conda environment (with CUDA 12.1 and PyTorch nightly for CUDA 12.1)
env_name=foo
test_setup_conda_environment $env_name 3.11 pip nightly cuda 12.1.0
# install FBGEMM_GPU from PIP (CUDA 12.1 variant)
install_fbgemm_gpu_pip $env_name nightly cuda 12.1.0
At this point, the last step will fail because libnvidia-ml.so.1 is not found:
(base) bash-5.2# conda run -n foo python -c "import fbgemm_gpu"
libnvidia-ml.so.1: cannot open shared object file: No such file or directory
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/root/local/miniconda3/envs/foo/lib/python3.11/site-packages/fbgemm_gpu/__init__.py", line 23, in <module>
from . import _fbgemm_gpu_docs, sparse_ops # noqa: F401, E402 # noqa: F401, E402
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/local/miniconda3/envs/foo/lib/python3.11/site-packages/fbgemm_gpu/_fbgemm_gpu_docs.py", line 19, in <module>
torch.ops.fbgemm.jagged_2d_to_dense,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/local/miniconda3/envs/foo/lib/python3.11/site-packages/torch/_ops.py", line 820, in __getattr__
raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense'
ERROR conda.cli.main_run:execute(49): `conda run python -c import fbgemm_gpu` failed. (See above for error)
But after creating the symlink and specifying LD_LIBRARY_PATH, the module load works:
# Create symlink
(base) bash-5.2# ln -s /root/local/miniconda3/envs/foo/lib/stubs/libnvidia-ml.so /root/local/miniconda3/envs/foo/lib/stubs/libnvidia-ml.so.1
# Load FBGEMM_GPU
(base) bash-5.2# LD_LIBRARY_PATH=/root/local/miniconda3/envs/foo/lib/stubs conda run -n foo python -c "import fbgemm_gpu; print(fbgemm_gpu.__version__)"
2023.11.21+cu121
(base) bash-5.2#
Undefined symbols still exist in libtorch as expected:
(base) bash-5.2# nm -gDCu /root/local/miniconda3/envs/foo/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so | sort | grep Node::name
U torch::autograd::Node::name() const
(base) bash-5.2#
but this doesn't affect the loading of the module.
Could you try running the aforementioned steps on the NVIDIA Docker image you have been using and let me know how it goes?
@q10 Thanks for the detailed steps above! They are really helpful! After following them, I got the illegal instruction error:
[INSTALL] Successfully installed PyTorch through PyTorch PIP
(base) [root@b73d0643267c FBGEMM]# install_fbgemm_gpu_pip $env_name nightly cuda 12.1.0
################################################################################
# Install FBGEMM-GPU Package from PIP
#
# [2023-11-22T19:05:07.388Z] + install_fbgemm_gpu_pip fbgemm_gpu nightly cuda 12.1.0
################################################################################
################################################################################
# Install fbgemm_gpu (PyTorch PIP)
#
# [2023-11-22T19:05:07.391Z] + install_from_pytorch_pip fbgemm_gpu fbgemm_gpu nightly cuda 12.1.0
################################################################################
[CHECK] Network does not appear to be blocked.
################################################################################
# Install fbgemm_gpu (PyTorch PIP)
#
# [2023-11-22T19:05:07.578Z] + __extract_pip_arguments fbgemm_gpu fbgemm_gpu nightly cuda 12.1.0
################################################################################
[CHECK] Network does not appear to be blocked.
[INSTALL] Extracted package variant: cu121
[INSTALL] Attempting to install [fbgemm-gpu, nightly+cu121] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu121/ ...
[EXEC] [ATTEMPT 0/3] + conda run -n fbgemm_gpu pip install --pre fbgemm-gpu --extra-index-url https://download.pytorch.org/whl/nightly/cu121/
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/nightly/cu121/
Collecting fbgemm-gpu
Downloading https://download.pytorch.org/whl/nightly/cu121/fbgemm_gpu-2023.11.22%2Bcu121-cp311-cp311-manylinux2014_x86_64.whl (338.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 338.3/338.3 MB 5.4 MB/s eta 0:00:00
Requirement already satisfied: numpy in /root/local/miniconda3/envs/fbgemm_gpu/lib/python3.11/site-packages (from fbgemm-gpu) (1.26.0)
Installing collected packages: fbgemm-gpu
Successfully installed fbgemm-gpu-2023.11.22+cu121
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
fbgemm-gpu 2023.11.22+cu121
[CHECK] The installed package [fbgemm-gpu, nightly] is the correct variant (cu121)
[INSTALL] Checking imports and symbols ...
/tmp/tmpf9cp_twp: line 3: 13744 Illegal instruction (core dumped) python -c 'import fbgemm_gpu'
ERROR conda.cli.main_run:execute(49): `conda run python -c import fbgemm_gpu` failed. (See above for error)
[CHECK] Python package 'fbgemm_gpu' was not found, or the package is broken!
These are the highlights of some of the steps:
[INSTALL] Successfully installed PyTorch through PyTorch PIP
(base) [root@b73d0643267c FBGEMM]# install_fbgemm_gpu_pip $env_name nightly cuda 12.1.0
Successfully installed fbgemm-gpu-2023.11.22+cu121
[INSTALL] Checking imports and symbols ...
/tmp/tmpf9cp_twp: line 3: 13744 Illegal instruction (core dumped) python -c 'import fbgemm_gpu'
ERROR conda.cli.main_run:execute(49): `conda run python -c import fbgemm_gpu` failed. (See above for error)
I am not sure why this is happening. I will also try a different machine to see whether this is a machine-specific problem. If you have run into this error before, could you please give me some guidance? Many thanks!
@YuxinxinChen Hmm, we haven't run into an illegal instruction problem in our setups. I wonder if the issue is hardware-related. Could you show us the output of nvidia-smi?
One thing to note is that FBGEMM_GPU currently has official support for Volta and Ampere GPUs, but not the older architectures, so there is no guarantee of the library working on systems with older graphics cards.
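The architecture check can also be done from Python by looking at the CUDA compute capability of the device. This is a minimal sketch under assumptions: the table is partial and hand-written (7.0 for Volta, 8.0 and 8.6 for Ampere), the helper names are hypothetical, and it is not an official FBGEMM_GPU support matrix.

```python
# Partial, hand-written table of compute capabilities for the two architectures
# named in this thread. It is an assumption for illustration, not an official list.
ARCH_BY_CAPABILITY = {
    (7, 0): "Volta",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
}


def architecture_name(capability):
    """Map a (major, minor) compute capability to an architecture name, or None."""
    return ARCH_BY_CAPABILITY.get(tuple(capability))


def current_gpu_capability():
    """Return the (major, minor) capability of GPU 0, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    capability = current_gpu_capability()
    if capability is None:
        print("No CUDA device visible to PyTorch")
    else:
        print(capability, architecture_name(capability) or "not in the table")
```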
Hi @q10, here is my nvidia-smi:
(base) yuxin420@mario:~$ nvidia-smi
Sat Nov 25 21:42:56 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-PCIE-32GB On | 00000000:02:00.0 Off | 0 |
| N/A 32C P0 36W / 250W| 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
It is a Volta GPU, so I guess it is not a GPU architecture problem.
@YuxinxinChen We can try one more thing - could you try to run only on the bare OS (i.e. no Docker) but with a Conda environment with all the CUDA and PyTorch packages installed inside that environment, and let me know how that goes?
Hi @q10, I am able to get fbgemm imported on gcloud using the commands you provided above! The illegal instruction error should be related to the machine I used in our lab, though I have no idea why it gives me this error. Sorry for the trouble, and thank you very much for the time you spent with me on this issue!
Again many thanks!
Best, Yuxin
@YuxinxinChen No problem and no worries. Do let us know if you end up figuring out what is different about the machine in your lab that causes the crash, and whether you find a way to address it. We try to collect as many user problems as we can, and your feedback helps us improve the overall FBGEMM user experience.
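The cause of the illegal instruction crash is never established in this thread. One general possibility, offered here only as a guess, is a binary that uses CPU instruction-set extensions the host CPU lacks. A sketch for comparing the two machines on x86 Linux follows; the helper names and the list of flags to look for are assumptions chosen for illustration.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, wanted=("avx", "avx2", "fma", "avx512f")):
    """List the flags in `wanted` that the CPU does not report.

    The default `wanted` tuple is an assumption for illustration, not a
    documented requirement of fbgemm_gpu.
    """
    flags = cpu_flags(cpuinfo_text)
    return [flag for flag in wanted if flag not in flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            print("Missing:", missing_flags(handle.read()) or "none")
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

Running this on both the lab machine and the cloud instance and comparing the output would show whether the CPUs differ in the extensions they support.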
Hi Team,
I am trying to use fbgemm_gpu, but I ran into a problem at the import step. Below are my errors:
The system: CUDA:
Pytorch:
The pip command I used to install fbgemm_gpu:
System:
I also tried other versions of CUDA, PyTorch and fbgemm_gpu; unfortunately, I got the same error. The other CUDA, PyTorch and fbgemm_gpu versions: CUDA:
pytorch:
The pip command used to install fbgemm_gpu:
I also tried this combination and got the following error:
The cuda I used:
Pytorch and
The pip command used to install fbgemm_gpu:
Any help that could enable me to use fbgemm_gpu would be appreciated!
Best,
Yuxin