Error importing fbgemm_gpu

pytorch / FBGEMM

FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/

Error importing fbgemm_gpu #2130

Closed YuxinxinChen closed 10 months ago

YuxinxinChen commented 11 months ago

Hi Team,

I am trying to use fbgemm_gpu, but the import step fails. The errors are below:

root@ef21f3d4fc03:/workspace# python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import fbgemm_gpu
/opt/conda/lib/python3.8/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/_ops.py", line 203, in __getattr__
    op, overload_names = torch._C._jit_get_operation(qualified_op_name)
RuntimeError: No such operator fbgemm::jagged_2d_to_dense

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/fbgemm_gpu/__init__.py", line 22, in <module>
    from . import _fbgemm_gpu_docs  # noqa: F401, E402
  File "/opt/conda/lib/python3.8/site-packages/fbgemm_gpu/_fbgemm_gpu_docs.py", line 19, in <module>
    torch.ops.fbgemm.jagged_2d_to_dense,
  File "/opt/conda/lib/python3.8/site-packages/torch/_ops.py", line 207, in __getattr__
    raise AttributeError(f"'_OpNamespace' object has no attribute '{op_name}'") from e
AttributeError: '_OpNamespace' object has no attribute 'jagged_2d_to_dense'

The system: CUDA:

root@ef21f3d4fc03:/workspace# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

Pytorch:

>>> import torch
>>> print(torch.__version__)
1.13.0a0+d321be6
>>> torch.cuda.is_available()
True

The pip command I used to install fbgemm_gpu:

pip install fbgemm-gpu --index-url https://download.pytorch.org/whl/cu115

System:

root@ef21f3d4fc03:/workspace# uname -a
Linux ef21f3d4fc03 5.15.0-71-generic #78-Ubuntu SMP Tue Apr 18 09:00:29 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

I also tried other versions of CUDA, PyTorch, and fbgemm_gpu, but unfortunately got the same error. The other versions I tried are listed below. CUDA:

root@d5ac78cbd4a5:/workspace# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

pytorch:

>>> import torch
>>> print(torch.__version__)
2.1.0a0+32f93b1
>>> torch.cuda.is_available()
True

The pip command used to install fbgemm_gpu:

pip install fbgemm-gpu --index-url https://download.pytorch.org/whl/cu121

I also tried this combination and got the following error:

>>> import torch
>>> import torch
>>> print(torch.__version__)
2.1.0+cu121
>>> torch.cuda.is_available()
True
>>> import fbgemm_gpu
Illegal instruction (core dumped)

The cuda I used:

(fbgemm) yuxin420@mario:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

PyTorch:

>>> import torch
>>> print(torch.__version__)
2.1.0+cu121
>>> torch.cuda.is_available()
True

The pip command used to install fbgemm_gpu:

pip install fbgemm-gpu --index-url https://download.pytorch.org/whl/cu121

Any help that could enable me to use fbgemm_gpu would be appreciated!

Best,

Yuxin

q10 commented 11 months ago

Hi @YuxinxinChen, there are multiple issues causing the installation to fail, namely the use of CUDA 11.5 and PyTorch 1.13, both of which have long been deprecated. Could you try the instructions here for installation? It is also recommended to perform all the steps inside a Conda environment, so that they can be reproduced on our end if the observed issue persists.
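
As a rough illustration of the version mismatch described here, the sketch below splits a `torch.__version__` string into its release number and local tag, and reads the variant name off a wheel index URL. The helpers `describe_torch_build` and `index_variant` are hypothetical names, not part of PyTorch or fbgemm_gpu, and treating an `a0` marker as a sign of a custom (non-wheel) build is an assumption based only on the version strings reported in this thread.

```python
import re


def describe_torch_build(version: str) -> dict:
    """Split a torch.__version__ string into release, CUDA tag, and a custom-build flag."""
    public, _, local = version.partition("+")
    release_match = re.match(r"\d+(?:\.\d+)*", public)
    release = release_match.group(0) if release_match else public
    # Prebuilt wheels in this thread carry a local tag such as "cu121".
    cuda_tag = local if re.fullmatch(r"cu\d+", local) else None
    # Assumption: an alpha marker such as "a0" (followed by a git hash as the
    # local tag) indicates a build that did not come from the wheel index.
    custom_build = bool(re.search(r"a\d+$", public))
    return {"release": release, "cuda_tag": cuda_tag, "custom_build": custom_build}


def index_variant(index_url: str) -> str:
    """Return the last path component of a wheel index URL, e.g. 'cu121'."""
    return index_url.rstrip("/").rsplit("/", 1)[-1]


if __name__ == "__main__":
    for version in ("1.13.0a0+d321be6", "2.1.0a0+32f93b1", "2.1.0+cu121"):
        print(version, describe_torch_build(version))
    print(index_variant("https://download.pytorch.org/whl/cu121"))
```

For a version such as `1.13.0a0+d321be6` the sketch finds no CUDA tag at all, so the PyTorch build cannot be matched by name against a `cu115` or `cu121` wheel index. That is only a hint that the two packages may not have been built against each other, not proof of an incompatibility.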

YuxinxinChen commented 11 months ago

Hi @q10, I am using the instructions from here. I get the same error with CUDA 12.1, PyTorch 2.1.0+cu121, and pip install fbgemm-gpu --index-url https://download.pytorch.org/whl/cu121, as described in my original report above. I also checked my LD_LIBRARY_PATH:

root@79f6f5f69f54:/usr# find . -name libtorch.so
./local/lib/python3.10/dist-packages/torch/lib/libtorch.so
root@79f6f5f69f54:/usr# find . -name "libnvidia-ml.so"
./local/cuda-12.2/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
root@79f6f5f69f54:/usr# echo $LD_LIBRARY_PATH
/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda-12.2/targets/x86_64-linux/lib/stubs

Both libtorch.so and libnvidia-ml.so can be found on that path, but the error still persists.
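
The manual check above can be sketched in Python. The helper below, `find_on_library_path` (a hypothetical name), reports which entries of an `LD_LIBRARY_PATH`-style string contain a file with an exact name. It matches exact file names only, so a versioned name such as `libnvidia-ml.so.1` is looked up separately from `libnvidia-ml.so`. It does not emulate the full search order of the dynamic loader.

```python
import os
from pathlib import Path
from typing import List


def find_on_library_path(filename: str, library_path: str) -> List[str]:
    """Return the entries of a colon-separated library path that contain `filename`."""
    hits = []
    for entry in library_path.split(os.pathsep):
        # Skip empty entries produced by leading, trailing, or doubled separators.
        if entry and (Path(entry) / filename).exists():
            hits.append(entry)
    return hits


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    for name in ("libtorch.so", "libnvidia-ml.so", "libnvidia-ml.so.1"):
        print(name, "->", find_on_library_path(name, current) or "not found")
```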

YuxinxinChen commented 11 months ago

@q10 I am using docker, so I think you could also reproduce the error. The steps I used:

docker pull nvcr.io/nvidia/pytorch:23.10-py3
docker run -it --gpus all --rm nvcr.io/nvidia/pytorch:23.10-py3 /bin/bash

Inside the container:

pip install fbgemm-gpu --index-url https://download.pytorch.org/whl/cu121

Then I use find to locate libtorch.so and libnvidia-ml.so:

cd /usr
find . -name libtorch.so
find . -name "libnvidia-ml.so"

I make sure the lib path is added to LD_LIBRARY_PATH.

Then I run:

python -c "import torch; import fbgemm_gpu; print(torch.ops.fbgemm.merge_pooled_embeddings)"

and got:

/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 746, in __getattr__
    op, overload_names = torch._C._jit_get_operation(qualified_op_name)
RuntimeError: No such operator fbgemm::jagged_2d_to_dense

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/__init__.py", line 22, in <module>
    from . import _fbgemm_gpu_docs  # noqa: F401, E402
  File "/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/_fbgemm_gpu_docs.py", line 19, in <module>
    torch.ops.fbgemm.jagged_2d_to_dense,
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 750, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense'

q10 commented 11 months ago

I was able to reproduce the error you ran into, and found this line:

libnvidia-ml.so.1: cannot open shared object file: No such file or directory

right before the AttributeError message. For some reason, your Docker setup doesn't have libnvidia-ml.so.1 available, even though libnvidia-ml.so is available. You can create a symlink named libnvidia-ml.so.1 in the same directory where libnvidia-ml.so is located, and expose that directory through LD_LIBRARY_PATH.
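
A minimal sketch of the symlink workaround, assuming the directory that holds `libnvidia-ml.so` is already known. `ensure_versioned_symlink` is a hypothetical helper, not something shipped with fbgemm_gpu.

```python
import os
from pathlib import Path


def ensure_versioned_symlink(lib_dir: str,
                             target: str = "libnvidia-ml.so",
                             link: str = "libnvidia-ml.so.1") -> Path:
    """Create `link` next to `target` inside `lib_dir` if it does not exist yet."""
    directory = Path(lib_dir)
    if not (directory / target).exists():
        raise FileNotFoundError(f"{target} not found in {directory}")
    link_path = directory / link
    # lexists() is also true for a dangling symlink, so we never overwrite one.
    if not os.path.lexists(link_path):
        # Relative link: libnvidia-ml.so.1 -> libnvidia-ml.so
        link_path.symlink_to(target)
    return link_path
```

The dynamic loader reads `LD_LIBRARY_PATH` when the process starts, so the directory should be exported in the shell before launching Python rather than set through `os.environ` from inside the interpreter.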

Note that the fbgemm_gpu package does not automatically install pytorch, so you will need to install pytorch nightly prior to running.

YuxinxinChen commented 10 months ago

@q10 Thanks for your reply. However, after I create a symbolic link for libnvidia-ml.so.1, I still get the same error:

root@ae0dd43f34c1:/workspace# python
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
2.1.0a0+32f93b1
>>> torch.cuda.is_available()
True
>>> import fbgemm_gpu
/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 746, in __getattr__
    op, overload_names = torch._C._jit_get_operation(qualified_op_name)
RuntimeError: No such operator fbgemm::jagged_2d_to_dense

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/__init__.py", line 22, in <module>
    from . import _fbgemm_gpu_docs  # noqa: F401, E402
  File "/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/_fbgemm_gpu_docs.py", line 19, in <module>
    torch.ops.fbgemm.jagged_2d_to_dense,
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 750, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense'
>>> exit()

root@ae0dd43f34c1:/workspace# echo $LD_LIBRARY_PATH
/usr/local/cuda-12.2/targets/x86_64-linux/lib/stubs:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64

root@ae0dd43f34c1:/workspace# ls -al /usr/local/cuda-12.2/targets/x86_64-linux/lib/stubs
total 943
drwxr-xr-x 2 root root     27 Nov 17 22:20 .
drwxr-xr-x 4 root root    107 Oct  4 02:02 ..
-rw-r--r-- 1 root root  79832 Aug 16 05:31 libcublas.so
-rw-r--r-- 1 root root  38872 Aug 16 05:31 libcublasLt.so
-rw-r--r-- 1 root root  66272 Aug 16 05:28 libcuda.so
-rw-r--r-- 1 root root   9400 Aug 16 06:06 libcufft.so
-rw-r--r-- 1 root root  13496 Aug 16 06:06 libcufftw.so
-rw-r--r-- 1 root root   9400 Aug 16 05:37 libcurand.so
-rw-r--r-- 1 root root 111800 Aug 16 05:57 libcusolver.so
-rw-r--r-- 1 root root  29880 Aug 16 05:57 libcusolverMg.so
-rw-r--r-- 1 root root  54456 Aug 16 05:30 libcusparse.so
-rw-r--r-- 1 root root   5304 Aug 16 05:53 libnppc.so
-rw-r--r-- 1 root root 259256 Aug 16 05:53 libnppial.so
-rw-r--r-- 1 root root 136376 Aug 16 05:53 libnppicc.so
-rw-r--r-- 1 root root 177336 Aug 16 05:53 libnppidei.so
-rw-r--r-- 1 root root 263352 Aug 16 05:53 libnppif.so
-rw-r--r-- 1 root root  87224 Aug 16 05:53 libnppig.so
-rw-r--r-- 1 root root  42168 Aug 16 05:53 libnppim.so
-rw-r--r-- 1 root root 427192 Aug 16 05:53 libnppist.so
-rw-r--r-- 1 root root   9400 Aug 16 05:53 libnppisu.so
-rw-r--r-- 1 root root  54456 Aug 16 05:53 libnppitc.so
-rw-r--r-- 1 root root 222392 Aug 16 05:53 libnpps.so
-rw-r--r-- 1 root root   9400 Aug 16 05:39 libnvJitLink.so
-rw-r--r-- 1 root root  55064 Aug 16 05:12 libnvidia-ml.so
lrwxrwxrwx 1 root root     15 Nov 17 22:20 libnvidia-ml.so.1 -> libnvidia-ml.so
-rw-r--r-- 1 root root  13496 Aug 16 05:29 libnvjpeg.so
-rw-r--r-- 1 root root   5304 Aug 16 05:28 libnvrtc.so

root@ae0dd43f34c1:/workspace# ls /usr/local/lib/python3.10/dist-packages/torch/lib
libbackend_with_compiler.so  libc10d_cuda_test.so   libnvfuser_codegen.so  libtorch_cpu.so          libtorch_global_deps.so
libc10.so                    libcaffe2_nvrtc.so     libshm.so              libtorch_cuda.so         libtorch_python.so
libc10_cuda.so               libjitbackend_test.so  libtorch.so            libtorch_cuda_linalg.so  libtorchbind_test.so

YuxinxinChen commented 10 months ago

@q10

Note that the fbgemm_gpu package does not automatically install pytorch, so you will need to install pytorch nightly prior to running.

The Docker image has PyTorch installed with it. Are you suggesting that fbgemm_gpu is not able to see the PyTorch installation in the Docker image?

q10 commented 10 months ago

Hmm, this appears to be a problem. I was looking at the nightly builds, and the symbol torch::autograd::Node::name() (_ZNK5torch8autograd4Node4nameEv in mangled form) does appear to be undefined. We will investigate this more next week.
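
To show how the mangled and readable forms of the symbol correspond, here is a deliberately tiny demangler sketch. It handles only the simple Itanium C++ ABI shape seen in this thread: a nested name (`_ZN ... E`), an optional `K` for a const member function, and a `v` marking an empty parameter list. `demangle_simple` is a hypothetical helper; for real use, `c++filt` or `nm -C` is the appropriate tool.

```python
def demangle_simple(symbol: str) -> str:
    """Demangle a zero-argument nested name such as _ZNK5torch8autograd4Node4nameEv."""
    if not symbol.startswith("_ZN"):
        raise ValueError("only _ZN... nested names are handled")
    pos = 3
    is_const = False
    if symbol[pos] == "K":  # K marks a const-qualified member function
        is_const = True
        pos += 1
    parts = []
    while symbol[pos] != "E":  # E closes the nested name
        start = pos
        while symbol[pos].isdigit():
            pos += 1
        if pos == start:
            raise ValueError("expected a length-prefixed identifier")
        length = int(symbol[start:pos])
        parts.append(symbol[pos:pos + length])
        pos += length
    if symbol[pos + 1:] != "v":  # v means the function takes no arguments
        raise ValueError("only zero-argument functions are handled")
    return "::".join(parts) + "()" + (" const" if is_const else "")


if __name__ == "__main__":
    print(demangle_simple("_ZNK5torch8autograd4Node4nameEv"))
```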

We generally install PyTorch and fbgemm_gpu through pip inside a Conda environment. This is to enable environment reproducibility for cases where we run into build / install issues, and this might account for why we haven't run into the undefined symbol problem yet.

YuxinxinChen commented 10 months ago

Hi @q10, I've explored various approaches to resolve the issue. Initially, I attempted to build fbgemm_gpu from the source code, which succeeded. However, when I imported fbgemm_gpu in Python, it threw an "illegal instruction" error, leading to a forced termination of the entire Python interaction environment. Interestingly, a similar situation occurred during the installation from conda pip.

I thought sharing this additional information might be helpful in diagnosing the problem. Any insights or guidance you can provide would be much appreciated.

q10 commented 10 months ago

@YuxinxinChen Unfortunately the build instructions do not explicitly emphasize this, but use of the PyTorch installation that comes with the Docker image is discouraged, and one should always create the full environment (i.e. install PyTorch from scratch) inside a Conda environment for consistency and reproducibility.

You also mentioned earlier that the installed PyTorch version is 1.13, but that has been deprecated, and is likely the reason you are running into the undefined symbol error when loading the module.

This is what I'm running so far:

# Launch Docker instance (GPU not required)
docker run -it amazonlinux:2023  /bin/bash

# Install tools
yum update -y; yum install -y binutils findutils git pciutils sudo tar wget which

# Clone the repo
git clone --recurse-submodules https://github.com/pytorch/FBGEMM.git

# Load the build scripts
cd FBGEMM && . .github/scripts/setup_env.bash

# Set up Miniconda
setup_miniconda ~/local/miniconda3

# Set up a Conda environment (with CUDA 12.1 and PyTorch nightly for CUDA 12.1)
env_name=foo
test_setup_conda_environment $env_name 3.11 pip nightly cuda 12.1.0

# install FBGEMM_GPU from PIP (CUDA 12.1 variant)
install_fbgemm_gpu_pip $env_name nightly cuda 12.1.0

At this point, the last step will fail because libnvidia-ml.so.1 is not found:

(base) bash-5.2# conda run -n foo python -c "import fbgemm_gpu"
libnvidia-ml.so.1: cannot open shared object file: No such file or directory

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/root/local/miniconda3/envs/foo/lib/python3.11/site-packages/fbgemm_gpu/__init__.py", line 23, in <module>
    from . import _fbgemm_gpu_docs, sparse_ops  # noqa: F401, E402  # noqa: F401, E402
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/local/miniconda3/envs/foo/lib/python3.11/site-packages/fbgemm_gpu/_fbgemm_gpu_docs.py", line 19, in <module>
    torch.ops.fbgemm.jagged_2d_to_dense,
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/local/miniconda3/envs/foo/lib/python3.11/site-packages/torch/_ops.py", line 820, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense'
ERROR conda.cli.main_run:execute(49): `conda run python -c import fbgemm_gpu` failed. (See above for error)

But after creating the symlink and specifying LD_LIBRARY_PATH, the module load works:

# Create symlink
(base) bash-5.2# ln -s /root/local/miniconda3/envs/foo/lib/stubs/libnvidia-ml.so /root/local/miniconda3/envs/foo/lib/stubs/libnvidia-ml.so.1

# Load FBGEMM_GPU
(base) bash-5.2# LD_LIBRARY_PATH=/root/local/miniconda3/envs/foo/lib/stubs conda run -n foo python -c "import fbgemm_gpu; print(fbgemm_gpu.__version__)"
2023.11.21+cu121

(base) bash-5.2#

Undefined symbols still exist in libtorch as expected:

(base) bash-5.2# nm -gDCu /root/local/miniconda3/envs/foo/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so | sort | grep Node::name
                 U torch::autograd::Node::name() const
(base) bash-5.2#

but this doesn't affect the loading of the module.
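
For anyone who wants to script the symbol check, the sketch below filters `nm` output for undefined (`U`) entries. `undefined_symbols` and `run_nm` are hypothetical helper names; `run_nm` simply wraps the same `nm -gDCu` invocation shown above and assumes binutils is installed.

```python
import subprocess
from typing import List


def undefined_symbols(nm_output: str) -> List[str]:
    """Collect symbol names from nm output lines whose type column is 'U'."""
    symbols = []
    for line in nm_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0] == "U":
            symbols.append(fields[1].strip())
    return symbols


def run_nm(library_path: str) -> str:
    """Run `nm -gDCu` on a shared library and return its text output."""
    result = subprocess.run(
        ["nm", "-gDCu", library_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    sample = "                 U torch::autograd::Node::name() const\n"
    print(undefined_symbols(sample))
```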

Could you try running the aforementioned steps on the NVIDIA Docker image you have been using and let me know how it goes?

YuxinxinChen commented 10 months ago

@q10 Thanks for the detailed steps above! They are really helpful. After following them, I got the illegal instruction error:

[INSTALL] Successfully installed PyTorch through PyTorch PIP
(base) [root@b73d0643267c FBGEMM]# install_fbgemm_gpu_pip $env_name nightly cuda 12.1.0
################################################################################
# Install FBGEMM-GPU Package from PIP
#
# [2023-11-22T19:05:07.388Z] + install_fbgemm_gpu_pip fbgemm_gpu nightly cuda 12.1.0
################################################################################

################################################################################
# Install fbgemm_gpu (PyTorch PIP)
#
# [2023-11-22T19:05:07.391Z] + install_from_pytorch_pip fbgemm_gpu fbgemm_gpu nightly cuda 12.1.0
################################################################################

[CHECK] Network does not appear to be blocked.
################################################################################
# Install fbgemm_gpu (PyTorch PIP)
#
# [2023-11-22T19:05:07.578Z] + __extract_pip_arguments fbgemm_gpu fbgemm_gpu nightly cuda 12.1.0
################################################################################

[CHECK] Network does not appear to be blocked.
[INSTALL] Extracted package variant: cu121
[INSTALL] Attempting to install [fbgemm-gpu, nightly+cu121] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu121/ ...
[EXEC] [ATTEMPT 0/3]    + conda run -n fbgemm_gpu pip install --pre fbgemm-gpu --extra-index-url https://download.pytorch.org/whl/nightly/cu121/
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/nightly/cu121/
Collecting fbgemm-gpu
  Downloading https://download.pytorch.org/whl/nightly/cu121/fbgemm_gpu-2023.11.22%2Bcu121-cp311-cp311-manylinux2014_x86_64.whl (338.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 338.3/338.3 MB 5.4 MB/s eta 0:00:00
Requirement already satisfied: numpy in /root/local/miniconda3/envs/fbgemm_gpu/lib/python3.11/site-packages (from fbgemm-gpu) (1.26.0)
Installing collected packages: fbgemm-gpu
Successfully installed fbgemm-gpu-2023.11.22+cu121

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

fbgemm-gpu               2023.11.22+cu121
[CHECK] The installed package [fbgemm-gpu, nightly] is the correct variant (cu121)
[INSTALL] Checking imports and symbols ...
/tmp/tmpf9cp_twp: line 3: 13744 Illegal instruction     (core dumped) python -c 'import fbgemm_gpu'

ERROR conda.cli.main_run:execute(49): `conda run python -c import fbgemm_gpu` failed. (See above for error)
[CHECK] Python package 'fbgemm_gpu' was not found, or the package is broken!

These are the highlights of some of the steps:

[INSTALL] Successfully installed PyTorch through PyTorch PIP
(base) [root@b73d0643267c FBGEMM]# install_fbgemm_gpu_pip $env_name nightly cuda 12.1.0
Successfully installed fbgemm-gpu-2023.11.22+cu121
[INSTALL] Checking imports and symbols ...
/tmp/tmpf9cp_twp: line 3: 13744 Illegal instruction (core dumped) python -c 'import fbgemm_gpu'
ERROR conda.cli.main_run:execute(49): `conda run python -c import fbgemm_gpu` failed. (See above for error)

I am not sure why this is happening. I will also try a different machine to see whether this is a machine-specific problem. If you have encountered this error before, could you please give me some guidance? Many thanks!

q10 commented 10 months ago

@YuxinxinChen Hmm, we haven't run into an illegal instruction problem in our setups. I wonder if the issue is hardware-related. Could you show us the output of nvidia-smi?

One thing to note is that FBGEMM_GPU currently has official support for Volta and Ampere GPUs, but not the older architectures, so there is no guarantee of the library working on systems with older graphics cards.
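
Since the crash is suspected to be hardware-related, one additional check that could be run alongside nvidia-smi is a look at the CPU feature flags, because an "Illegal instruction" crash is raised when the processor meets an instruction it does not support. The sketch below parses text in the format of Linux `/proc/cpuinfo` on x86. The helpers `cpu_flags` and `missing_features` are hypothetical, and the feature names checked (`avx2`, `fma`, `avx512f`) are illustrative guesses, not a documented requirement list for fbgemm_gpu. Nothing in this thread confirms that the CPU is the cause.

```python
from typing import Iterable, List, Set


def cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(cpuinfo_text: str,
                     wanted: Iterable[str] = ("avx2", "fma", "avx512f")) -> List[str]:
    """List the wanted features (an assumed set) that the CPU does not report."""
    flags = cpu_flags(cpuinfo_text)
    return [feature for feature in wanted if feature not in flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            print("missing:", missing_features(handle.read()))
    except FileNotFoundError:
        print("/proc/cpuinfo is not available on this system")
```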

YuxinxinChen commented 10 months ago

Hi @q10, here is my nvidia-smi:

(base) yuxin420@mario:~$ nvidia-smi
Sat Nov 25 21:42:56 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-32GB            On | 00000000:02:00.0 Off |                    0 |
| N/A   32C    P0               36W / 250W|      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

It is a Volta GPU, so I guess it is not a GPU architecture problem.

q10 commented 10 months ago

@YuxinxinChen We can try one more thing - could you try to run only on the bare OS (i.e. no Docker) but with a Conda environment with all the CUDA and PyTorch packages installed inside that environment, and let me know how that goes?

YuxinxinChen commented 10 months ago

Hi @q10, I am able to get fbgemm_gpu imported on gcloud using the commands you provided above! The illegal instruction error appears to be related to the machine I used in our lab, though I have no idea why it gives me this error. Sorry for the trouble, and thank you very much for the time you spent with me on this issue!

Again many thanks!

Best, Yuxin

q10 commented 10 months ago

@YuxinxinChen No problem and no worries. Do let us know if you end up figuring out what is different about the machine in your lab that is causing the crash, as well as a way to address it. We try to collect as many user problems as we can, and your feedback helps us improve the overall FBGEMM user experience.