Open AlienLiang23 opened 10 months ago
Hi @AlienLiang23, we currently don't support Intel GPUs as far as I'm aware, but in any case, could you show us the commands that you ran to perform the installation and the full log of the error message?
We generally recommend setting up FBGEMM_GPU and all of its dependencies inside a Conda environment, per our installation instructions.
Thanks for your reply, @q10. fbgemm was installed while building torch. When I found that issue, I uninstalled it using
pip uninstall fbgemm-gpu
and re-installed it using
pip install fbgemm-gpu
It was mentioned that on an A100, this same problem was resolved by reinstalling.
full log:
Testing model torchrec_dlrma
Testing with training mode.
Test amp with dt: torch.bfloat16
loading model: 0it [00:00, ?it/s]
libcudart.so.12: cannot open shared object file: No such file or directory
Traceback (most recent call last):
File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torch/_ops.py", line 757, in __getattr__
op, overload_names = torch._C._jit_get_operation(qualified_op_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: No such operator fbgemm::jagged_2d_to_dense
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/gta_local/yongliang/frameworks.ai.pytorch.private-gpu/benchmarks/dynamo/torchbench.py", line 481, in <module>
torchbench_main()
File "/home/gta_local/yongliang/frameworks.ai.pytorch.private-gpu/benchmarks/dynamo/torchbench.py", line 477, in torchbench_main
main(TorchBenchmarkRunner(), original_dir)
File "/home/gta_local/yongliang/frameworks.ai.pytorch.private-gpu/benchmarks/dynamo/common.py", line 3034, in main
process_entry(0, runner, original_dir, args)
File "/home/gta_local/yongliang/frameworks.ai.pytorch.private-gpu/benchmarks/dynamo/common.py", line 2991, in process_entry
return maybe_fresh_cache(
^^^^^^^^^^^^^^^^^^
File "/home/gta_local/yongliang/frameworks.ai.pytorch.private-gpu/benchmarks/dynamo/common.py", line 1654, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/gta_local/yongliang/frameworks.ai.pytorch.private-gpu/benchmarks/dynamo/common.py", line 3444, in run
) = runner.load_model(
^^^^^^^^^^^^^^^^^^
File "/home/gta_local/yongliang/frameworks.ai.pytorch.private-gpu/benchmarks/dynamo/torchbench.py", line 313, in load_model
module = importlib.import_module(c)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 940, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/home/gta_local/yongliang/benchmark/torchbenchmark/canary_models/torchrec_dlrm/__init__.py", line 7, in <module>
from .data.dlrm_dataloader import get_dataloader
File "/home/gta_local/yongliang/benchmark/torchbenchmark/canary_models/torchrec_dlrm/data/dlrm_dataloader.py", line 13, in <module>
from torchrec.datasets.criteo import (
File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torchrec/__init__.py", line 8, in <module>
import torchrec.distributed # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torchrec/distributed/__init__.py", line 36, in <module>
from torchrec.distributed.model_parallel import DistributedModelParallel # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torchrec/distributed/model_parallel.py", line 21, in <module>
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torchrec/distributed/planner/__init__.py", line 22, in <module>
from torchrec.distributed.planner.planners import EmbeddingShardingPlanner # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torchrec/distributed/planner/planners.py", line 19, in <module>
from torchrec.distributed.planner.constants import BATCH_SIZE, MAX_SIZE
File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torchrec/distributed/planner/constants.py", line 10, in <module>
from torchrec.distributed.embedding_types import EmbeddingComputeKernel
File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torchrec/distributed/embedding_types.py", line 14, in <module>
from fbgemm_gpu.split_table_batched_embeddings_ops_training import EmbeddingLocation
File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/fbgemm_gpu/__init__.py", line 22, in <module>
from . import _fbgemm_gpu_docs # noqa: F401, E402
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/fbgemm_gpu/_fbgemm_gpu_docs.py", line 19, in <module>
torch.ops.fbgemm.jagged_2d_to_dense,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torch/_ops.py", line 761, in __getattr__
raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense'
@AlienLiang23 based on the logs pasted, the actual error appears to be:
libcudart.so.12: cannot open shared object file: No such file or directory
Unfortunately, every loading error ends up with the signature AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense', which can be cryptic.
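Since the final AttributeError hides the real cause, it can help to scan the full log for loader messages first. The following is only a rough sketch: the function name `missing_shared_libraries` and the regular expression are illustrative assumptions, not part of FBGEMM_GPU or PyTorch, and it only recognizes the "cannot open shared object file" wording seen in the log above.

```python
import re

# Matches loader messages of the form
#   "<library>: cannot open shared object file: ..."
# as seen in the pasted log. The pattern is an assumption based on that one line.
MISSING_LIB = re.compile(
    r"^[ \t]*(\S+): cannot open shared object file", re.MULTILINE
)


def missing_shared_libraries(log_text):
    """Return the names of shared libraries the log reports as unopenable."""
    return MISSING_LIB.findall(log_text)


if __name__ == "__main__":
    sample = (
        "loading model: 0it [00:00, ?it/s]\n"
        "libcudart.so.12: cannot open shared object file: No such file or directory\n"
        "RuntimeError: No such operator fbgemm::jagged_2d_to_dense\n"
    )
    print(missing_shared_libraries(sample))  # ['libcudart.so.12']
```

If the list is non-empty, those libraries are the first thing to look at, before the operator lookup error.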
As the error message indicates, libcudart.so.12 needs to be installed somewhere on the dynamic loader's search path (for example, a directory listed in LD_LIBRARY_PATH). I believe this library should be available when you install the full CUDA package onto your system.
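A quick way to check whether the dynamic loader can find the library is to try opening it from Python with the standard `ctypes` module. This is a minimal sketch rather than an official diagnostic: the helper name `try_load` is hypothetical, and success here only shows that the library can be opened, not that FBGEMM_GPU will work on the device.

```python
import ctypes


def try_load(lib_name):
    """Try to open a shared library by name.

    Returns (True, None) on success, or (False, error_text) if the
    dynamic loader cannot open it.
    """
    try:
        ctypes.CDLL(lib_name)
        return True, None
    except OSError as exc:
        return False, str(exc)


if __name__ == "__main__":
    # Library name taken from the error message in the log above.
    ok, err = try_load("libcudart.so.12")
    print("loadable" if ok else f"not loadable: {err}")
```

If this reports the library as not loadable in the same environment where the benchmark runs, the FBGEMM_GPU import will fail in the same way.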
Hi, I tried to run torchrec_dlrm in torchbench on an Intel GPU and got this error:
I tried to re-install fbgemm-gpu as the documentation recommends, but it didn't help. Is this issue expected on Intel devices?