pytorch / FBGEMM

FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/

RuntimeError: No such operator fbgemm::jagged_2d_to_dense #2191

Open AlienLiang23 opened 10 months ago

AlienLiang23 commented 10 months ago

Hi, I tried to run torchrec_dlrm from torchbench on an Intel GPU and got this error:

Traceback (most recent call last):
  File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torch/_ops.py", line 757, in __getattr__
    op, overload_names = torch._C._jit_get_operation(qualified_op_name)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: No such operator fbgemm::jagged_2d_to_dense

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/fbgemm_gpu/__init__.py", line 22, in <module>
    from . import _fbgemm_gpu_docs  # noqa: F401, E402
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/fbgemm_gpu/_fbgemm_gpu_docs.py", line 19, in <module>
    torch.ops.fbgemm.jagged_2d_to_dense,
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torch/_ops.py", line 761, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense'

I tried re-installing fbgemm-gpu as the documentation recommends, but it didn't help. Is this issue expected on an Intel device?
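
For reference, the failure reduces to a minimal check of whether the fbgemm operators were ever registered (a sketch that assumes only that torch is importable):

import torch

# fbgemm_gpu registers its operators into the torch.ops.fbgemm namespace on
# import; if the native library fails to load, the namespace stays empty and
# attribute lookups on it raise AttributeError.
try:
    import fbgemm_gpu  # noqa: F401
except Exception as e:
    print(f"fbgemm_gpu import failed: {e}")

print(hasattr(torch.ops.fbgemm, "jagged_2d_to_dense"))  # False on this setup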

q10 commented 10 months ago

Hi @AlienLiang23, we currently don't support Intel GPUs as far as I'm aware, but in any case, could you show us the commands you ran to perform the installation and the full log of the error message?

We generally recommend setting up FBGEMM_GPU and all of its dependencies inside a Conda environment, per our installation instructions.

AlienLiang23 commented 10 months ago

Thanks for your reply @q10. fbgemm was installed while building torch. I found an issue mentioning that on an A100 this same problem was resolved by reinstalling, so I uninstalled it with pip uninstall fbgemm-gpu and re-installed it with pip install fbgemm-gpu, but the error persists.
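
One sanity check (a hedged sketch, assuming the stock PyPI wheel, which is built against a specific CUDA toolkit) is to compare the fbgemm-gpu build against the installed torch:

import importlib.metadata

import torch

# The fbgemm-gpu wheel must match the CUDA runtime torch was built with;
# torch.version.cuda is None for non-CUDA builds (e.g. a CPU or XPU torch).
print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("fbgemm-gpu:", importlib.metadata.version("fbgemm-gpu"))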

full log:

Testing model torchrec_dlrm
Testing with training mode.
Test amp with dt: torch.bfloat16
loading model: 0it [00:00, ?it/s]
libcudart.so.12: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torch/_ops.py", line 757, in __getattr__
    op, overload_names = torch._C._jit_get_operation(qualified_op_name)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: No such operator fbgemm::jagged_2d_to_dense

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/gta_local/yongliang/frameworks.ai.pytorch.private-gpu/benchmarks/dynamo/torchbench.py", line 481, in <module>
    torchbench_main()
  File "/home/gta_local/yongliang/frameworks.ai.pytorch.private-gpu/benchmarks/dynamo/torchbench.py", line 477, in torchbench_main
    main(TorchBenchmarkRunner(), original_dir)
  File "/home/gta_local/yongliang/frameworks.ai.pytorch.private-gpu/benchmarks/dynamo/common.py", line 3034, in main
    process_entry(0, runner, original_dir, args)
  File "/home/gta_local/yongliang/frameworks.ai.pytorch.private-gpu/benchmarks/dynamo/common.py", line 2991, in process_entry
    return maybe_fresh_cache(
           ^^^^^^^^^^^^^^^^^^
  File "/home/gta_local/yongliang/frameworks.ai.pytorch.private-gpu/benchmarks/dynamo/common.py", line 1654, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/gta_local/yongliang/frameworks.ai.pytorch.private-gpu/benchmarks/dynamo/common.py", line 3444, in run
    ) = runner.load_model(
        ^^^^^^^^^^^^^^^^^^
  File "/home/gta_local/yongliang/frameworks.ai.pytorch.private-gpu/benchmarks/dynamo/torchbench.py", line 313, in load_model
    module = importlib.import_module(c)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/gta_local/yongliang/benchmark/torchbenchmark/canary_models/torchrec_dlrm/__init__.py", line 7, in <module>
    from .data.dlrm_dataloader import get_dataloader
  File "/home/gta_local/yongliang/benchmark/torchbenchmark/canary_models/torchrec_dlrm/data/dlrm_dataloader.py", line 13, in <module>
    from torchrec.datasets.criteo import (
  File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torchrec/__init__.py", line 8, in <module>
    import torchrec.distributed  # noqa
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torchrec/distributed/__init__.py", line 36, in <module>
    from torchrec.distributed.model_parallel import DistributedModelParallel  # noqa
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torchrec/distributed/model_parallel.py", line 21, in <module>
    from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
  File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torchrec/distributed/planner/__init__.py", line 22, in <module>
    from torchrec.distributed.planner.planners import EmbeddingShardingPlanner  # noqa
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torchrec/distributed/planner/planners.py", line 19, in <module>
    from torchrec.distributed.planner.constants import BATCH_SIZE, MAX_SIZE
  File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torchrec/distributed/planner/constants.py", line 10, in <module>
    from torchrec.distributed.embedding_types import EmbeddingComputeKernel
  File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torchrec/distributed/embedding_types.py", line 14, in <module>
    from fbgemm_gpu.split_table_batched_embeddings_ops_training import EmbeddingLocation
  File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/fbgemm_gpu/__init__.py", line 22, in <module>
    from . import _fbgemm_gpu_docs  # noqa: F401, E402
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/fbgemm_gpu/_fbgemm_gpu_docs.py", line 19, in <module>
    torch.ops.fbgemm.jagged_2d_to_dense,
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gta/miniconda3/envs/yongliang/lib/python3.11/site-packages/torch/_ops.py", line 761, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense'

q10 commented 10 months ago

@AlienLiang23 based on the logs pasted, the actual error appears to be:

libcudart.so.12: cannot open shared object file: No such file or directory

Unfortunately, every loading failure ends with the same signature, AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense', which can be cryptic: importing fbgemm_gpu triggers operator lookups, so whatever prevented the native library from loading surfaces as this AttributeError instead.

As stated in the error message, you will need to make libcudart.so.12 visible to the dynamic loader, e.g. by adding its directory to LD_LIBRARY_PATH. I believe it should be available once you install the full CUDA package onto your system.
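
As a quick sketch for confirming this (assuming a Linux system, where the dynamic loader consults LD_LIBRARY_PATH among other locations), you can test whether libcudart.so.12 resolves before importing fbgemm_gpu:

import ctypes
import os

# CDLL raises OSError when the dynamic loader cannot find the library.
print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH", "<unset>"))
try:
    ctypes.CDLL("libcudart.so.12")
    print("libcudart.so.12 loaded successfully")
except OSError as e:
    print("libcudart.so.12 not found:", e)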