intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

[torchbench] torchrec_dlrm fails to run #548

Open pbchekin opened 7 months ago

pbchekin commented 7 months ago
./inductor_xpu_test.sh torchbench amp_fp16 inference accuracy xpu 0 static 1 0 torchrec_dlrm
Traceback (most recent call last):
  File "/home/jovyan/pytorch/benchmarks/dynamo/torchbench.py", line 481, in <module>
    torchbench_main()
  File "/home/jovyan/pytorch/benchmarks/dynamo/torchbench.py", line 477, in torchbench_main
    main(TorchBenchmarkRunner(), original_dir)
  File "/home/jovyan/pytorch/benchmarks/dynamo/common.py", line 3041, in main
    process_entry(0, runner, original_dir, args)
  File "/home/jovyan/pytorch/benchmarks/dynamo/common.py", line 2998, in process_entry
    return maybe_fresh_cache(
  File "/home/jovyan/pytorch/benchmarks/dynamo/common.py", line 1661, in inner
    return fn(*args, **kwargs)
  File "/home/jovyan/pytorch/benchmarks/dynamo/common.py", line 3451, in run
    ) = runner.load_model(
  File "/home/jovyan/pytorch/benchmarks/dynamo/torchbench.py", line 313, in load_model
    module = importlib.import_module(c)
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/home/jovyan/benchmark/torchbenchmark/canary_models/torchrec_dlrm/__init__.py", line 7, in <module>
    from .data.dlrm_dataloader import get_dataloader
  File "/home/jovyan/benchmark/torchbenchmark/canary_models/torchrec_dlrm/data/dlrm_dataloader.py", line 13, in <module>
    from torchrec.datasets.criteo import (
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/torchrec/__init__.py", line 8, in <module>
    import torchrec.distributed  # noqa
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/torchrec/distributed/__init__.py", line 36, in <module>
    from torchrec.distributed.model_parallel import DistributedModelParallel  # noqa
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/torchrec/distributed/model_parallel.py", line 24, in <module>
    from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/torchrec/distributed/planner/__init__.py", line 22, in <module>
    from torchrec.distributed.planner.planners import EmbeddingShardingPlanner  # noqa
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/torchrec/distributed/planner/planners.py", line 19, in <module>
    from torchrec.distributed.planner.constants import BATCH_SIZE, MAX_SIZE
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/torchrec/distributed/planner/constants.py", line 10, in <module>
    from torchrec.distributed.embedding_types import EmbeddingComputeKernel
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/torchrec/distributed/embedding_types.py", line 14, in <module>
    from fbgemm_gpu.split_table_batched_embeddings_ops_training import EmbeddingLocation
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/fbgemm_gpu/__init__.py", line 23, in <module>
    from . import _fbgemm_gpu_docs, sparse_ops  # noqa: F401, E402  # noqa: F401, E402
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/fbgemm_gpu/_fbgemm_gpu_docs.py", line 19, in <module>
    torch.ops.fbgemm.jagged_2d_to_dense,
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/torch/_ops.py", line 761, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense'
gshimansky commented 7 months ago

The problem with this benchmark is that it unconditionally imports torchrec, which in turn unconditionally imports fbgemm. Both libraries appear to target only CUDA (fbgemm in particular) and are not expected to work on other GPUs.
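One way to keep a benchmark run from crashing on machines without a working fbgemm_gpu is to probe for the operator before launching the model. This is a minimal sketch. It assumes python3 is the interpreter the benchmark uses, and fbgemm_ops_available is a hypothetical helper, not part of the benchmark scripts. Any exception raised during import is treated as "not available", including the AttributeError from the traceback above.

```shell
# Hypothetical helper (not part of the benchmark scripts): exit status 0 if
# fbgemm_gpu imports cleanly and torch.ops.fbgemm.jagged_2d_to_dense is
# registered, non-zero otherwise. Assumes python3 is the benchmark's interpreter.
fbgemm_ops_available() {
  python3 - <<'EOF' 2>/dev/null
import sys

try:
    import torch
    # Importing fbgemm_gpu loads its native library, which registers torch.ops.fbgemm.*;
    # with a mismatched build this import itself raises the AttributeError seen above.
    import fbgemm_gpu  # noqa: F401
except Exception:
    sys.exit(1)

# hasattr() returns False when _OpNamespace.__getattr__ raises AttributeError.
sys.exit(0 if hasattr(torch.ops.fbgemm, "jagged_2d_to_dense") else 1)
EOF
}
```

A runner script could call fbgemm_ops_available before ./inductor_xpu_test.sh ... torchrec_dlrm and skip the model with a message when it returns non-zero. This only avoids the crash. It does not make the benchmark runnable.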

gshimansky commented 7 months ago

The torchrec README suggests installing fbgemm_gpu for CPU with the command pip install fbgemm-gpu --index-url https://download.pytorch.org/whl/nightly/cpu, but with this version I get AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense' again. The cause is an incompatibility between the fbgemm nightly and our PyTorch version. When the native code is loaded, it fails with /home/jovyan/.conda/envs/triton-no-conda-3.10-stonepia/lib/python3.10/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv, so none of the native functions get registered.
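The undefined-symbol error can be checked by hand. The mangled name _ZNK5torch8autograd4Node4nameEv demangles to torch::autograd::Node::name() const, which suggests the nightly fbgemm_gpu was built against a libtorch that exports that method while the installed one does not. The helpers below are a hypothetical sketch and require binutils (nm and c++filt). Which libtorch shared object should define the symbol is an assumption (for example libtorch_cpu.so in the lib directory of the torch package).

```shell
# Hypothetical helpers for diagnosing the "undefined symbol" error; require binutils.

# Print the human-readable C++ name of a mangled symbol.
demangle() {
  printf '%s\n' "$1" | c++filt
}

# Succeed if shared library $2 defines (exports) the exact symbol $1 in its
# dynamic symbol table; fail otherwise, including when $2 cannot be read.
exports_symbol() {
  nm -D --defined-only "$2" 2>/dev/null | awk '{print $NF}' | grep -qx -- "$1"
}
```

For example, running exports_symbol _ZNK5torch8autograd4Node4nameEv against the libtorch_cpu.so of the active environment is expected to fail in the broken setup if the assumption about which library defines the symbol holds.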

gshimansky commented 7 months ago

OK, it is possible to make this benchmark work, but it takes considerable effort.

  1. The fbgemm_gpu package is built for CUDA by default, and we cannot use the CPU binaries because they don't match our PyTorch installation, so fbgemm_gpu has to be built from source.
  2. There is a bug in the fbgemm_gpu sources that needs to be patched. These three lines should be deleted, because they should only exist under an #ifdef: https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_cpu.cpp#L1661-L1663
  3. Some extra packages are required to build the library successfully. Here are the configuration and build commands I used:
    mkdir build
    cd build
    cmake -DUSE_SANITIZER=address -DFBGEMM_LIBRARY_TYPE=shared -DPYTHON_EXECUTABLE=`which python3` -DFBGEMM_BUILD_DOCS=OFF -DFBGEMM_BUILD_BENCHMARKS=OFF -DCMAKE_INSTALL_PREFIX=${CONDA_PREFIX} ..
    make -j
    make install
    cd ../fbgemm_gpu
    export package_name=fbgemm_gpu_cpu
    export python_tag=py310
    export ARCH=$(uname -m)
    export python_plat_name="manylinux2014_${ARCH}"
    python setup.py bdist_wheel \
        --package_variant=cpu \
        --package_name="${package_name}" \
        --python-tag="${python_tag}" \
        --plat-name="${python_plat_name}"
    python setup.py install --package_variant=cpu

    It is possible that the C++ library is not required for Python and that building only the Python fbgemm_gpu package is enough.
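Steps 2 and 3 above can be partly scripted. This is a hypothetical sketch. The names JAGGED_CPP, show_lines and delete_lines are illustrative, and the script assumes GNU sed and a shell started in the root of an FBGEMM checkout. The line numbers 1661-1663 refer to the revision that was current when the link was posted, so check them with show_lines before deleting anything. The sketch also derives python_tag from the interpreter in use instead of hard-coding py310, because the environments in this thread run both Python 3.9 and 3.10.

```shell
# Sketch of steps 2-3, run from the root of an FBGEMM checkout (assumption).
# JAGGED_CPP, show_lines and delete_lines are illustrative names.
JAGGED_CPP=fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_cpu.cpp

# Print lines FIRST..LAST of FILE so they can be checked before deleting them.
show_lines() {
  sed -n "${2},${3}p" "$1"
}

# Delete lines FIRST..LAST of FILE in place (GNU sed; BSD sed needs -i '').
delete_lines() {
  sed -i "${2},${3}d" "$1"
}

# Derive the wheel tag (e.g. py310) from the interpreter that will import the
# package instead of hard-coding it.
python_tag=$(python3 -c 'import sys; print("py%d%d" % sys.version_info[:2])')
export python_tag
```

Typical use is show_lines "$JAGGED_CPP" 1661 1663, then delete_lines "$JAGGED_CPP" 1661 1663 once the printed lines are confirmed to be the ones that belong under the #ifdef, and finally the setup.py commands with the exported python_tag.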

gshimansky commented 7 months ago

Bug report on fbgemm_gpu build https://github.com/pytorch/FBGEMM/issues/2362

vlad-penkin commented 4 months ago

The issue is still reproducible.