Issue Running Single Task "insert_onto_square_peg" Due to Operator FarthestPointSampler Not Supporting CUDA Device

LemonWade commented 7 months ago

Firstly, I'd like to express my gratitude for your work on this project. I've been attempting to run the code, specifically the single task insert_onto_square_peg. However, I've encountered an error stating that the operator FarthestPointSampler does not support the CUDA device.

I'm not sure where I might have gone wrong in the process. Any advice or suggestions you could offer would be greatly appreciated. Thank you very much for your help.

Error message:

Arguments:
{'accumulate_grad_batches': 1,
 'backbone': 'clip',
 'base_log_dir': PosixPath('train_logs'),
 'batch_size': 2,
 'batch_size_val': 14,
 'cache_size': 600,
 'cache_size_val': 0,
 'cameras': ('left_shoulder', 'right_shoulder', 'wrist', 'front'),
 'checkpoint': None,
 'dataset': PosixPath('data/peract/Peract_packaged/train'),
 'dense_interpolation': 1,
 'diffusion_timesteps': 100,
 'embedding_dim': 120,
 'eval_only': 0,
 'exp_log_dir': 'Actor_18Peract_100Demo_multitask',
 'fps_subsampling_factor': 5,
 'gripper_loc_bounds': 'tasks/18_peract_tasks_location_bounds.json',
 'gripper_loc_bounds_buffer': 0.04,
 'image_rescale': '0.75,1.25',
 'image_size': '256,256',
 'instructions': PosixPath('instructions.pkl'),
 'interpolation_length': 2,
 'keypose_only': 1,
 'lang_enhanced': 0,
 'lr': 0.0001,
 'max_episode_length': 5,
 'max_episodes_per_task': -1,
 'num_history': 3,
 'num_vis_ins_attn_layers': 2,
 'num_workers': 1,
 'quaternion_format': 'xyzw',
 'relative_action': 0,
 'rotation_parametrization': '6D',
 'run_log_dir': 'diffusion_multitask-test',
 'seed': 0,
 'tasks': ('insert_onto_square_peg',),
 'train_iters': 600000,
 'use_instruction': 1,
 'val_freq': 4000,
 'val_iters': -1,
 'valset': PosixPath('data/peract/Peract_packaged/val'),
 'variations': (0,
                1,
                2,
.............
 'wd': 0.005}
----------------------------------------------------------------------------------------------------
Gripper workspace size: [0.6381958  0.86764328 0.29660918]
Logging: train_logs/Actor_18Peract_100Demo_multitask/diffusion_multitask-test
Available devices (CUDA_VISIBLE_DEVICES): None
Device count 1
Created dataset from data/peract/Peract_packaged/train with 40
Created dataset from data/peract/Peract_packaged/val with 40
Model parameters: 3584290
  0%|                                                                                                                     | 0/600000 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "main_trajectory.py", line 452, in <module>
    train_tester.main(collate_fn=traj_collate_fn)
  File "/home/gml/3d_diffuser_actor/engine.py", line 155, in main
    self.train_one_step(model, criterion, optimizer, step_id, sample)
  File "main_trajectory.py", line 185, in train_one_step
    out = model(
  File "/home/gml/anaconda3/envs/3d_diffuser_actor2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/gml/anaconda3/envs/3d_diffuser_actor2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gml/anaconda3/envs/3d_diffuser_actor2/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/gml/anaconda3/envs/3d_diffuser_actor2/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/gml/anaconda3/envs/3d_diffuser_actor2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/gml/anaconda3/envs/3d_diffuser_actor2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gml/3d_diffuser_actor/diffuser_actor/trajectory_optimization/diffuser_actor.py", line 354, in forward
    fixed_inputs = self.encode_inputs(
  File "/home/gml/3d_diffuser_actor/diffuser_actor/trajectory_optimization/diffuser_actor.py", line 108, in encode_inputs
    fps_feats, fps_pos = self.encoder.run_fps(
  File "/home/gml/3d_diffuser_actor/diffuser_actor/utils/encoder.py", line 245, in run_fps
    sampled_inds = dgl_geo.farthest_point_sampler(
  File "/home/gml/anaconda3/envs/3d_diffuser_actor2/lib/python3.8/site-packages/dgl/geometry/fps.py", line 64, in farthest_point_sampler
    _farthest_point_sampler(pos, B, npoints, dist, start_idx, result)
  File "/home/gml/anaconda3/envs/3d_diffuser_actor2/lib/python3.8/site-packages/dgl/geometry/capi.py", line 37, in _farthest_point_sampler
    _CAPI_FarthestPointSampler(
  File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 241, in dgl._ffi._cy3.core.FuncCall
dgl._ffi.base.DGLError: [11:11:36] /opt/dgl/src/geometry/geometry.cc:40: Operator FarthestPointSampler does not support cuda device.
Stack trace:
  [bt] (0) /home/gml/anaconda3/envs/3d_diffuser_actor2/lib/python3.8/site-packages/dgl/libdgl.so(+0x59f831) [0x7fc913cc4831]
  [bt] (1) /home/gml/anaconda3/envs/3d_diffuser_actor2/lib/python3.8/site-packages/dgl/libdgl.so(dgl::geometry::FarthestPointSampler(dgl::runtime::NDArray, long, long, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray)+0x69a) [0x7fc913cc5b1a]
  [bt] (2) /home/gml/anaconda3/envs/3d_diffuser_actor2/lib/python3.8/site-packages/dgl/libdgl.so(+0x5a3226) [0x7fc913cc8226]
  [bt] (3) /home/gml/anaconda3/envs/3d_diffuser_actor2/lib/python3.8/site-packages/dgl/libdgl.so(DGLFuncCall+0x4c) [0x7fc913cdee2c]
  [bt] (4) /home/gml/anaconda3/envs/3d_diffuser_actor2/lib/python3.8/site-packages/dgl/_ffi/_cy3/core.cpython-38-x86_64-linux-gnu.so(+0x1bfc4) [0x7fc913318fc4]
  [bt] (5) /home/gml/anaconda3/envs/3d_diffuser_actor2/lib/python3.8/site-packages/dgl/_ffi/_cy3/core.cpython-38-x86_64-linux-gnu.so(+0x1c32f) [0x7fc91331932f]
  [bt] (6) /home/gml/anaconda3/envs/3d_diffuser_actor2/bin/python3.8(_PyObject_MakeTpCall+0x2b4) [0x55e8d4537194]
  [bt] (7) /home/gml/anaconda3/envs/3d_diffuser_actor2/bin/python3.8(_PyEval_EvalFrameDefault+0x489f) [0x55e8d45331cf]
  [bt] (8) /home/gml/anaconda3/envs/3d_diffuser_actor2/bin/python3.8(_PyFunction_Vectorcall+0xfe) [0x55e8d453e90e]

twke18 commented 7 months ago

Hi,

Did you install a dgl version that supports GPU? We used the following command to install the dgl library, which works fine on our side.

pip install dgl -f https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl

LemonWade commented 7 months ago

Thank you for your response. I have re-downloaded DGL, using the --no-cache-dir --force-reinstall options. Below are the details of the download process and the environment information.

(3d_diffuser_actor2) (base) gml@gml:~/3d_diffuser_actor$ pip install dgl -f https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl --no-cache-dir --force-reinstall
Looking in links: https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl
WARNING: Skipping page https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl because the HEAD request got Content-Type: binary/octet-stream. The only supported Content-Types are application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, and text/html
Collecting dgl
  Downloading dgl-2.1.0-cp38-cp38-manylinux1_x86_64.whl.metadata (581 bytes)
WARNING: Skipping page https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl because the HEAD request got Content-Type: binary/octet-stream. The only supported Content-Types are application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, and text/html
Collecting numpy>=1.14.0 (from dgl)
  Downloading numpy-1.24.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
WARNING: Skipping page https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl because the HEAD request got Content-Type: binary/octet-stream. The only supported Content-Types are application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, and text/html
Collecting scipy>=1.1.0 (from dgl)
  Downloading scipy-1.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (58 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58.9/58.9 kB 376.7 kB/s eta 0:00:00
WARNING: Skipping page https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl because the HEAD request got Content-Type: binary/octet-stream. The only supported Content-Types are application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, and text/html
Collecting networkx>=2.1 (from dgl)
  Downloading networkx-3.1-py3-none-any.whl.metadata (5.3 kB)
WARNING: Skipping page https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl because the HEAD request got Content-Type: binary/octet-stream. The only supported Content-Types are application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, and text/html
Collecting requests>=2.19.0 (from dgl)
  Downloading requests-2.31.0-py3-none-any.whl.metadata (4.6 kB)
WARNING: Skipping page https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl because the HEAD request got Content-Type: binary/octet-stream. The only supported Content-Types are application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, and text/html
Collecting tqdm (from dgl)
  Downloading tqdm-4.66.2-py3-none-any.whl.metadata (57 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.6/57.6 kB 896.1 kB/s eta 0:00:00
WARNING: Skipping page https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl because the HEAD request got Content-Type: binary/octet-stream. The only supported Content-Types are application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, and text/html
Collecting psutil>=5.8.0 (from dgl)
  Downloading psutil-5.9.8-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (21 kB)
WARNING: Skipping page https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl because the HEAD request got Content-Type: binary/octet-stream. The only supported Content-Types are application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, and text/html
Collecting torchdata>=0.5.0 (from dgl)
  Downloading torchdata-0.7.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
WARNING: Skipping page https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl because the HEAD request got Content-Type: binary/octet-stream. The only supported Content-Types are application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, and text/html
Collecting charset-normalizer<4,>=2 (from requests>=2.19.0->dgl)
  Downloading charset_normalizer-3.3.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (33 kB)
WARNING: Skipping page https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl because the HEAD request got Content-Type: binary/octet-stream. The only supported Content-Types are application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, and text/html
Collecting idna<4,>=2.5 (from requests>=2.19.0->dgl)
  Downloading idna-3.6-py3-none-any.whl.metadata (9.9 kB)
WARNING: Skipping page https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl because the HEAD request got Content-Type: binary/octet-stream. The only supported Content-Types are application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, and text/html
Collecting urllib3<3,>=1.21.1 (from requests>=2.19.0->dgl)
  Downloading urllib3-2.2.1-py3-none-any.whl.metadata (6.4 kB)
WARNING: Skipping page https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl because the HEAD request got Content-Type: binary/octet-stream. The only supported Content-Types are application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, and text/html
Collecting certifi>=2017.4.17 (from requests>=2.19.0->dgl)
  Downloading certifi-2024.2.2-py3-none-any.whl.metadata (2.2 kB)
WARNING: Skipping page https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl because the HEAD request got Content-Type: binary/octet-stream. The only supported Content-Types are application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, and text/html
Collecting torch>=2 (from torchdata>=0.5.0->dgl)
  Downloading torch-2.2.1-cp38-cp38-manylinux1_x86_64.whl.metadata (25 kB)
WARNING: Skipping page https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl because the HEAD request got Content-Type: binary/octet-stream. The only supported Content-Types are application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, and text/html
.............
Collecting mpmath>=0.19 (from sympy->torch>=2->torchdata>=0.5.0->dgl)
  Downloading mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Downloading dgl-2.1.0-cp38-cp38-manylinux1_x86_64.whl (8.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.6/8.6 MB 2.5 MB/s eta 0:00:00

Despite following your advice and re-downloading DGL using the --no-cache-dir --force-reinstall options, I'm still facing the same error. Additionally, I tried the following commands as per the suggestions:

# If you have installed dgl-cuXX package, please uninstall it first.
pip install dgl -f https://data.dgl.ai/wheels/cu117/repo.html
pip install dglgo -f https://data.dgl.ai/wheels-test/repo.html

After running these commands and attempting to execute my program again, I encountered a new error: FileNotFoundError: Cannot find DGL C++ graphbolt library at /home/gml/anaconda3/envs/3d_diffuser_actor2/lib/python3.8/site-packages/dgl/graphbolt/libgraphbolt_pytorch_2.2.1.so

My CUDA version is 11.7, and I am using a 3090 GPU. Should I consider downgrading my CUDA version?

I would greatly appreciate any further guidance or suggestions you can offer.

LemonWade commented 7 months ago

Is the issue possibly due to the fact that when I downloaded DGL version 2.1.0, I did not include the cu116 tag?

pip list

dgl                           2.1.0
diffusers                     0.27.2
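
One way to check for the tag without reading the pip list by eye is sketched below. It assumes that CUDA builds of DGL carry a "+cuXXX" local version label, as the wheel name dgl-1.1.3+cu116 above suggests, and that a version without such a label (like the 2.1.0 listed here) is a CPU-only build. The helper names cuda_tag and installed_dgl_cuda_tag are hypothetical and not part of DGL.

```python
import re
from importlib import metadata
from typing import Optional


def cuda_tag(version: str) -> Optional[str]:
    """Return the CUDA label (e.g. 'cu117') found in a version string, or None.

    Assumption: CUDA wheels are versioned like '1.1.3+cu117'.
    """
    match = re.search(r"\+(cu\d+)", version)
    return match.group(1) if match else None


def installed_dgl_cuda_tag() -> Optional[str]:
    """Look up the installed 'dgl' distribution and return its CUDA label.

    Returns None when dgl is not installed or when its version has no label.
    """
    try:
        return cuda_tag(metadata.version("dgl"))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # None here would be consistent with a CPU-only install.
    print("dgl CUDA tag:", installed_dgl_cuda_tag())
```

With the versions seen in this thread, cuda_tag("2.1.0") gives None, while cuda_tag("1.1.3+cu117") gives "cu117".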

twke18 commented 7 months ago

Hi,

Your problem is that the dgl library is not installed for the correct CUDA version. We also didn't test dgl v2, but I believe it should work as well if the API has not changed.

You can find the pip wheels for CUDA 11.7 at https://data.dgl.ai/wheels/cu117/repo.html. For example, try dgl-1.1.3+cu117-cp39-cp39-manylinux1_x86_64.whl if you are using python3.9.
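
The pip log earlier in this thread shows that a direct .whl URL passed to -f was skipped and pip fell back to dgl-2.1.0. A minimal sketch of one alternative to downloading the wheel by hand is to pin the version together with its CUDA label and point -f at the repo.html page. Whether a given version and CUDA combination is actually listed on that page is an assumption here, and the function name pinned_dgl_install_command is hypothetical.

```python
import sys
from typing import List


def pinned_dgl_install_command(dgl_version: str = "1.1.3",
                               cuda_tag: str = "cu117") -> List[str]:
    """Build a pip command that pins dgl to a CUDA-labelled version.

    Assumption: the wheel index for each CUDA version lives at
    https://data.dgl.ai/wheels/<cuda_tag>/repo.html, as for cu117 above.
    """
    index_page = f"https://data.dgl.ai/wheels/{cuda_tag}/repo.html"
    return [
        sys.executable, "-m", "pip", "install",
        f"dgl=={dgl_version}+{cuda_tag}",  # pin, so an untagged release cannot be chosen
        "-f", index_page,                  # an HTML index page, not a single .whl file
    ]


if __name__ == "__main__":
    # Print the command for review; run it manually, or pass the list to
    # subprocess.run(..., check=True) inside the target environment.
    print(" ".join(pinned_dgl_install_command()))
```

The command is only printed, not executed, so it can be checked against the wheel list for the local Python and CUDA versions first.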

LemonWade commented 7 months ago

Thank you for your response. Following your advice, I downloaded the .whl file from the website and installed it via pip. I successfully ran the program. Thank you once again.

nickgkan / 3d_diffuser_actor

Issue Running Single Task "insert_onto_square_peg" Due to Operator FarthestPointSampler Not Supporting CUDA Device #18