taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

GGUI error if CUDA_VISIBLE_DEVICES is not 0 #5835

Closed zhenjia-xu closed 2 years ago

zhenjia-xu commented 2 years ago

Describe the bug

I want to use Taichi on a multi-GPU server, but GGUI raises an error if CUDA_VISIBLE_DEVICES is not 0.

To Reproduce

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
import taichi as ti

ti.init(arch=ti.cuda, device_memory_GB=1, packed=True)
window = ti.ui.Window("Test", (960, 960), vsync=True, show_window=False)
canvas = window.get_canvas()
scene = ti.ui.Scene()

pos = ti.Vector.field(3, dtype=ti.f32, shape=(1))
scene.particles(pos, color=(0.5, 0.5, 0.5, 1), radius=0.01)

canvas.scene(scene)

Log/Screenshots

-> % python bug.py
[Taichi] version 1.1.2, llvm 10.0.0, commit f25cf4a2, linux, python 3.7.12
[Taichi] Starting on arch=cuda
[I 08/21/22 22:08:31.428 688283] [vulkan_device_creator.cpp:pick_physical_device@372] Found Vulkan Device 0 (llvmpipe (LLVM 12.0.0, 256 bits))
[I 08/21/22 22:08:31.428 688283] [vulkan_device_creator.cpp:pick_physical_device@372] Found Vulkan Device 1 (NVIDIA GeForce RTX 3090)
[I 08/21/22 22:08:31.428 688283] [vulkan_device_creator.cpp:pick_physical_device@372] Found Vulkan Device 2 (NVIDIA GeForce RTX 3090)
[I 08/21/22 22:08:31.428 688283] [vulkan_device_creator.cpp:pick_physical_device@372] Found Vulkan Device 3 (NVIDIA GeForce RTX 3090)
[I 08/21/22 22:08:31.429 688283] [vulkan_device_creator.cpp:pick_physical_device@372] Found Vulkan Device 4 (NVIDIA GeForce RTX 3090)
[I 08/21/22 22:08:31.429 688283] [vulkan_device_creator.cpp:pick_physical_device@372] Found Vulkan Device 5 (NVIDIA GeForce RTX 3090)
[I 08/21/22 22:08:31.429 688283] [vulkan_device_creator.cpp:pick_physical_device@372] Found Vulkan Device 6 (NVIDIA GeForce RTX 3090)
[I 08/21/22 22:08:31.429 688283] [vulkan_device_creator.cpp:pick_physical_device@372] Found Vulkan Device 7 (NVIDIA GeForce RTX 3090)
[I 08/21/22 22:08:31.429 688283] [vulkan_device_creator.cpp:pick_physical_device@372] Found Vulkan Device 8 (NVIDIA GeForce RTX 3090)
[I 08/21/22 22:08:31.429 688283] [vulkan_device_creator.cpp:create_logical_device@440] Vulkan Device "NVIDIA GeForce RTX 3090" supports Vulkan 0 version 1.3.194
[W 08/21/22 22:08:31.688 688283] [vulkan_device.cpp:buffer@620] Overriding last binding
[E 08/21/22 22:08:31.694 688283] [cuda_driver.h:operator()@87] CUDA Error CUDA_ERROR_INVALID_DEVICE: invalid device ordinal while calling external_memory_get_mapped_buffer (cuExternalMemoryGetMappedBuffer)

Traceback (most recent call last):
  File "bug.py", line 13, in <module>
    canvas.scene(scene)
  File "/local/crv/zhenjia/mambaforge/envs/cut/lib/python3.7/site-packages/taichi/ui/canvas.py", line 126, in scene
    self.canvas.scene(scene.scene)
RuntimeError: [cuda_driver.h:operator()@87] CUDA Error CUDA_ERROR_INVALID_DEVICE: invalid device ordinal while calling external_memory_get_mapped_buffer (cuExternalMemoryGetMappedBuffer)
...

Additional comments

-> % ti diagnose
[Taichi] version 1.1.2, llvm 10.0.0, commit f25cf4a2, linux, python 3.7.12

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://docs.taichi-lang.org/
GitHub: https://github.com/taichi-dev/taichi/
Forum:  https://forum.taichi.graphics/

Taichi system diagnose:

python: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) 
[GCC 9.4.0]
system: linux
executable: /local/crv/zhenjia/mambaforge/envs/cut/bin/python
platform: Linux-5.4.0-105-generic-x86_64-with-debian-bullseye-sid
architecture: 64bit 
uname: uname_result(system='Linux', node='crv03', release='5.4.0-105-generic', version='#119-Ubuntu SMP Mon Mar 7 18:49:24 UTC 2022', machine='x86_64', processor='x86_64')
locale: en_US.UTF-8
PATH: /local/crv/zhenjia/mambaforge/envs/cut/bin:/home/xuzhenjia/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/bin/remote-cli:/local/crv/zhenjia/mambaforge/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin:/local/crv/zhenjia/blender:/home/xuzhenjia/mambaforge/bin:/home/xuzhenjia/bin:/usr/local/cuda/bin:/local/crv/zhenjia/blender:/home/xuzhenjia/mambaforge/bin:/home/xuzhenjia/bin
PYTHONPATH: ['/local/crv/zhenjia/mambaforge/envs/cut/bin', '/local/crv/zhenjia/mambaforge/envs/cut/lib/python37.zip', '/local/crv/zhenjia/mambaforge/envs/cut/lib/python3.7', '/local/crv/zhenjia/mambaforge/envs/cut/lib/python3.7/lib-dynload', '/local/crv/zhenjia/mambaforge/envs/cut/lib/python3.7/site-packages']

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:        20.04
Codename:       focal

import: <module 'taichi' from '/local/crv/zhenjia/mambaforge/envs/cut/lib/python3.7/site-packages/taichi/__init__.py'>

cc: False
cpu: True
metal: False
opengl: True
cuda: True
vulkan: True

`glewinfo` not available: [Errno 2] No such file or directory: 'glewinfo': 'glewinfo'

Sun Aug 21 22:16:30 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:1D:00.0 Off |                  N/A |
| 50%   53C    P2   116W / 350W |   8750MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:1E:00.0  On |                  N/A |
| 53%   60C    P2   194W / 350W |   5855MiB / 24576MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:1F:00.0 Off |                  N/A |
| 30%   46C    P2   107W / 350W |  18200MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:20:00.0 Off |                  N/A |
| 30%   47C    P2   156W / 350W |  18143MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  On   | 00000000:21:00.0 Off |                  N/A |
| 30%   42C    P8    57W / 350W |  18078MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  On   | 00000000:22:00.0 Off |                  N/A |
| 34%   51C    P2   118W / 350W |  22378MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  On   | 00000000:23:00.0 Off |                  N/A |
| 70%   61C    P2   279W / 350W |  23904MiB / 24576MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  On   | 00000000:24:00.0 Off |                  N/A |
| 63%   59C    P2   239W / 350W |  22029MiB / 24576MiB |     22%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    622887      C   python                           3243MiB |
|    0   N/A  N/A    623588      C   ray::SimEnv.step()               1793MiB |
|    0   N/A  N/A    623589      C   ray::SimEnv.step()               1793MiB |
|    0   N/A  N/A    623600      C   ray::SimEnv.step()               1793MiB |
|    0   N/A  N/A   1281583      C                                      83MiB |
|    0   N/A  N/A   3391217      G   /usr/lib/xorg/Xorg                 28MiB |
|    0   N/A  N/A   3391366      G   /usr/bin/gnome-shell               10MiB |
|    1   N/A  N/A    623580    C+G   ray::SimEnv.step()               1834MiB |
|    1   N/A  N/A    623588      G   ray::SimEnv.step()                 41MiB |
|    1   N/A  N/A    623589      G   ray::SimEnv.step()                 41MiB |
|    1   N/A  N/A    623595    C+G   ray::SimEnv.step()               1834MiB |
|    1   N/A  N/A    623596    C+G   ray::SimEnv.step()               1834MiB |
|    1   N/A  N/A    623600      G   ray::SimEnv.step()                 41MiB |
|    1   N/A  N/A    740641      G   /snap/blender/2661/blender         67MiB |
|    1   N/A  N/A   3391217      G   /usr/lib/xorg/Xorg                 67MiB |
|    1   N/A  N/A   3952884      G   /snap/blender/2578/blender         40MiB |
|    2   N/A  N/A    181754      C   python                          18057MiB |
|    2   N/A  N/A   3391217      G   /usr/lib/xorg/Xorg                 15MiB |
|    3   N/A  N/A    128030      C   python                          18057MiB |
|    3   N/A  N/A   3391217      G   /usr/lib/xorg/Xorg                 15MiB |
|    4   N/A  N/A    251179      C   python                          18057MiB |
|    4   N/A  N/A   3391217      G   /usr/lib/xorg/Xorg                 15MiB |
|    5   N/A  N/A    693452      C   python                          20537MiB |
|    5   N/A  N/A   1281583      C                                    1757MiB |
|    5   N/A  N/A   3391217      G   /usr/lib/xorg/Xorg                 15MiB |
|    6   N/A  N/A    145122      C   python                          23815MiB |
|    6   N/A  N/A   3391217      G   /usr/lib/xorg/Xorg                 15MiB |
|    6   N/A  N/A   3895564      G   /snap/blender/2578/blender          4MiB |
|    7   N/A  N/A     85789      C   python                          21989MiB |
|    7   N/A  N/A   3391217      G   /usr/lib/xorg/Xorg                 15MiB |
+-----------------------------------------------------------------------------+

[Taichi] version 1.1.2, llvm 10.0.0, commit f25cf4a2, linux, python 3.7.12

[Taichi] version 1.1.2, llvm 10.0.0, commit f25cf4a2, linux, python 3.7.12
[Taichi] Starting on arch=x64

[Taichi] version 1.1.2, llvm 10.0.0, commit f25cf4a2, linux, python 3.7.12
[Taichi] Starting on arch=opengl

[Taichi] version 1.1.2, llvm 10.0.0, commit f25cf4a2, linux, python 3.7.12
[Taichi] Starting on arch=cuda

[Taichi] version 1.1.2, llvm 10.0.0, commit f25cf4a2, linux, python 3.7.12

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://docs.taichi-lang.org/
GitHub: https://github.com/taichi-dev/taichi/
Forum:  https://forum.taichi.graphics/

                                 TAICHI EXAMPLES                                  
 ──────────────────────────────────────────────────────────────────────────────── 
  0: ad_gravity               22: keyboard                44: patterns            
  1: comet                    23: laplace                 45: pbf2d               
  2: cornell_box              24: mandelbrot_zoom         46: physarum            
  3: diff_sph                 25: marching_squares        47: print_offset        
  4: euler                    26: mass_spring_3d_ggui     48: rasterizer          
  5: explicit_activation      27: mass_spring_game        49: regression          
  6: export_mesh              28: mass_spring_game_ggui   50: sdf_renderer        
  7: export_ply               29: mciso_advanced          51: simple_derivative   
  8: export_videos            30: mgpcg                   52: simple_texture      
  9: fem128                   31: mgpcg_advanced          53: simple_uv           
  10: fem128_ggui             32: minimal                 54: stable_fluid        
  11: fem99                   33: minimization            55: stable_fluid_ggui   
  12: fractal                 34: mpm128                  56: stable_fluid_graph  
  13: fractal3d_ggui          35: mpm128_ggui             57: taichi_bitmasked    
  14: fullscreen              36: mpm3d                   58: taichi_dynamic      
  15: game_of_life            37: mpm3d_ggui              59: taichi_logo         
  16: gui_image_io            38: mpm88                   60: taichi_sparse       
  17: gui_widgets             39: mpm88_graph             61: texture_graph       
  18: implicit_fem            40: mpm99                   62: tutorial            
  19: implicit_mass_spring    41: mpm_lagrangian_forces   63: vortex_rings        
  20: initial_value_problem   42: nbody                   64: waterwave           
  21: jacobian                43: odop_solar                                      
 ──────────────────────────────────────────────────────────────────────────────── 
Running example minimal ...
[Taichi] Starting on arch=x64
42.0
>>> Running time: 0.36s
42

Consider attaching this log when maintainers ask about system information.
>>> Running time: 27.78s

ailzhang commented 2 years ago

Thanks for the bug report! We may be accidentally defaulting to GPU 0 somewhere, and we should get that fixed.

Morcki commented 2 years ago

Add the line os.environ['TI_VISIBLE_DEVICE'] = '1', which makes the Vulkan instance pick device number 1.

Still, I think we should unify the values of TI_VISIBLE_DEVICE and CUDA_VISIBLE_DEVICES when only one of them is set.

See PR #5910.

zhenjia-xu commented 2 years ago

Adding os.environ['TI_VISIBLE_DEVICE'] = '1' still doesn't work.

bobcao3 commented 2 years ago

> Adding os.environ['TI_VISIBLE_DEVICE'] = '1' still doesn't work.

I think TI_VISIBLE_DEVICE needs to be 2 in your case. In any case, GGUI is not really tested on multi-GPU setups, so there may be bugs.

zhenjia-xu commented 2 years ago

Wow, TI_VISIBLE_DEVICE='2' works! The reason is that Vulkan device 0 is the CPU (the llvmpipe entry in the log above), so the GPU indices start from 1.
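
For anyone hitting the same offset, here is a minimal sketch of how the Vulkan index could be derived from the "Found Vulkan Device" lines in the log instead of guessed. The helper names (parse_vulkan_devices, vulkan_index_for_cuda) and the list of software renderer names are hypothetical and not part of Taichi. The sketch also assumes that the hardware GPUs appear in the Vulkan list in the same order as CUDA enumerates them, which may not hold on every machine.

```python
import re

# Matches log lines such as:
#   ... Found Vulkan Device 1 (NVIDIA GeForce RTX 3090)
_DEVICE_LINE = re.compile(r"Found Vulkan Device (\d+) \((.+)\)\s*$")

# Assumption: these substrings identify CPU-based (software) Vulkan devices.
SOFTWARE_RENDERERS = ("llvmpipe", "lavapipe", "swiftshader")


def parse_vulkan_devices(log_text):
    """Return (index, name) pairs parsed from 'Found Vulkan Device' log lines."""
    devices = []
    for line in log_text.splitlines():
        match = _DEVICE_LINE.search(line)
        if match:
            devices.append((int(match.group(1)), match.group(2)))
    return devices


def vulkan_index_for_cuda(cuda_index, devices):
    """Map a CUDA device ordinal to a Vulkan device index.

    Assumption: hardware GPUs are listed by Vulkan in the same order as
    CUDA enumerates them, so only software renderers need to be skipped.
    """
    hardware = [
        index
        for index, name in devices
        if not any(tag in name.lower() for tag in SOFTWARE_RENDERERS)
    ]
    if not 0 <= cuda_index < len(hardware):
        raise IndexError(f"no hardware Vulkan device for CUDA ordinal {cuda_index}")
    return hardware[cuda_index]
```

With the device list from the log in this issue, CUDA ordinal 1 maps to Vulkan index 2, which matches the value that worked. Both environment variables would then be set before import taichi, as the reproduction script already does for CUDA_VISIBLE_DEVICES.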