taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

Vulkan issue on ubuntu 22.04.1 #5974

Open whorfin opened 2 years ago

whorfin commented 2 years ago

Describe the bug

This started with the issue reported as #3544. The same code, which worked on Vulkan once ti.sync() was added, now fails on the same machine after upgrading from Ubuntu 20.04.1 to 22.04.1, so this may be a Vulkan version or kernel issue. After upgrading Taichi from 1.0.0 to 1.1.2, Taichi itself now aborts. With more complicated code, it seems as if the sync() call always returns instantly.

To Reproduce

#!/usr/bin/python3
import sys

import math

import taichi as ti

import numpy as np

from time import monotonic

#ti.init(arch=ti.opengl) # fine
ti.init(arch=ti.vulkan) # over-schedules kernels unless ti.sync() is done

fieldWidth = 1024
fieldHeight = 688

field_chunk = 32

@ti.func
def samplePeriodic(field: ti.template(), u, v):
    P = ti.Vector([int(u), int(v)])
    shape = ti.Vector(field.shape)
    P = ti.raw_mod(P, shape)
    return field[int(P)]

@ti.kernel
def initialize():
    for x in range(in_field.shape[0]):
        for y in range(in_field.shape[1]):
            in_field[x,y] = ti.sin(x/10 * math.pi) * ti.sin(y/5 * math.pi)

@ti.kernel
def compute_chunked(yi: ti.i32, yn: ti.i32):
    for px, py in in_field:
        F = 0.
        for x in range(out_field.shape[0]):
            for y in range(yi, yn):
                Q = samplePeriodic(in_field, x, y)
                F += Q
        out_field[px, py] += F

in_field = ti.field(ti.f32, shape=(fieldWidth, fieldHeight))
out_field = ti.field(ti.f32, shape=(fieldWidth, fieldHeight))

initialize()
print("Wait...", end="")
sys.stdout.flush()

out_field.fill(0.)
numy = int(ti.ceil(out_field.shape[1]/field_chunk))
last = monotonic()
for i in range(0, numy):
    now = monotonic()
    print("{}/{}[{:#.2f}]...".format(i+1, numy, now - last), end="")
    last = now
    sys.stdout.flush()
    compute_chunked(i*field_chunk, 
            min(out_field.shape[1], (i+1)*field_chunk))
    ti.sync()   # Vulkan fails without this

print()
ti.tools.imshow(out_field.to_numpy())

Log/Screenshots

$ python3  whorfin-test-submit.py
[Taichi] version 1.1.2, llvm 10.0.0, commit f25cf4a2, linux, python 3.10.4
[Taichi] Starting on arch=vulkan
[I 09/04/22 14:26:47.585 10725] [vulkan_device_creator.cpp:pick_physical_device@372] Found Vulkan Device 0 (Intel(R) HD Graphics 620 (KBL GT2))
[I 09/04/22 14:26:47.585 10725] [vulkan_device_creator.cpp:pick_physical_device@372] Found Vulkan Device 1 (llvmpipe (LLVM 13.0.1, 256 bits))
[I 09/04/22 14:26:47.586 10725] [vulkan_device_creator.cpp:create_logical_device@440] Vulkan Device "Intel(R) HD Graphics 620 (KBL GT2)" supports Vulkan 0 version 1.3.204
Wait...1/22[0.00]...2/22[0.72]...[E 09/04/22 14:26:48.370 10725] [vulkan_device.cpp:submit@1697] Vulkan Error : -4 : failed to submit command buffer

Traceback (most recent call last):
  File "/home/whorfin/whorfin art/taichi/electrostatic/./whorfin-test-submit.py", line 61, in <module>
    ti.sync()   # Vulkan fails without this
  File "/usr/local/lib/python3.10/dist-packages/taichi/lang/runtime_ops.py", line 8, in sync
    impl.get_runtime().sync()
  File "/usr/local/lib/python3.10/dist-packages/taichi/lang/impl.py", line 384, in sync
    self.prog.synchronize()
RuntimeError: [vulkan_device.cpp:submit@1697] Vulkan Error : -4 : failed to submit command buffer
[E 09/04/22 14:26:48.435 10725] [vulkan_device.cpp:submit@1697] Vulkan Error : -4 : failed to submit command buffer

[E 09/04/22 14:26:48.436 10725] [vulkan_device.cpp:submit@1697] Vulkan Error : -4 : failed to submit command buffer

terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'
Aborted (core dumped)

Additional comments

$ ti diagnose
[Taichi] version 1.1.2, llvm 10.0.0, commit f25cf4a2, linux, python 3.10.4

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://docs.taichi-lang.org/
GitHub: https://github.com/taichi-dev/taichi/
Forum:  https://forum.taichi.graphics/

Taichi system diagnose:

python: 3.10.4 (main, Jun 29 2022, 12:14:53) [GCC 11.2.0]
system: linux
executable: /usr/bin/python3
platform: Linux-5.15.0-47-generic-x86_64-with-glibc2.35
architecture: 64bit ELF
uname: uname_result(system='Linux', node='shiv', release='5.15.0-47-generic', version='#51-Ubuntu SMP Thu Aug 11 07:51:15 UTC 2022', machine='x86_64')
locale: en_US.UTF-8
PATH: /home/whorfin/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/whorfin/.fzf/bin:/usr/local/texlive/2018/bin/x86_64-linux:/home/whorfin/.cargo/bin
PYTHONPATH: ['/usr/local/bin', '/usr/lib/python310.zip', '/usr/lib/python3.10', '/usr/lib/python3.10/lib-dynload', '/usr/local/lib/python3.10/dist-packages', '/usr/lib/python3/dist-packages']

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.1 LTS
Release:        22.04
Codename:       jammy

import: <module 'taichi' from '/usr/local/lib/python3.10/dist-packages/taichi/__init__.py'>

cc: False
cpu: True
metal: False
opengl: True
cuda: False
vulkan: True

`glewinfo` not available: [Errno 2] No such file or directory: 'glewinfo'

`nvidia-smi` not available: [Errno 2] No such file or directory: 'nvidia-smi'
[Taichi] version 1.1.2, llvm 10.0.0, commit f25cf4a2, linux, python 3.10.4

[Taichi] version 1.1.2, llvm 10.0.0, commit f25cf4a2, linux, python 3.10.4
[Taichi] Starting on arch=x64

[Taichi] version 1.1.2, llvm 10.0.0, commit f25cf4a2, linux, python 3.10.4
[Taichi] Starting on arch=opengl

[W 09/04/22 14:30:06.758 10957] [cuda_driver.cpp:CUDADriver@39] CUDA driver not found.
[W 09/04/22 14:30:06.759 10957] [misc.py:adaptive_arch_select@750] Arch=[<Arch.cuda: 5>] is not supported, falling back to CPU
[Taichi] version 1.1.2, llvm 10.0.0, commit f25cf4a2, linux, python 3.10.4
[Taichi] Starting on arch=x64

[Taichi] version 1.1.2, llvm 10.0.0, commit f25cf4a2, linux, python 3.10.4

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://docs.taichi-lang.org/
GitHub: https://github.com/taichi-dev/taichi/
Forum:  https://forum.taichi.graphics/

                                 TAICHI EXAMPLES
 ────────────────────────────────────────────────────────────────────────────────
  0: ad_gravity               22: keyboard                44: patterns
  1: comet                    23: laplace                 45: pbf2d
  2: cornell_box              24: mandelbrot_zoom         46: physarum
  3: diff_sph                 25: marching_squares        47: print_offset
  4: euler                    26: mass_spring_3d_ggui     48: rasterizer
  5: explicit_activation      27: mass_spring_game        49: regression
  6: export_mesh              28: mass_spring_game_ggui   50: sdf_renderer
  7: export_ply               29: mciso_advanced          51: simple_derivative
  8: export_videos            30: mgpcg                   52: simple_texture
  9: fem128                   31: mgpcg_advanced          53: simple_uv
  10: fem128_ggui             32: minimal                 54: stable_fluid
  11: fem99                   33: minimization            55: stable_fluid_ggui
  12: fractal                 34: mpm128                  56: stable_fluid_graph
  13: fractal3d_ggui          35: mpm128_ggui             57: taichi_bitmasked
  14: fullscreen              36: mpm3d                   58: taichi_dynamic
  15: game_of_life            37: mpm3d_ggui              59: taichi_logo
  16: gui_image_io            38: mpm88                   60: taichi_sparse
  17: gui_widgets             39: mpm88_graph             61: texture_graph
  18: implicit_fem            40: mpm99                   62: tutorial
  19: implicit_mass_spring    41: mpm_lagrangian_forces   63: vortex_rings
  20: initial_value_problem   42: nbody                   64: waterwave
  21: jacobian                43: odop_solar
 ────────────────────────────────────────────────────────────────────────────────
Running example minimal ...
[Taichi] Starting on arch=x64
42.0
>>> Running time: 0.32s
42

Consider attaching this log when maintainers ask about system information.
>>> Running time: 5.93s

neozhaoliang commented 1 year ago

@whorfin I tested your code on Ubuntu 22.04.5 (LTS) with Taichi v1.1.3 and Python 3.8. It runs without the error you posted above, though I only get a blank image. Could you please check that your pip is up to date, update Taichi to v1.1.3, and try again? You may also want to update your system to the point release 22.04.5.

whorfin commented 1 year ago

Thank you for following up. I'm on 22.04.1 (LTS) with Python 3.10. I upgraded Taichi to v1.1.3 (the report was against 1.1.2) and it still fails. I also tried a second machine with a brand-new install of Ubuntu 22.04.1, a brand-new pip and Taichi v1.1.3 install, and a different Intel "GPU", and it fails similarly.

I think this must have something to do with Vulkan on Intel GPUs; it runs fine on my machines with Nvidia GPUs.

The original report is from a machine with "Intel(R) HD Graphics 620 (KBL GT2)". The second reproduction is on a machine with "Intel(R) HD Graphics 4400 (HSW GT2)". Both self-report as "supports Vulkan version 1.3.204".

The release I'm using on these lower-powered laptops has not reached 22.04.5 yet.

Also, in case it is relevant, I see "HUNG GPU" in dmesg. It seems perhaps as if ti.sync() isn't quite working?
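For anyone trying to confirm the same symptom, here is a minimal sketch of filtering kernel log text for GPU hang messages. The regular expression and the sample log lines are assumptions made up for illustration; the exact wording of the message depends on the kernel and driver version.

```python
import re

# Assumed pattern: matches phrases such as "GPU HANG", "GPU hung" or "HUNG GPU",
# case-insensitively. Real driver messages may be worded differently.
HANG_PATTERN = re.compile(r"gpu\s+h[au]ng|h[au]ng\s+gpu", re.IGNORECASE)


def find_gpu_hangs(log_text):
    """Return the lines of log_text that mention a GPU hang."""
    return [line for line in log_text.splitlines() if HANG_PATTERN.search(line)]


# Hypothetical log excerpt, not taken from the machines in this report.
SAMPLE_LOG = """\
[  100.000000] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:00000000
[  100.100000] usb 1-1: new high-speed USB device number 2
[  101.000000] example: HUNG GPU detected
"""

for line in find_gpu_hangs(SAMPLE_LOG):
    print(line)
```

To use it on a real machine, the output of `dmesg` (which may require root) can be read into a string and passed to `find_gpu_hangs`.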

neozhaoliang commented 1 year ago

Oh yes, I used an Nvidia card, not Intel's GT2. I don't have a machine with both Ubuntu 22.04 and an Intel card to test on.

bobcao3 commented 1 year ago

"Hung GPU" means the GPU crashed, which is probably a driver issue. Vulkan on older Intel graphics devices is really spotty.

whorfin commented 1 year ago

It did use to work... As mentioned in #3544, "GPU hung" also meant the GPU simply didn't respond in time, and the "chunking" approach brought the load down to where things worked. Hence the importance of ti.sync(). So I guess I'm still holding out hope that something is up with ti.sync() or kernel scheduling and will get addressed, potentially a bad interaction with the more recent kernel or drivers. Thank you for your assistance.
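To make the "chunking" approach concrete, here is a minimal sketch of the row ranges the reproduction script iterates over. The helper name `chunk_ranges` is hypothetical; in the script above, each (start, stop) pair would be passed to `compute_chunked` and followed by `ti.sync()`.

```python
def chunk_ranges(total, chunk):
    """Yield (start, stop) pairs that cover range(total) in steps of `chunk`."""
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    for start in range(0, total, chunk):
        # The last chunk is clamped so it never runs past `total`.
        yield start, min(total, start + chunk)


# Values from the reproduction script: fieldHeight = 688, field_chunk = 32.
ranges = list(chunk_ranges(688, 32))
print(len(ranges), ranges[0], ranges[-1])  # 22 (0, 32) (672, 688)
```

The 22 ranges match the "1/22 ... 2/22" progress output in the log; the final chunk covers only 16 rows.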

neozhaoliang commented 1 year ago

I tested your code on another machine with Ubuntu 22.04.5 + Python 3.10 + Intel GT2 and still cannot reproduce the problem.

The code runs both with and without ti.sync().

The info says "Vulkan Device "Intel(R) Xe Graphics (TGL GT2)" supports Vulkan 0 version 1.3.204".

whorfin commented 1 year ago

Thank you! Running fine without ti.sync() might be a clue that the generation (TGL vs. KBL and HSW) is modern enough to run OK without chunking. Turning up fieldWidth and fieldHeight might tip it over to being necessary again. This is all encouraging, but maybe it points to 22.04.1 vs. 22.04.5 driver differences? 20.04.1 worked fine on this same hardware, though.
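As a rough guide to how quickly the load grows with field size, here is a sketch that counts inner-loop iterations per `compute_chunked` launch in the reproduction script. The function names and the idea of a fixed per-launch "budget" are assumptions for illustration; actual GPU time also depends on the hardware and driver.

```python
def samples_per_launch(width, height, rows_per_chunk):
    """Inner-loop iterations for one compute_chunked launch in the repro script."""
    # One work item per (px, py), each looping over `width` columns
    # and `rows_per_chunk` rows of the input field.
    return width * height * width * rows_per_chunk


def rows_for_budget(width, height, budget):
    """Largest chunk height whose launch stays within `budget` iterations (at least 1)."""
    return max(1, budget // (width * height * width))


baseline = samples_per_launch(1024, 688, 32)
print(baseline)  # 23085449216

# Doubling fieldWidth quadruples the work per launch, so the chunk height
# would have to shrink from 32 to 8 rows to keep the same load per launch.
print(rows_for_budget(2048, 688, baseline))  # 8
```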

bobcao3 commented 1 year ago

KBL: Kaby Lake, Gen 9.5 graphics
TGL: Tiger Lake, Gen 12.1 graphics (Intel Xe)

This is a big difference! These two are very different GPUs.

In addition, the Haswell device (Gen 7) has now been split from the ANV driver into its own driver. Even before the split, it was very different.
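When comparing reports like the ones in this thread, it can help to pull the codename out of the device string Taichi logs. The sketch below assumes the name ends in a "(CODENAME GTn)" suffix, as in the strings quoted above; the lookup table is hypothetical and only covers the generations named in this thread.

```python
import re

# Limited to the codenames and generations mentioned in this thread.
GENERATIONS = {
    "HSW": "Haswell, Gen 7",
    "KBL": "Kaby Lake, Gen 9.5",
    "TGL": "Tiger Lake, Gen 12.1 (Intel Xe)",
}


def intel_generation(device_name):
    """Return the generation for a device string, or None if it is not recognized."""
    # Assumed format: "... (KBL GT2)"; other vendors and drivers will not match.
    match = re.search(r"\((\w+) GT\d+\)", device_name)
    if match is None:
        return None
    return GENERATIONS.get(match.group(1))


for name in (
    "Intel(R) HD Graphics 620 (KBL GT2)",
    "Intel(R) HD Graphics 4400 (HSW GT2)",
    "Intel(R) Xe Graphics (TGL GT2)",
    "llvmpipe (LLVM 13.0.1, 256 bits)",
):
    print(name, "->", intel_generation(name))
```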