taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

Crash in OpenGL with from_numpy #6922

Open whorfin opened 1 year ago

whorfin commented 1 year ago

Describe the bug

from_numpy throws a GL_INVALID_OPERATION error and core dumps with the OpenGL backend for fields of certain sizes. Fields of the same size work fine with Vulkan. This is on an Intel KBL GT2 (i915 driver).

To Reproduce

#!/usr/bin/python3
import sys
import taichi as ti
import numpy as np

#ti.init(arch=ti.vulkan)    # works
ti.init(arch=ti.opengl)     # fails

print("[ ] allocating",end="")
sys.stdout.flush()
target_np = np.full((4096, 4096, 3), .5)    # fails w/ opengl
#target_np = np.full((2048, 2048, 3), .5)    # works w/ opengl
target = ti.Vector.field(3, ti.f32, shape=(target_np.shape[0], target_np.shape[1]))
print("\r[+")
sys.stdout.flush()

print("[ ] numpy assign astype",end="")
sys.stdout.flush()
target_np = target_np.astype(np.float32)
print("\r[+")
sys.stdout.flush()

print("[ ] taichi from_numpy",end="")
sys.stdout.flush()
target.from_numpy(target_np)    # this is what fails
print("\r[+")
sys.stdout.flush()

Log/Screenshots

With the smaller allocation or the Vulkan backend, as indicated in the comments, everything works fine. As submitted above, it crashes:

$ python3 whorfin-testogl-submit.py
[Taichi] version 1.3.0, llvm 15.0.4, commit 0f25b95e, linux, python 3.10.6
[Taichi] Starting on arch=opengl
[+] allocating
[+] numpy assign astype
[ ] taichi from_numpy[E 12/18/22 12:55:57.432 73408] [opengl_device.cpp:check_opengl_error@181] glDispatchCompute: GL_INVALID_VALUE

Traceback (most recent call last):
  File "/home/whorfin/whorfin art/taichi/electrostatic/whorfin-testogl-submit.py", line 25, in <module>
    target.from_numpy(target_np)    # this is what fails
  File "/usr/local/lib/python3.10/dist-packages/taichi/lang/util.py", line 298, in wrapped
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/taichi/lang/matrix.py", line 1666, in from_numpy
    self._from_external_arr(arr)
  File "/usr/local/lib/python3.10/dist-packages/taichi/lang/util.py", line 298, in wrapped
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/taichi/lang/matrix.py", line 1650, in _from_external_arr
    ext_arr_to_matrix(arr, self, as_vector)
  File "/usr/local/lib/python3.10/dist-packages/taichi/lang/kernel_impl.py", line 945, in wrapped
    return primal(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/taichi/lang/kernel_impl.py", line 872, in __call__
    return self.runtime.compiled_functions[key](*args)
  File "/usr/local/lib/python3.10/dist-packages/taichi/lang/kernel_impl.py", line 797, in func__
    raise e from None
  File "/usr/local/lib/python3.10/dist-packages/taichi/lang/kernel_impl.py", line 794, in func__
    t_kernel(launch_ctx)
RuntimeError: [opengl_device.cpp:check_opengl_error@181] glDispatchCompute: GL_INVALID_VALUE
[E 12/18/22 12:55:57.515 73408] [opengl_device.cpp:check_opengl_error@181] glBindBufferBase: GL_INVALID_OPERATION

[E 12/18/22 12:55:57.516 73408] [opengl_device.cpp:check_opengl_error@181] glBindBufferBase: GL_INVALID_OPERATION

terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'
Aborted (core dumped)

Additional comments

I don't know what's going on here. It seems to be a bind failure; is it possible that the initial allocation failed but was not checked? An error message closer to the actual error would be excellent if possible, i.e. if the failure is in the initial field allocation.
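One thing worth checking: the first error in the log comes from glDispatchCompute (GL_INVALID_VALUE), before either glBindBufferBase error. OpenGL raises GL_INVALID_VALUE from glDispatchCompute when a work group count goes beyond the driver's GL_MAX_COMPUTE_WORK_GROUP_COUNT, whose spec-guaranteed minimum is 65535 per dimension. The sketch below is only back-of-the-envelope arithmetic under assumptions that nothing in this thread confirms: a 1D dispatch with one invocation per field element, a hypothetical work group size of 128, and a driver that reports the minimum limit. Under those assumptions the 2048x2048 field needs 32768 groups (48 MiB buffer) and the 4096x4096 field needs 131072 groups (192 MiB buffer), which would match the observed pass/fail split.

```python
import math

# Assumptions (not confirmed anywhere in this thread):
#  - the copy kernel is dispatched as a 1D grid, one invocation per field element
#  - GROUP_SIZE = 128 invocations per work group (hypothetical value)
#  - the driver reports the spec-minimum limit of 65535 work groups per dimension
GROUP_SIZE = 128
MAX_GROUPS_PER_DIM = 65535


def groups_needed(shape, group_size=GROUP_SIZE):
    """Work groups required to cover every element of a field of `shape`."""
    return math.ceil(math.prod(shape) / group_size)


def buffer_bytes(shape, components=3, bytes_per_component=4):
    """Size in bytes of a Vector.field(components, f32) with the given shape."""
    return math.prod(shape) * components * bytes_per_component


for shape in [(2048, 2048), (4096, 4096)]:
    n = groups_needed(shape)
    verdict = "within" if n <= MAX_GROUPS_PER_DIM else "exceeds"
    print(shape, buffer_bytes(shape) // 2**20, "MiB,", n, "groups,", verdict, "assumed limit")
```

If this guess is right, the allocation itself may be fine and the bind errors may just be fallout from the failed dispatch. Confirming it would require checking the group size Taichi actually uses and the limit the i915 driver actually reports.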

FWIW:

$ glxinfo -l | grep GL_MAX_TEXTURE_SIZE
    GL_MAX_TEXTURE_SIZE = 16384
    GL_MAX_TEXTURE_SIZE = 16384
$ ti diagnose
[Taichi] version 1.3.0, llvm 15.0.4, commit 0f25b95e, linux, python 3.10.6

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://docs.taichi-lang.org/
GitHub: https://github.com/taichi-dev/taichi/
Forum:  https://forum.taichi.graphics/

Taichi system diagnose:

python: 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0]
system: linux
executable: /usr/bin/python3
platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
architecture: 64bit ELF
uname: uname_result(system='Linux', node='shiv', release='5.15.0-56-generic', version='#62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022', machine='x86_64')
locale: en_US.UTF-8
PATH: /home/whorfin/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PYTHONPATH: ['/usr/local/bin', '/usr/lib/python310.zip', '/usr/lib/python3.10', '/usr/lib/python3.10/lib-dynload', '/usr/local/lib/python3.10/dist-packages', '/usr/lib/python3/dist-packages']

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.1 LTS
Release:    22.04
Codename:   jammy

import: <module 'taichi' from '/usr/local/lib/python3.10/dist-packages/taichi/__init__.py'>

cc: False
cpu: True
metal: False
opengl: True
cuda: False
vulkan: True

`glewinfo` not available: [Errno 2] No such file or directory: 'glewinfo'

`nvidia-smi` not available: [Errno 2] No such file or directory: 'nvidia-smi'
[Taichi] version 1.3.0, llvm 15.0.4, commit 0f25b95e, linux, python 3.10.6

[Taichi] version 1.3.0, llvm 15.0.4, commit 0f25b95e, linux, python 3.10.6
[Taichi] Starting on arch=x64

[Taichi] version 1.3.0, llvm 15.0.4, commit 0f25b95e, linux, python 3.10.6
[Taichi] Starting on arch=opengl

[W 12/18/22 12:50:17.333 73068] [cuda_driver.cpp:load_lib@36] libcuda.so lib not found.
[W 12/18/22 12:50:17.333 73068] [misc.py:adaptive_arch_select@766] Arch=[<Arch.cuda: 5>] is not supported, falling back to CPU
[Taichi] version 1.3.0, llvm 15.0.4, commit 0f25b95e, linux, python 3.10.6
[Taichi] Starting on arch=x64

[Taichi] version 1.3.0, llvm 15.0.4, commit 0f25b95e, linux, python 3.10.6

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://docs.taichi-lang.org/
GitHub: https://github.com/taichi-dev/taichi/
Forum:  https://forum.taichi.graphics/

                                   TAICHI EXAMPLES                                    
 ──────────────────────────────────────────────────────────────────────────────────── 
  0: ad_gravity               24: laplace                 48: physarum                
  1: comet                    25: laplace_equation        49: print_offset            
  2: cornell_box              26: mandelbrot_zoom         50: rasterizer              
  3: diff_sph                 27: marching_squares        51: regression              
  4: euler                    28: mass_spring_3d_ggui     52: sdf_renderer            
  5: explicit_activation      29: mass_spring_game        53: simple_derivative       
  6: export_mesh              30: mass_spring_game_ggui   54: simple_texture          
  7: export_ply               31: mciso_advanced          55: simple_uv               
  8: export_videos            32: mgpcg                   56: snow_phaseField         
  9: fem128                   33: mgpcg_advanced          57: stable_fluid            
  10: fem128_ggui             34: minimal                 58: stable_fluid_ggui       
  11: fem99                   35: minimization            59: stable_fluid_graph      
  12: fractal                 36: mpm128                  60: taichi_bitmasked        
  13: fractal3d_ggui          37: mpm128_ggui             61: taichi_dynamic          
  14: fullscreen              38: mpm3d                   62: taichi_logo             
  15: game_of_life            39: mpm3d_ggui              63: taichi_ngp              
  16: gui_image_io            40: mpm88                   64: taichi_sparse           
  17: gui_widgets             41: mpm88_graph             65: texture_graph           
  18: implicit_fem            42: mpm99                   66: tutorial                
  19: implicit_mass_spring    43: mpm_lagrangian_forces   67: two_stream_instability  
  20: initial_value_problem   44: nbody                   68: vortex_rings            
  21: jacobian                45: odop_solar              69: waterwave               
  22: karman_vortex_street    46: patterns                                            
  23: keyboard                47: pbf2d                                               
 ──────────────────────────────────────────────────────────────────────────────────── 
42
Running example minimal ...
[Taichi] Starting on arch=x64
42.0
>>> Running time: 0.42s

Consider attaching this log when maintainers ask about system information.
>>> Running time: 7.09s

erizmr commented 1 year ago

Hi @whorfin, sorry for the late reply. I could not reproduce the issue on my own machine. Could you please provide more information about your hardware? Thanks.


whorfin commented 1 year ago

System:
  Host: hostname Kernel: 5.15.0-56-generic x86_64 bits: 64 Desktop: LXQt 0.17.1
  Distro: Ubuntu 22.04.1 LTS (Jammy Jellyfish)
Machine:
  Type: Laptop System: Razer product: Blade Stealth v: 2.04
    serial: <superuser required>
  Mobo: Razer model: Razer serial: <superuser required> UEFI: Razer v: 8.02
    date: 02/22/2018
CPU:
  Info: dual core model: Intel Core i7-7500U bits: 64 type: MT MCP cache:
    L2: 512 KiB
  Speed (MHz): avg: 1083 min/max: 400/3500 cores: 1: 700 2: 700 3: 1023
    4: 1911
Graphics:
  Device-1: Intel HD Graphics 620 driver: i915 v: kernel

FWIW, my hosts that have NVIDIA GPUs have not shown this particular behavior.

If you change the "4096, 4096" allocation to 8K, could you check whether it reproduces?

erizmr commented 1 year ago

I tried with 8192, 8192 but still failed to repro on my host with an NVIDIA GPU. I am not sure whether it is an Intel HD Graphics 620 specific problem.

whorfin commented 1 year ago

I was not able to repro on NVIDIA either.
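Since the failure only shows up on the Intel host and only above a certain field size, one possible stopgap is to copy the array in row ranges small enough that no single kernel launch covers too many elements. The helper below is a hypothetical planning function, not part of Taichi. It assumes, without confirmation from this thread, that each launch uses one invocation per element, a work group size of 128, and a per-dimension limit of 65535 work groups.

```python
def row_chunks(n_rows, n_cols, group_size=128, max_groups=65535):
    """Yield (start, stop) row ranges whose element count fits in one dispatch.

    group_size and max_groups are assumed values, not queried from the driver.
    """
    max_elems = group_size * max_groups
    rows_per_chunk = max(1, max_elems // n_cols)
    for start in range(0, n_rows, rows_per_chunk):
        yield start, min(start + rows_per_chunk, n_rows)


# Plan the copy for the failing 4096x4096 case from the report.
chunks = list(row_chunks(4096, 4096))
print(chunks)
```

Each (start, stop) pair could then drive a user-written kernel that copies target_np[start:stop] into the matching rows of the field. Whether that actually avoids the error on this driver is untested.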