taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

np.float32 smallest subnormal set to zero after `ti.init` #8357

Open Datamance opened 1 year ago

Datamance commented 1 year ago

Describe the bug

The smallest subnormal of the numpy.float32 dtype is set to zero after ti.init, causing computations to fail completely. Oddly, the warning disappears and the correct subnormal of 1.4012985e-45 gets printed if you print(np.finfo(np.float32)) before initializing Taichi, but the computation still fails regardless.

EDIT: I see that one issue was how I was initializing the response container, so at least I get results now! I'm just wondering whether the results are inconsistent due to my logic or due to this subnormal issue...
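For anyone hitting this: a quick way to confirm that the whole process has been switched into flush-to-zero mode (rather than numpy misreporting its own limits) is to probe a subnormal result directly. A minimal sketch to run before and after ti.init; the ftz_active helper name is my own:

```python
import numpy as np

def ftz_active() -> bool:
    """Return True if subnormal float32 results are being flushed to zero."""
    tiny = np.finfo(np.float32).smallest_normal
    # Halving the smallest normal float32 should produce a subnormal;
    # under flush-to-zero (FTZ) the hardware rounds the result to 0 instead.
    return bool(np.float32(tiny) / np.float32(2.0) == np.float32(0.0))

# False while subnormals behave normally; True once something enables FTZ.
print(ftz_active())
```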

To Reproduce

import taichi as ti
import numpy as np
from numpy.typing import NDArray

def ti_volume_correlation(kernel_slab: NDArray, stimulus: NDArray):
    """Use Taichi to do a volume convolution.

    This should wrap the process of creating ti.field objects and whatnot.
    """
    # Response container.
    # response_container = ti.field(dtype=ti.f32, shape=stimulus.shape[2] - kernel_size)
    kernel_size = kernel_slab.shape[2]
    batch_count = stimulus.shape[2] - kernel_size
    # Critically, batch_count is the length of the response container.
    response_container = np.ascontiguousarray(np.zeros(batch_count), dtype=np.float32)

    # Should modify in place.
    _gpu_convolve(
        response_container,
        np.ascontiguousarray(kernel_slab, dtype=np.float32),
        np.ascontiguousarray(stimulus, dtype=np.float32),
        batch_count,
    )

    return response_container

@ti.kernel
def _gpu_convolve(
    response_container: ti.types.ndarray(dtype=ti.f32, ndim=1),
    kernel_slab: ti.types.ndarray(dtype=ti.f32, ndim=3),
    stimulus: ti.types.ndarray(dtype=ti.f32, ndim=3),
    batch_count: int,
):
    for x_idx, y_idx, t_idx in kernel_slab:
        for t_offset in range(batch_count):
            product = (
                kernel_slab[x_idx, y_idx, t_idx]
                * stimulus[x_idx, y_idx, t_idx + t_offset]
            )
            response_container[t_offset] += product

def slow_python_volume_correlation(kernel_slab: NDArray, stimulus: NDArray):
    kernel_size = kernel_slab.shape[2]
    response = np.zeros(stimulus.shape[2] - kernel_size)
    # Manual convolution because I'm in a hurry and don't have time to figure out numpy's bullshit
    for idx, start in enumerate(range(stimulus.shape[2] - kernel_size)):
        if idx % 100 == 0:
            print(f"Slab #{idx + 1} started.")
        response[idx] = np.sum(
            kernel_slab * stimulus[:, :, start : start + kernel_size]
        )
    return response

if __name__ == "__main__":
    # print("before TI init")
    # print(np.finfo(np.float32))
    ti.init(arch=ti.metal)
    print("after TI init")
    print(np.finfo(np.float32))

    rng = np.random.default_rng(100)

    kernel_slab = rng.normal(size=(250, 250, 400))
    signal = rng.normal(size=(250, 250, 500))

    response = ti_volume_correlation(kernel_slab, signal)
    print(response.shape)
    print(response.max())
    print(response.min())

    response_2 = slow_python_volume_correlation(kernel_slab, signal)
    print(response_2.shape)
    print(response_2.max())
    print(response_2.min())
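Incidentally, the slow_python_volume_correlation loop in the script above can be vectorized with numpy's sliding_window_view (available in numpy >= 1.20), which gives a much faster CPU reference to compare the Taichi kernel against. A sketch; the function name is mine:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from numpy.typing import NDArray

def numpy_volume_correlation(kernel_slab: NDArray, stimulus: NDArray) -> NDArray:
    kernel_size = kernel_slab.shape[2]
    batch_count = stimulus.shape[2] - kernel_size
    # windows has shape (X, Y, T - kernel_size + 1, kernel_size);
    # keep only the batch_count offsets the loop version iterates over.
    windows = sliding_window_view(stimulus, kernel_size, axis=2)[:, :, :batch_count]
    # response[b] = sum over x, y, t of kernel[x, y, t] * stimulus[x, y, t + b]
    return np.einsum("xyt,xybt->b", kernel_slab, windows)
```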

Log/Screenshots

❯ python prove_taichi_broken.py
[Taichi] version 1.6.0, llvm 15.0.7, commit f1c6fbbd, osx, python 3.11.5
[Taichi] Starting on arch=metal
after TI init
Machine parameters for float32
---------------------------------------------------------------
precision =   6   resolution = 1.0000000e-06
machep =    -23   eps =        1.1920929e-07
negep =     -24   epsneg =     5.9604645e-08
minexp =   -126   tiny =       1.1754944e-38
maxexp =    128   max =        3.4028235e+38
nexp =        8   min =        -max
smallest_normal = 1.1754944e-38   smallest_subnormal = 0.0000000e+00
---------------------------------------------------------------

/Users/ricorodriguez/GradSchool/VHLab/CatDiffusion/env/lib/python3.11/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/Users/ricorodriguez/GradSchool/VHLab/CatDiffusion/env/lib/python3.11/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
(200,)
435.3192
-568.0371
Slab #1 started.
Slab #101 started.
(200,)
12459.768944424823
-15875.401244324563

Additional comments

Output from ti diagnose:

[Taichi] version 1.6.0, llvm 15.0.7, commit f1c6fbbd, osx, python 3.11.5

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://docs.taichi-lang.org/
GitHub: https://github.com/taichi-dev/taichi/
Forum:  https://forum.taichi.graphics/

Taichi system diagnose:

python: 3.11.5 (main, Aug 31 2023, 23:43:57) [Clang 14.0.3 (clang-1403.0.22.14.1)]
system: darwin
executable: /Users/ricorodriguez/GradSchool/VHLab/CatDiffusion/env/bin/python3.11
platform: macOS-14.0-arm64-arm-64bit
architecture: 64bit
uname: uname_result(system='Darwin', node='datamancer.lan', release='23.0.0', version='Darwin Kernel Version 23.0.0: Fri Sep 15 14:41:43 PDT 2023; root:xnu-10002.1.13~1/RELEASE_ARM64_T6000', machine='arm64')
/Users/ricorodriguez/GradSchool/VHLab/CatDiffusion/env/lib/python3.11/site-packages/taichi/tools/diagnose.py:20: DeprecationWarning: Use setlocale(), getencoding() and getlocale() instead
  print(f'locale: {".".join(locale.getdefaultlocale())}')
locale: en_US.UTF-8
PATH: /Users/ricorodriguez/GradSchool/VHLab/CatDiffusion/env/bin:/Users/ricorodriguez/.asdf/shims:/opt/homebrew/opt/asdf/libexec/bin:/usr/local/share/google-cloud-sdk/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/Users/ricorodriguez/bin:/Users/ricorodriguez/.local/bin:/usr/local/sbin:/usr/local/opt/openssl@1.1/bin:/opt/homebrew/opt/sqlite/bin:/Users/ricorodriguez/.poetry/bin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/opt/X11/bin:/Library/Apple/usr/bin:/usr/local/MacGPG2/bin:/Library/TeX/texbin:/Applications/Little Snitch.app/Contents/Components:/usr/local/go/bin:/Users/ricorodriguez/.fig/bin:/Users/ricorodriguez/.local/bin:/opt/homebrew/opt/fzf/bin
PYTHONPATH: ['/Users/ricorodriguez/GradSchool/VHLab/CatDiffusion/env/bin', '/Users/ricorodriguez/.asdf/installs/python/3.11.5/Library/Frameworks/Python.framework/Versions/3.11/lib/python311.zip', '/Users/ricorodriguez/.asdf/installs/python/3.11.5/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11', '/Users/ricorodriguez/.asdf/installs/python/3.11.5/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/lib-dynload', '/Users/ricorodriguez/GradSchool/VHLab/CatDiffusion/env/lib/python3.11/site-packages']

`lsb_release` not available: [Errno 2] No such file or directory: 'lsb_release'

import: <module 'taichi' from '/Users/ricorodriguez/GradSchool/VHLab/CatDiffusion/env/lib/python3.11/site-packages/taichi/__init__.py'>

cc: False
cpu: True
metal: True
opengl: False
cuda: False
vulkan: True

`glewinfo` not available: [Errno 2] No such file or directory: 'glewinfo'

`nvidia-smi` not available: [Errno 2] No such file or directory: 'nvidia-smi'
[Taichi] version 1.6.0, llvm 15.0.7, commit f1c6fbbd, osx, python 3.11.5

[Taichi] version 1.6.0, llvm 15.0.7, commit f1c6fbbd, osx, python 3.11.5
[Taichi] Starting on arch=arm64

[W 10/04/23 14:14:37.549 541207] [misc.py:adaptive_arch_select@753] Arch=[<Arch.opengl: 6>] is not supported, falling back to CPU
[Taichi] version 1.6.0, llvm 15.0.7, commit f1c6fbbd, osx, python 3.11.5
[Taichi] Starting on arch=arm64

[W 10/04/23 14:14:37.807 541268] [misc.py:adaptive_arch_select@753] Arch=[<Arch.cuda: 4>] is not supported, falling back to CPU
[Taichi] version 1.6.0, llvm 15.0.7, commit f1c6fbbd, osx, python 3.11.5
[Taichi] Starting on arch=arm64

[Taichi] version 1.6.0, llvm 15.0.7, commit f1c6fbbd, osx, python 3.11.5

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://docs.taichi-lang.org/
GitHub: https://github.com/taichi-dev/taichi/
Forum:  https://forum.taichi.graphics/

                                   TAICHI EXAMPLES
 ────────────────────────────────────────────────────────────────────────────────────
  0: ad_gravity               25: keyboard                50: pbf2d
  1: circle_packing_image     26: laplace                 51: physarum
  2: comet                    27: laplace_equation        52: poisson_disk_sampling
  3: cornell_box              28: mandelbrot_zoom         53: print_offset
  4: diff_sph                 29: marching_squares        54: rasterizer
  5: euler                    30: mass_spring_3d_ggui     55: regression
  6: eulerfluid2d             31: mass_spring_game        56: sdf_renderer
  7: explicit_activation      32: mass_spring_game_ggui   57: simple_derivative
  8: export_mesh              33: mciso_advanced          58: simple_texture
  9: export_ply               34: mgpcg                   59: simple_uv
  10: export_videos           35: mgpcg_advanced          60: snow_phaseField
  11: fem128                  36: minimal                 61: stable_fluid
  12: fem128_ggui             37: minimization            62: stable_fluid_ggui
  13: fem99                   38: mpm128                  63: stable_fluid_graph
  14: fractal                 39: mpm128_ggui             64: taichi_bitmasked
  15: fractal3d_ggui          40: mpm3d                   65: taichi_dynamic
  16: fullscreen              41: mpm3d_ggui              66: taichi_logo
  17: game_of_life            42: mpm88                   67: taichi_ngp
  18: gui_image_io            43: mpm88_graph             68: taichi_sparse
  19: gui_widgets             44: mpm99                   69: texture_graph
  20: implicit_fem            45: mpm_lagrangian_forces   70: tutorial
  21: implicit_mass_spring    46: nbody                   71: two_stream_instability
  22: initial_value_problem   47: odop_solar              72: vortex_rings
  23: jacobian                48: oit_renderer            73: waterwave
  24: karman_vortex_street    49: patterns
 ────────────────────────────────────────────────────────────────────────────────────
42
Running example minimal ...
[Taichi] Starting on arch=arm64
42.0
>>> Running time: 0.11s

Consider attaching this log when maintainers ask about system information.
>>> Running time: 2.91s
Datamance commented 1 year ago

Ah, never mind - it appears that I have a race condition going on, as indicated by running with the CPU backend.

I'm completely new to GPU programming - what is the recommendation here? Should I use intermediate containers to store the results of a given block? Or is there some other pattern that I should be following?

Also, I'm still curious about the original warning - where could that be coming from?

Datamance commented 1 year ago

Here's what I'm doing now - is this a good approach? It seems to work, although obviously there's some loss of precision with float32.

def ti_volume_convolution(kernel_slab: NDArray, stimulus: NDArray):
    """Use Taichi to do a volume convolution.

    This should wrap the process of creating ti.field objects and whatnot.
    """
    # Response container.
    # response_container = ti.field(dtype=ti.f32, shape=stimulus.shape[2] - kernel_size)
    width, height, kernel_size = kernel_slab.shape
    stimulus_length = stimulus.shape[2]
    batch_count = stimulus_length - kernel_size
    # Critically, batch_count is the length of the response container.
    compute_cells = np.ascontiguousarray(
        np.zeros(shape=(kernel_size, batch_count), dtype=np.float32)
    )
    response_container = np.ascontiguousarray(np.zeros(batch_count), dtype=np.float32)

    # Should modify in place.
    _gpu_convolve(
        response_container,
        compute_cells,
        np.ascontiguousarray(kernel_slab, dtype=np.float32),
        np.ascontiguousarray(stimulus, dtype=np.float32),
        width,
        height,
        kernel_size,
    )

    return response_container

@ti.kernel
def _gpu_convolve(
    response_container: ti.types.ndarray(dtype=ti.f32, ndim=1),
    compute_cells: ti.types.ndarray(dtype=ti.f32, ndim=2),
    kernel_slab: ti.types.ndarray(dtype=ti.f32, ndim=3),
    stimulus: ti.types.ndarray(dtype=ti.f32, ndim=3),
    width: ti.int32,
    height: ti.int32,
    kernel_size: ti.int32,
):
    """Brute force GPU convolution.

    Note that in taichi kernels, only the outermost loops are parallelized.

    Want to parallelize:
    - computation of piecewise products
    - Summation of convolved slabs
    """
    for t_idx, batch_idx in compute_cells:
        for x_idx in range(width):
            for y_idx in range(height):
                compute_cells[t_idx, batch_idx] += (
                    kernel_slab[x_idx, y_idx, t_idx]
                    * stimulus[x_idx, y_idx, t_idx + batch_idx]
                )
    for batch_idx in response_container:
        batch_total = 0.0  # float literal so the accumulator is f32, not an int
        for t_idx in range(kernel_size):
            batch_total += compute_cells[t_idx, batch_idx]
        response_container[batch_idx] = batch_total
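For a CPU-side sanity check, the two-stage reduction in the kernel above can be mirrored in plain numpy (a sketch with names of my choosing; float64 accumulation sidesteps f32 rounding differences when comparing results):

```python
import numpy as np
from numpy.typing import NDArray

def two_stage_reference(kernel_slab: NDArray, stimulus: NDArray) -> NDArray:
    width, height, kernel_size = kernel_slab.shape
    batch_count = stimulus.shape[2] - kernel_size
    compute_cells = np.zeros((kernel_size, batch_count), dtype=np.float64)
    # Stage 1: each (t_idx, batch_idx) cell sums products over the x/y plane,
    # mirroring the parallel outer loop of the Taichi kernel.
    for t_idx in range(kernel_size):
        for batch_idx in range(batch_count):
            compute_cells[t_idx, batch_idx] = np.sum(
                kernel_slab[:, :, t_idx] * stimulus[:, :, t_idx + batch_idx]
            )
    # Stage 2: collapse each column over t to get the per-batch response.
    return compute_cells.sum(axis=0)
```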

jim19930609 commented 1 year ago

Hi Datamance, awesome work! Can you be more specific about the fp32 precision issue? Is it related to overflow or underflow?

Should I use intermediate containers to store the results of a given block? Or is there some other pattern that I should be following?

I think you're doing a good job so far - the goal of Taichi is to make parallel code work efficiently without your having to worry about these details. There are more advanced patterns like ti.simt.SharedArray, but they are non-trivial to use, and I would only recommend them if you're seeking extreme performance.

Datamance commented 1 year ago

@jim19930609 So the exact warning was

UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.

It sounds like it might be the same issue that this person was facing in another context, but I can't be sure. Here's the key part of what they wrote:

Some Googling led me to this issue, which pointed toward some shared library compiled with the gcc/clang option -ffast-math that was being loaded as the culprit. It turns out (somewhat insanely) that when -ffast-math is enabled, the compiler will link in a constructor that sets the FTZ/DAZ flags whenever the library is loaded — even on shared libraries, which means that any application that loads that library will have its floating point behavior changed for the whole process. And -Ofast, which sounds appealingly like a "make my program go fast" flag, automatically enables -ffast-math, so some projects may unwittingly turn it on without realizing the implications.

Does taichi depend on any shared libraries that might do this, or something like it?
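The quoted FTZ/DAZ behavior is easy to illustrate in plain numpy by flushing subnormal results by hand. This emulates the semantics rather than the hardware flag, and the ftz helper is hypothetical:

```python
import numpy as np

def ftz(x):
    """Emulate flush-to-zero: subnormal float32 values become exactly 0."""
    x = np.asarray(x, dtype=np.float32)
    tiny = np.finfo(np.float32).smallest_normal
    return np.where(np.abs(x) < tiny, np.float32(0.0), x)

a = np.float32(3e-39)  # both values are subnormal for float32
b = np.float32(2e-39)
print(a - b)       # gradual underflow preserves the small difference
print(ftz(a - b))  # under FTZ the difference collapses to exactly 0.0
```

This is why code can break so badly in FTZ mode: quantities that are mathematically distinct compare equal, and small intermediate products silently vanish instead of underflowing gradually.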

jim19930609 commented 1 year ago

Interesting... That's correct: Taichi does use fast math by default, and this is likely the reason.

But honestly, that sounds more like an issue with fast math itself.