Datamance opened this issue 1 year ago
Ah, never mind - it appears I have a race condition going on, as indicated by running with the CPU backend.
I'm completely new to GPU programming - what is the recommendation here? Should I use intermediate containers to store the results of a given block? Or is there some other pattern that I should be following?
Also, I'm still curious about the original warning - where could that be coming from?
Here's what I'm doing now - is this a good approach? It seems to work, although obviously there's some loss of precision with float32.
import numpy as np
import taichi as ti
from numpy.typing import NDArray


def ti_volume_convolution(kernel_slab: NDArray, stimulus: NDArray):
    """Use Taichi to do a volume convolution.

    This should wrap the process of creating ti.field objects and whatnot.
    """
    # Response container.
    # response_container = ti.field(dtype=ti.f32, shape=stimulus.shape[2] - kernel_size)
    width, height, kernel_size = kernel_slab.shape
    stimulus_length = stimulus.shape[2]
    batch_count = stimulus_length - kernel_size
    # Critically, batch_count is the length of the response container.
    compute_cells = np.ascontiguousarray(
        np.zeros(shape=(kernel_size, batch_count), dtype=np.float32)
    )
    response_container = np.ascontiguousarray(
        np.zeros(batch_count), dtype=np.float32
    )
    # Should modify in place.
    _gpu_convolve(
        response_container,
        compute_cells,
        np.ascontiguousarray(kernel_slab, dtype=np.float32),
        np.ascontiguousarray(stimulus, dtype=np.float32),
        width,
        height,
        kernel_size,
    )
    return response_container
@ti.kernel
def _gpu_convolve(
    response_container: ti.types.ndarray(dtype=ti.f32, ndim=1),
    compute_cells: ti.types.ndarray(dtype=ti.f32, ndim=2),
    kernel_slab: ti.types.ndarray(dtype=ti.f32, ndim=3),
    stimulus: ti.types.ndarray(dtype=ti.f32, ndim=3),
    width: ti.int32,
    height: ti.int32,
    kernel_size: ti.int32,
):
    """Brute-force GPU convolution.

    Note that in Taichi kernels, only the outermost loops are parallelized.

    Want to parallelize:
    - computation of piecewise products
    - summation of convolved slabs
    """
    # Each (t_idx, batch_idx) cell is written by exactly one parallel
    # iteration, so this loop is race-free.
    for t_idx, batch_idx in compute_cells:
        for x_idx in range(width):
            for y_idx in range(height):
                compute_cells[t_idx, batch_idx] += (
                    kernel_slab[x_idx, y_idx, t_idx]
                    * stimulus[x_idx, y_idx, t_idx + batch_idx]
                )
    for batch_idx in response_container:
        # Initialize with a float literal so the accumulator is f32, not i32.
        batch_total = 0.0
        for t_idx in range(kernel_size):
            batch_total += compute_cells[t_idx, batch_idx]
        response_container[batch_idx] = batch_total
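One way to sanity-check both the race-condition fix and the float32 precision loss is to compare the Taichi result against a straightforward float64 reference in plain NumPy on small random inputs. This is just a sketch: the numpy_reference helper and the array sizes are made up for illustration, and it assumes the two functions above are in scope and that ti.init has already been called.

import numpy as np

def numpy_reference(kernel_slab, stimulus):
    # Same indexing as the Taichi kernel, but serial and accumulated in float64.
    width, height, kernel_size = kernel_slab.shape
    batch_count = stimulus.shape[2] - kernel_size
    out = np.zeros(batch_count, dtype=np.float64)
    for b in range(batch_count):
        out[b] = np.sum(
            kernel_slab.astype(np.float64)
            * stimulus[:, :, b : b + kernel_size].astype(np.float64)
        )
    return out

rng = np.random.default_rng(0)
kernel_slab = rng.random((8, 8, 16)).astype(np.float32)
stimulus = rng.random((8, 8, 256)).astype(np.float32)

gpu_result = ti_volume_convolution(kernel_slab, stimulus)
ref_result = numpy_reference(kernel_slab, stimulus)

# With float32 accumulation a small relative error is expected; a large
# discrepancy would point to a logic or race problem rather than precision.
print(np.max(np.abs(gpu_result - ref_result) / np.abs(ref_result)))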
Hi Datamance, awesome work! Can you be more specific about the fp32 precision issue? Is it related to overflow or underflow?
Should I use intermediate containers to store the results of a given block? Or is there some other pattern that I should be following?
I think you're doing a good job so far - the goal of Taichi is to make parallel code work efficiently without you having to worry about these things. There are some more advanced patterns like ti.simt.SharedArray, but it's fairly trivial to use and I would only recommend it if you're after extreme performance.
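For anyone curious, here is a rough idea of the shared-memory pattern. This is only a hedged sketch: it assumes the GPU (CUDA or Vulkan) backend and the ti.simt.block.SharedArray / ti.simt.block.sync API (the shared array lives under ti.simt.block in recent releases), the sizes are arbitrary, and N is assumed to be a multiple of the block size.

import taichi as ti

ti.init(arch=ti.gpu)

N = 1024
BLOCK = 128
x = ti.field(ti.f32, shape=N)
total = ti.field(ti.f32, shape=())

@ti.kernel
def block_sum():
    ti.loop_config(block_dim=BLOCK)
    for i in range(N):
        tid = i % BLOCK
        # Scratch space shared by all threads in the same block.
        pad = ti.simt.block.SharedArray((BLOCK,), ti.f32)
        pad[tid] = x[i]
        ti.simt.block.sync()
        # Let one thread per block reduce the staged values.
        if tid == 0:
            s = 0.0
            for j in range(BLOCK):
                s += pad[j]
            total[None] += s  # += on a field is atomic inside a parallel loop

The brute-force kernel above works fine without any of this; shared memory only starts to matter when the same global values are re-read many times per block.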
@jim19930609 So the exact warning was:
UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
It sounds like it might be the same issue that this person was facing in another context, but I can't be sure. Here's the key part of what they wrote:
Some Googling led me to this issue, which pointed toward some shared library compiled with the gcc/clang option -ffast-math that was being loaded as the culprit. It turns out (somewhat insanely) that when -ffast-math is enabled, the compiler will link in a constructor that sets the FTZ/DAZ flags whenever the library is loaded — even on shared libraries, which means that any application that loads that library will have its floating point behavior changed for the whole process. And -Ofast, which sounds appealingly like a "make my program go fast" flag, automatically enables -ffast-math, so some projects may unwittingly turn it on without realizing the implications.
Does taichi depend on any shared libraries that might do this, or something like it?
Interesting... That's correct - Taichi does use fast math by default, and this is likely the reason. But honestly, that sounds more like an issue with fast math itself.
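If it helps to confirm, the process-wide flush-to-zero behaviour is easy to observe from plain NumPy. This is only an illustrative check, not taken from the original report: it multiplies two float32 values whose product should be a subnormal and looks at whether that product flushes to zero once Taichi has been initialized; the backend choice here is arbitrary.

import numpy as np

def subnormals_alive() -> bool:
    # 1e-20 * 1e-20 = 1e-40, which is subnormal in float32.
    # Under FTZ/DAZ the product flushes to exactly 0.0.
    return float(np.float32(1e-20) * np.float32(1e-20)) != 0.0

print("before ti.init:", subnormals_alive())  # expected: True

import taichi as ti
ti.init(arch=ti.gpu)

print("after ti.init:", subnormals_alive())   # False if FTZ/DAZ were enabled process-wide

For what it's worth, ti.init also accepts a fast_math option, but since the FTZ/DAZ flags described above are set by a constructor when the shared library is loaded, I would not assume that flag alone restores subnormal behaviour on the host side.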
Describe the bug
The smallest subnormal of the numpy.float32 dtype is set to zero after ti.init, causing computations to fail completely. Oddly, the warning disappears and the correct subnormal of 1.4012985e-45 gets printed if you print(np.finfo(np.float32)) before initializing taichi, but the computation still fails regardless.

EDIT: I see that one issue was how I was initializing the response container - so at least I get results now! I'm just wondering now if the results are incongruent due to my logic or this subnormal issue...
To Reproduce
Log/Screenshots
Additional comments
Output from ti diagnose: