vincefn / pyvkfft

Python interface to VkFFT
MIT License

Possible context issue using pyvkfft in a multithreaded/multigpu environment #26

Closed kkotyk closed 1 year ago

kkotyk commented 1 year ago

Hey Vince, I'm trying to write an app that delegates work to threads to perform FFTs on different GPUs. Each thread manages a separate GPU and has the basic form:

import cupy as cp
from pyvkfft.fft import fftn

def thread_0():
    cp.cuda.Device(0).use()
    while True:
        get_data....
        gpu_data = cp.array(data)
        fft = fftn(gpu_data)

def thread_1():
    cp.cuda.Device(1).use()
    while True:
        get_data....
        gpu_data = cp.array(data)
        fft = fftn(gpu_data)

def main():
    spawn_threads...
    while True:
        send_data_0_thread0(...)
        send_data_1_thread1(...)

However, pyvkfft is throwing an exception:

Traceback (most recent call last):
  File "/opt/python/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/opt/python/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/leolabs/radar/20230425T163443/leo-radar/radar/sparta/processing/incoherent_worker.py", line 193, in process_thread
    process_dict['results'] = process_incoherent.process_cpi(process_dict['mode'], process_dict['samples'])
  File "/opt/leolabs/radar/20230425T163443/leo-radar/radar/sparta/processing/process_incoherent.py", line 434, in process_cpi
    self._process_range_subset(cpi_data_gpu, tx_pulses, subset_min_range_idx, pulse_group_rising_edge, range_doppler[rstart:rend])
  File "/opt/leolabs/radar/20230425T163443/leo-radar/radar/sparta/processing/process_incoherent.py", line 231, in _process_range_subset
    dm_pulse = dsp.fft_demod_decimate(input_data, self._padded_tx_pulse, rising_edge[pg_idx], min_range_idx,
  File "/opt/leolabs/radar/20230425T163443/leo-radar/radar/sparta/processing/incoherent_dsp_lib.py", line 199, in fft_demod_decimate
    padded_ranges = FFTBackend.fft(padded_ranges, inplace=True, axes=-1)
  File "/opt/leolabs/radar/20230425T163443/leo-radar/radar/sparta/utils/backends.py", line 55, in fft
    return pyvkfft_lib.fftn(input_data, dest=input_data, ndim=1, axes=axes)
  File "/opt/leolabs/radar/20230425T163443/leo-radar/venv3/lib/python3.9/site-packages/pyvkfft/fft.py", line 205, in fftn
    app.fft(src, dest)
  File "/opt/leolabs/radar/20230425T163443/leo-radar/venv3/lib/python3.9/site-packages/pyvkfft/cuda.py", line 208, in fft
    check_vkfft_result(res, src.shape, src.dtype, self.ndim, self.inplace, self.norm, self.r2c,
  File "/opt/leolabs/radar/20230425T163443/leo-radar/venv3/lib/python3.9/site-packages/pyvkfft/base.py", line 425, in check_vkfft_result
    raise RuntimeError("VkFFT error %d: %s %s" % (res, r.name, s))
RuntimeError: VkFFT error 4039: VKFFT_ERROR_FAILED_TO_LAUNCH_KERNEL C2C (10022,525) complex64 1D inplace norm=1 [cuda] cuLaunchKernel error: 400, 1 10022 1 - 38 1 1

From what I can tell, this is an access issue where the code is trying to access data on the wrong GPU. Is this an issue in how pyvkfft handles contexts in a multi-GPU environment, or am I not setting something up correctly for pyvkfft? From my debugging, it looks like all my other cupy code respects the device/stream context. Please let me know if there is any other information I can provide.

vincefn commented 1 year ago

Dear @kkotyk, could you supply a complete, self-contained script that reproduces the issue?

From what I see you are using the simple fftn interface, which caches the FFT plans. I would not be surprised if the caching mechanism does not take the active device into account when switching threads. If you instantiate a cuda.VkFFTApp directly in each thread, I suspect it would work.
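Something along these lines - a rough, untested sketch, so check the VkFFTApp constructor arguments against its docstring; the q work queue and the fixed complex64 shape/dtype are assumptions:

import queue
import cupy as cp
from pyvkfft.cuda import VkFFTApp

q = queue.Queue()

def thread_fn(device_num):
    cp.cuda.Device(device_num).use()
    app = None
    while True:
        gpu_data = cp.array(q.get(), dtype="complex64")
        if app is None:
            # one plan per thread, created after the device is selected,
            # so it is compiled for (and bound to) the right GPU
            app = VkFFTApp(gpu_data.shape, gpu_data.dtype, ndim=1, inplace=True)
        app.fft(gpu_data, gpu_data)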

Alternatively, you can try passing the optional cuda_stream parameter to the fftn function - IIRC the streams should be different for the two devices, so the plan caching should produce separate plans.
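e.g. the same kind of sketch as above, just with a per-thread stream (q again stands for whatever feeds the thread):

import cupy as cp
from pyvkfft.fft import fftn

def thread_fn(device_num):
    cp.cuda.Device(device_num).use()
    stream = cp.cuda.Stream()  # created on, and tied to, this thread's device
    while True:
        gpu_data = cp.array(q.get())
        fftn(gpu_data, dest=gpu_data, cuda_stream=stream)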

(I normally only use multiprocessing - creating the contexts inside each process - to avoid this. Though what you report does look like a bug, if the other cupy functions handle the device correctly.)

kkotyk commented 1 year ago

Here is a minimal example that replicates the issue on 2 different test machines of mine:

import pyvkfft.fft as fft
import cupy as cp
import numpy as np
import threading
import queue
import time

q = queue.Queue()

def thread_fn(device_num):
    cp.cuda.Device(device_num).use()

    while True:
        data = q.get()
        gpu_data = cp.array(data)
        fft_gpu = fft.fftn(gpu_data, dest=gpu_data)
        print(f"processed data on {device_num}")

def main():

    num_devices = cp.cuda.runtime.getDeviceCount()

    threads = []
    for i in range(num_devices):
        thread = threading.Thread(target=thread_fn, args=(i,))
        thread.daemon = True
        thread.start()
        threads.append(thread)

    while True:
        for i in range(num_devices):
            data = np.ones(2**13)
            q.put(data)

        time.sleep(.5)

if __name__ == '__main__':
    main()

I suspect your intuition about the App caching is right - that is likely the problem.

vincefn commented 1 year ago

I can confirm that taking the device into account when caching the plans solves the issue.
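Roughly, the idea is the following (illustrative helper names, not the actual pyvkfft code):

import functools
import cupy as cp
from pyvkfft.cuda import VkFFTApp

@functools.lru_cache(maxsize=32)
def _cached_app(shape, dtype, ndim, inplace, device_id):
    # device_id is part of the cache key, so each GPU gets its own plan
    return VkFFTApp(shape, dtype, ndim=ndim, inplace=inplace)

def get_app(a, ndim=1, inplace=True):
    return _cached_app(a.shape, a.dtype, ndim, inplace, cp.cuda.Device().id)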

Now I just need to finalise the unit tests - it's a bit messy to manipulate GPU contexts through different backends, so I'll probably need to encapsulate everything in separate processes...

kkotyk commented 1 year ago

Thanks for looking at this so quickly. I'm not sure if it helps, but as a user I wouldn't mind if you introduced a prepare_threaded_environment(...) or similar method that could help with some of that messiness, so that you don't have to auto-detect or make assumptions.

vincefn commented 1 year ago

> I'm not sure if it helps, but as a user I wouldn't mind if you introduced a prepare_threaded_environment(...) or similar method that could help with some of that messiness, so that you don't have to auto-detect or make assumptions.

In the case of cupy, this should be easily taken care of using the Device context manager.
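For instance (a sketch, with q standing for whatever feeds each thread):

import cupy as cp
from pyvkfft.fft import fftn

def thread_fn(device_num):
    # the context manager makes device_num current for everything in
    # the block, and restores the previous device on exit
    with cp.cuda.Device(device_num):
        while True:
            gpu_data = cp.array(q.get())
            fftn(gpu_data, dest=gpu_data)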

Beyond that, I don't think I can provide more than examples. Multi-GPU computing can easily be quite complicated.

kkotyk commented 1 year ago

> Multi-GPU computing can easily be quite complicated.

Truth

kkotyk commented 1 year ago

Hey Vince, I wanted to check out and test your fixes in this branch, but I'm getting the following issue when I try to install with pip install .

Failed to build pyvkfft
Installing collected packages: pyvkfft
  Running setup.py install for pyvkfft ... error
  error: subprocess-exited-with-error

  × Running setup.py install for pyvkfft did not run successfully.
  │ exit code: 1
  ╰─> [69 lines of output]
      VKFFT_GIT_TAG in os.environ ? no
      ['pyvkfft-test = pyvkfft.scripts.pyvkfft_test:main', 'pyvkfft-test-suite = pyvkfft.scripts.pyvkfft_test_suite:main', 'pyvkfft-benchmark = pyvkfft.scripts.pyvkfft_benchmark:main']
      running install
      /laptop/dspenv/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
        warnings.warn(
      running build
      running build_py
      creating build/lib.linux-x86_64-cpython-39
      creating build/lib.linux-x86_64-cpython-39/pyvkfft
      copying pyvkfft/config.py -> build/lib.linux-x86_64-cpython-39/pyvkfft
      copying pyvkfft/version.py -> build/lib.linux-x86_64-cpython-39/pyvkfft
      copying pyvkfft/benchmark.py -> build/lib.linux-x86_64-cpython-39/pyvkfft
      copying pyvkfft/__init__.py -> build/lib.linux-x86_64-cpython-39/pyvkfft
      copying pyvkfft/cuda.py -> build/lib.linux-x86_64-cpython-39/pyvkfft
      copying pyvkfft/opencl.py -> build/lib.linux-x86_64-cpython-39/pyvkfft
      copying pyvkfft/fft.py -> build/lib.linux-x86_64-cpython-39/pyvkfft
      copying pyvkfft/accuracy.py -> build/lib.linux-x86_64-cpython-39/pyvkfft
      copying pyvkfft/base.py -> build/lib.linux-x86_64-cpython-39/pyvkfft
      creating build/lib.linux-x86_64-cpython-39/pyvkfft/test
      copying pyvkfft/test/__init__.py -> build/lib.linux-x86_64-cpython-39/pyvkfft/test
      copying pyvkfft/test/test_fft.py -> build/lib.linux-x86_64-cpython-39/pyvkfft/test
      creating build/lib.linux-x86_64-cpython-39/pyvkfft/scripts
      copying pyvkfft/scripts/pyvkfft_test_suite.py -> build/lib.linux-x86_64-cpython-39/pyvkfft/scripts
      copying pyvkfft/scripts/__init__.py -> build/lib.linux-x86_64-cpython-39/pyvkfft/scripts
      copying pyvkfft/scripts/pyvkfft_test.py -> build/lib.linux-x86_64-cpython-39/pyvkfft/scripts
      copying pyvkfft/scripts/pyvkfft_benchmark.py -> build/lib.linux-x86_64-cpython-39/pyvkfft/scripts
      running egg_info
      writing pyvkfft.egg-info/PKG-INFO
      writing dependency_links to pyvkfft.egg-info/dependency_links.txt
      writing entry points to pyvkfft.egg-info/entry_points.txt
      writing requirements to pyvkfft.egg-info/requires.txt
      writing top-level names to pyvkfft.egg-info/top_level.txt
      reading manifest file 'pyvkfft.egg-info/SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      warning: no files found matching 'LICENSE_VkFFT'
      warning: no files found matching 'README_VkFFT.md'
      adding license file 'LICENSE'
      writing manifest file 'pyvkfft.egg-info/SOURCES.txt'
      running build_ext
      building 'pyvkfft._vkfft_cuda' extension
      creating build/temp.linux-x86_64-cpython-39
      creating build/temp.linux-x86_64-cpython-39/src
      /usr/local/cuda/bin/nvcc -I/usr/local/cuda/include -Isrc/vkFFT -Isrc -c src/vkfft_cuda.cu -o build/temp.linux-x86_64-cpython-39/src/vkfft_cuda.o -O3 --ptxas-options=-v -std=c++11 --compiler-options=-fPIC
      src/vkFFT.h(3105): warning #550-D: variable "maxSequenceSharedMemoryPow2" was set but never used
      src/vkFFT.h(13969): warning #68-D: integer conversion resulted in a change of sign
      src/vkFFT.h(15317): warning #68-D: integer conversion resulted in a change of sign
      src/vkfft_cuda.cu(97): error: class "VkFFTConfiguration" has no member "omitDimension"
      src/vkfft_cuda.cu(98): error: class "VkFFTConfiguration" has no member "omitDimension"
      src/vkfft_cuda.cu(99): error: class "VkFFTConfiguration" has no member "omitDimension"
      src/vkfft_cuda.cu(103): error: class "VkFFTConfiguration" has no member "performDCT"
      src/vkfft_cuda.cu(115): error: class "VkFFTConfiguration" has no member "keepShaderCode"
      src/vkfft_cuda.cu(127): error: class "VkFFTConfiguration" has no member "performBandwidthBoost"
      src/vkfft_cuda.cu(138): error: class "VkFFTConfiguration" has no member "groupedBatch"
      src/vkfft_cuda.cu(139): error: class "VkFFTConfiguration" has no member "groupedBatch"
      src/vkfft_cuda.cu(140): error: class "VkFFTConfiguration" has no member "groupedBatch"
      9 errors detected in the compilation of "src/vkfft_cuda.cu".
      error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1
      [end of output]
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> pyvkfft

I am able to install a fresh release version with pip install pyvkfft. Am I missing something with my environment?

vincefn commented 1 year ago

The current git development version has some changes that prepare for a reorganisation of the VkFFT headers (see #25), so I'm assuming this is the issue.

What version of the VkFFT headers are you using? What is in pyvkfft/src/? Right now I've switched to the develop branch of VkFFT, and in pyvkfft/src there is a symbolic link from pyvkfft/src/vkFFT to vkfft/vkFFT. But it should also work if you still have the old single-file VkFFT header.

I think I may use a git submodule to make this simpler.

vincefn commented 1 year ago

Hi @kkotyk, I have just merged a change so that VkFFT is automatically used as a git submodule, which should be easier to install (I suggest re-checking out pyvkfft; otherwise you may have to manually init the VkFFT submodule, e.g. with git submodule update --init).

kkotyk commented 1 year ago

That submodule fix works great - I was easily able to install after that. However, I ran into another issue trying to run the minimal example I posted above (the tracebacks from the two threads are interleaved):

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/laptop/git/leo-radar/radar/sparta/processing/extras/test_multi_gpu.py", line 16, in thread_fn
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib64/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/laptop/git/leo-radar/radar/sparta/processing/extras/test_multi_gpu.py", line 16, in thread_fn
    fft_gpu = fft.fftn(gpu_data, gpu_data)
  File "/laptop/dspenv/lib64/python3.9/site-packages/pyvkfft/fft.py", line 214, in fftn
    fft_gpu = fft.fftn(gpu_data, gpu_data)
  File "/laptop/dspenv/lib64/python3.9/site-packages/pyvkfft/fft.py", line 214, in fftn
    app = _get_fft_app(backend, src.shape, src.dtype, inplace, ndim, axes, norm, cuda_stream, cl_queue, devctx,
TypeError: unhashable type: 'Stream'
    app = _get_fft_app(backend, src.shape, src.dtype, inplace, ndim, axes, norm, cuda_stream, cl_queue, devctx,
TypeError: unhashable type: 'Stream'

My guess is this is a hashability issue raised by the lru_cache you are using here.
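lru_cache builds a dict key from every argument it receives, so a single unhashable argument (here the cupy Stream) raises exactly this error, e.g.:

import functools

@functools.lru_cache()
def f(x):
    return x

f((1, 2))  # fine: tuples are hashable
f([1, 2])  # TypeError: unhashable type: 'list'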

vincefn commented 1 year ago

The minimal example you gave runs fine as far as I can see - I just tested on Linux with cupy-cuda11x 12.0.0 and Python 3.9.

What system are you using?

vincefn commented 1 year ago

@kkotyk I have changed the way the cuda stream is passed as an argument, so the lru_cache should hopefully work for you now. I'm still curious as to why it failed for you and not for me.
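The gist (a simplified sketch, not the actual code): keep the unhashable Stream object out of the cached call and key on hashable values such as its integer handle instead:

import functools

@functools.lru_cache(maxsize=32)
def _cached_app(shape, dtype, ndim, stream_ptr, device_id):
    # only hashable values (tuples, ints, dtypes) reach the cache key
    return (shape, dtype, ndim, stream_ptr, device_id)  # stand-in for building a VkFFTApp

def fft_entry(a, stream=None, device_id=0):
    # cupy streams expose an integer handle (stream.ptr), which is
    # hashable, unlike the Stream object itself
    stream_ptr = 0 if stream is None else stream.ptr
    return _cached_app(a.shape, a.dtype, a.ndim, stream_ptr, device_id)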

kkotyk commented 1 year ago

Hey Vince, your new changes work for me! I'm not sure what the issue was, but everything works now!

edit: I'm using cupy-cuda116==10.5.0 and Python 3.9