silx-kit / pyFAI

Fast Azimuthal Integration in Python

GPU Acceleration on Newest Version Not Working #2339

Open DarioLewczyk opened 1 day ago

DarioLewczyk commented 1 day ago

I have been trying to update a piece of code I wrote, which runs on an M1 Max MacBook Pro with 64 GB of RAM, so that it makes use of the GPU. In a previous version of pyFAI, this was easily accomplished with the method ('full', 'csr', 'opencl', 'gpu'). Now that method does not work. I went through the tutorial in the pyFAI documentation (https://pyfai.readthedocs.io/en/stable/usage/tutorial/Parallelization/GPU-decompression.html) and the issue remains.

import sys, os, collections, struct, time
import numpy, pyFAI
import h5py, hdf5plugin
from matplotlib.pyplot import subplots
import bitshuffle
import pyopencl.array as cla
import silx
from silx.opencl import ocl
from silx.opencl.codec.bitshuffle_lz4 import BitshuffleLz4
start_time = time.time()
ocl
# Here we want to select the open CL device. 
target = (0,1)
det = pyFAI.detector_factory("eiger_4M")
shape = det.shape
dtype = numpy.dtype("uint32")
filename = "/tmp/big.h5"
nbins = 1000
cmp = hdf5plugin.Bitshuffle()
hdf5plugin.get_config().build_config
mem_bytes = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')
print(f"Number of frames the computer can host in memory: {mem_bytes/(numpy.prod(shape)*dtype.itemsize):.3f}")
if os.environ.get('SLURM_MEM_PER_NODE'):
    print(f"Number of frames the computer can host in memory with SLURM restrictions: {int(os.environ['SLURM_MEM_PER_NODE'])*(1<<20)/(numpy.prod(shape)*dtype.itemsize):.3f}")
#The computer being limited to 64G of RAM, the number of frames actually possible is 3800.
nbframes = 4096 # slightly larger than the maximum achievable ! Such a dataset should not host in memory.

#Prepare a frame with little count so that it compresses well
geo = {"detector": det,
       "wavelength": 1e-10,
       "rot3":0} #work around a bug https://github.com/silx-kit/pyFAI/pull/1749
ai = pyFAI.load(geo)
omega = ai.solidAngleArray()
q = numpy.arange(15)
img = ai.calcfrom1d(q, 100/(1+q*q))
frame = numpy.random.poisson(img).astype(dtype)

# display the image
fig,ax = subplots()
ax.imshow(frame)
print("Performances of the different algorithms for azimuthal integration of Eiger 4M image on the CPU")
for algo in ("histogram", "csc", "csr"):
    print(f"Using algorithm {algo:10s}:", end=" ")
    %timeit ai.integrate1d(img, nbins, method=("full", algo, "cython"))
print("Performances of the different algorithms for azimuthal integration of Eiger 4M image on the GPU")
print(f"Using algorithm {algo:10s}:", end=" ")
%timeit ai.integrate1d(img, nbins, method=("full", algo, "opencl", target))

When I run this code, it fails when it uses the method: ('full', 'csr', 'opencl', target).

The warning thrown is: WARNING:pyFAI.azimuthalIntegrator:MemoryError: falling back on default forward implementation

pyFAI can't recover from this, and the problem seems to stem from not being able to reset the engine. It just gets stuck in an infinite wait loop.

kif commented 1 day ago

Hi Dario,

Your message puzzles me somewhat because I was not aware that pyFAI was actually running on Apple-silicon GPU. CPU should be OK, and I tested it, but not GPU, since Apple enforces the use of Metal, which in turn requires MoltenVK... I did not manage to get it working, but I did not try really hard either. Could you please send me the output of clinfo and clpeak? At least for my personal record.

There is something else that puzzles me even further: MemoryError means you ran out of memory while initializing the processing. Of course, pyFAI eats more and more memory on the GPU, since it allocates all memory buffers at initialization and new (advanced) methods require extra buffers. Nevertheless, I am able to run all tests and even pyFAI-benchmark -c -g on a computer with 4GB of video RAM (which is fairly low by today's standards). This also means the initialization can fail if other programs are using the GPU memory. One classical example is deep-learning frameworks like pytorch or tensorflow, which allocate all GPU memory by default, leaving nothing for other programs like pyFAI.

Sorry for answering your message with so many questions ...
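As a rough illustration of the memory budget mentioned above, here is a minimal sketch that counts how many raw frames fit in a given amount of device memory. The Eiger 4M shape of (2167, 2070) and the 4-byte pixel size are assumptions, and the raw frame count is only a proxy: it does not model the buffers pyFAI actually allocates. The helper `describe_opencl_devices` is hypothetical and simply reads standard pyopencl device attributes.

```python
import math


def frames_fitting(mem_bytes: int, shape: tuple, itemsize: int) -> float:
    """Return how many frames of the given shape and item size fit in mem_bytes."""
    return mem_bytes / (math.prod(shape) * itemsize)


def describe_opencl_devices():
    """List (name, global memory, max single allocation) for each OpenCL device.

    pyopencl is imported lazily so the arithmetic above works without it.
    """
    import pyopencl as cl
    return [(dev.name, dev.global_mem_size, dev.max_mem_alloc_size)
            for platform in cl.get_platforms()
            for dev in platform.get_devices()]


# Assumption: Eiger 4M detector shape and 32-bit pixels.
EIGER_4M_SHAPE = (2167, 2070)
n_frames = frames_fitting(4 * 2**30, EIGER_4M_SHAPE, 4)
print(f"Raw frames fitting in 4 GiB of video RAM: {n_frames:.1f}")
```

Comparing the size of one frame with the "Max memory allocation" value reported by clinfo gives a quick sanity check on whether a MemoryError can plausibly be a real out-of-memory condition.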

Jerome

vincefn commented 14 hours ago

FYI the clinfo on a macbook M3 Pro:

(pyobjcryst) favre@macfavre2-wifi braggptycho % clinfo
Number of platforms                               1
  Platform Name                                   Apple
  Platform Vendor                                 Apple
  Platform Version                                OpenCL 1.2 (Jul 19 2024 22:07:05)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event

  Platform Name                                   Apple
Number of devices                                 1
  Device Name                                     Apple M3 Pro
  Device Vendor                                   Apple
  Device Vendor ID                                0x1027f00
  Device Version                                  OpenCL 1.2 
  Driver Version                                  1.2 1.0
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               14
  Max clock frequency                             1000MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             256x256x256
  Max work group size                             256
  Preferred work group size multiple (kernel)     32
  Preferred / native vector sizes                 
    char                                                 1 / 1       
    short                                                1 / 1       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (n/a)
    float                                                1 / 1       
    double                                               1 / 1        (n/a)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (n/a)
  Address bits                                    64, Little-Endian
  Global memory size                              12884918272 (12GiB)
  Error Correction support                        No
  Max memory allocation                           2415919104 (2.25GiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             1 bytes
  Alignment of base address                       32768 bits (4096 bytes)
  Global Memory cache type                        None
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            268435456 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   256 bytes
    Pitch alignment for 2D image buffers          256 pixels
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             2048x2048x2048 pixels
    Max number of read image args                 128
    Max number of write image args                8
  Local memory type                               Local
  Local memory size                               32768 (32KiB)
  Max number of constant args                     31
  Max constant buffer size                        1073741824 (1024MiB)
  Max size of kernel argument                     4096 (4KiB)
  Queue properties                                
    Out-of-order execution                        No
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images 

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              Success [P0]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 Apple
    Device Name                                   Apple M3 Pro
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 Apple
    Device Name                                   Apple M3 Pro
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  Invalid device type for platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 Apple
    Device Name                                   Apple M3 Pro

ICD loader properties
  ICD loader Name                                 Khronos OpenCL ICD Loader
  ICD loader Vendor                               Khronos Group
  ICD loader Version                              3.0.6
  ICD loader Profile                              OpenCL 3.0
(pyobjcryst) favre@macfavre2-wifi braggptycho % clpeak

Platform: Apple
  Device: Apple M3 Pro
    Driver version  : 1.2 1.0 (Macintosh)
    Compute units   : 14
    Clock frequency : 1000 MHz

    Global memory bandwidth (GBPS)
      float   : 130.77
      float2  : 135.06
      float4  : 136.57
      float8  : 137.66
      float16 : 133.59

    Single-precision compute (GFLOPS)
      float   : 2429.10
      float2  : 2431.20
      float4  : 2451.80
      float8  : 2453.77
      float16 : 2444.62

    No half precision support! Skipped

    No double precision support! Skipped

    Integer compute (GIOPS)
      int   : 1232.02
      int2  : 1233.54
      int4  : 1233.38
      int8  : 1233.25
      int16 : 1232.63

    Integer compute Fast 24bit (GIOPS)
      int   : 844.44
      int2  : 1011.48
      int4  : 1012.71
      int8  : 1011.63
      int16 : 1010.51

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 27.99
      enqueueReadBuffer               : 27.47
      enqueueWriteBuffer non-blocking : 57.47
      enqueueReadBuffer non-blocking  : 57.05
      enqueueMapBuffer(for read)      : 754974.75
        memcpy from mapped ptr        : 33.41
      enqueueUnmap(after write)       : 159994.64
        memcpy to mapped ptr          : 43.02

    Kernel launch latency : 0.68 us

kif commented 13 hours ago

So basically, it confirms the bug ...

kif commented 10 hours ago

Some tests pass, some don't on Apple silicon GPU:

Passes:

Broken

Since preproc is used by all other kernels, it makes sense to investigate there.

kif commented 10 hours ago

Apparently the bug is related to the treatment of double-precision floating-point values in preproc, which prevents the code from compiling.

DarioLewczyk commented 7 hours ago

Hi Jerome,

Here are the outputs of clinfo and clpeak.

Number of platforms                               1
  Platform Name                                   Apple
  Platform Vendor                                 Apple
  Platform Version                                OpenCL 1.2 (Nov  2 2024 12:00:13)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event

  Platform Name                                   Apple
Number of devices                                 1
  Device Name                                     Apple M1 Max
  Device Vendor                                   Apple
  Device Vendor ID                                0x1027f00
  Device Version                                  OpenCL 1.2 
  Driver Version                                  1.2 1.0
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               32
  Max clock frequency                             1000MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             256x256x256
  Max work group size                             256
  Preferred work group size multiple (kernel)     32
  Preferred / native vector sizes                 
    char                                                 1 / 1       
    short                                                1 / 1       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (n/a)
    float                                                1 / 1       
    double                                               1 / 1        (n/a)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (n/a)
  Address bits                                    64, Little-Endian
  Global memory size                              51539607552 (48GiB)
  Error Correction support                        No
  Max memory allocation                           9663676416 (9GiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             1 bytes
  Alignment of base address                       32768 bits (4096 bytes)
  Global Memory cache type                        None
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            268435456 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   256 bytes
    Pitch alignment for 2D image buffers          256 pixels
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             2048x2048x2048 pixels
    Max number of read image args                 128
    Max number of write image args                8
  Local memory type                               Local
  Local memory size                               32768 (32KiB)
  Max number of constant args                     31
  Max constant buffer size                        1073741824 (1024MiB)
  Max size of kernel argument                     4096 (4KiB)
  Queue properties                                
    Out-of-order execution                        No
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images 

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  Apple
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [P0]
  clCreateContext(NULL, ...) [default]            Success [P0]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 Apple
    Device Name                                   Apple M1 Max
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 Apple
    Device Name                                   Apple M1 Max
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  Invalid device type for platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 Apple
    Device Name                                   Apple M1 Max
Platform: Apple
  Device: Apple M1 Max
    Driver version  : 1.2 1.0 (Macintosh)
    Compute units   : 32
    Clock frequency : 1000 MHz

    Global memory bandwidth (GBPS)
      float   : 354.84
      float2  : 362.64
      float4  : 357.37
      float8  : 343.09
      float16 : 200.69

    Single-precision compute (GFLOPS)
      float   : 4805.67
      float2  : 4894.02
      float4  : 4920.20
      float8  : 4096.86
      float16 : 5149.60

    No half precision support! Skipped

    No double precision support! Skipped

    Integer compute (GIOPS)
      int   : 2638.85
      int2  : 2633.48
      int4  : 2637.61
      int8  : 2623.68
      int16 : 2640.54

    Integer compute Fast 24bit (GIOPS)
      int   : 1230.24
      int2  : 1278.39
      int4  : 1275.81
      int8  : 1276.06
      int16 : 1266.99

    Integer char (8bit) compute (GIOPS)
      char   : 2639.90
      char2  : 2640.65
      char4  : 2639.65
      char8  : 2637.88
      char16 : 2638.85

    Integer short (16bit) compute (GIOPS)
      short   : 2639.52
      short2  : 2639.68
      short4  : 2639.60
      short8  : 2638.01
      short16 : 2639.59

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 48.52
      enqueueReadBuffer               : 46.82
      enqueueWriteBuffer non-blocking : 66.28
      enqueueReadBuffer non-blocking  : 62.25
      enqueueMapBuffer(for read)      : 1227133.62
        memcpy from mapped ptr        : 34.09
      enqueueUnmap(after write)       : 421075.22
        memcpy to mapped ptr          : 37.56

    Kernel launch latency : 2.49 us

As for whether the computer ran out of memory: no. I had about 40 GB of RAM left, so I really don't understand the memory error either.

kif commented 2 hours ago

Hi Dario,

Thanks for the feedback. Indeed, the computer apparently has enough resources.

I managed to reproduce the bug with the help of VinceFN. I believe it is linked to a bug in the Apple OpenCL compiler. No need to report it to Apple: they would just suggest migrating to Metal. That is unlikely to happen, rather the opposite, since the primary platform for pyFAI is Linux, as requested by my employer, not macOS, which is barely accepted at ESRF.

Double precision (aka fp64) is not a first-class citizen on GPU and was never supported on Apple hardware, but in the Intel CPU era the compiler properly reported that it was not usable. On Apple Silicon hardware, the host code has the information that OpenCL has no support for fp64, but inside the compiler the variable is set as if it were available. In pyFAI, there is a bit of fp64 code, but it is removed if the hardware does not support it.

As a consequence, the compilation fails when encountering the double-precision section... of course, without a clear error message.
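To illustrate the host-side check described above, here is a minimal sketch that decides fp64 availability from the device extension string, as listed under "Device Extensions" in clinfo. It assumes that fp64 support is advertised through the standard `cl_khr_fp64` extension name. The helper names are hypothetical and this is not the code from the pull request.

```python
def has_fp64(extensions: str) -> bool:
    """Return True if the extension string advertises cl_khr_fp64."""
    return "cl_khr_fp64" in extensions.split()


def fp64_support_by_device():
    """Map each OpenCL device name to its host-side fp64 flag.

    pyopencl is imported lazily so has_fp64 stays usable without it.
    """
    import pyopencl as cl
    return {dev.name: has_fp64(dev.extensions)
            for platform in cl.get_platforms()
            for dev in platform.get_devices()}


# Excerpt of the device extensions reported by clinfo in this thread.
APPLE_EXTENSIONS = (
    "cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions "
    "cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics "
    "cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images"
)
print(has_fp64(APPLE_EXTENSIONS))  # prints False
```

Relying on a flag computed on the host, rather than on what the device compiler claims, is one way to keep the two views consistent.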

This pull request should fix this issue ... https://github.com/silx-kit/pyFAI/pull/2341

Could you please try it and validate the fix? Thanks