Open DarioLewczyk opened 1 day ago
Hi Dario,
Your message puzzles me somewhat, because I was not aware that pyFAI was actually running on Apple-silicon GPUs.
CPU should be OK, and I tested it, but not the GPU, since Apple enforces the use of Metal, which in turn requires MoltenVK...
I did not manage to get it working, but I did not try very hard either.
Could you please send me the output of clinfo and clpeak?
At least for my personal records.
There is something else that puzzles me even further: the MemoryError, which means you ran out of memory while initializing the processing. Of course, pyFAI uses more and more memory on the GPU, since it allocates all memory buffers at initialization and the newer (advanced) methods require extra buffers. Nevertheless, I am able to run all tests and even pyFAI-benchmark -c -g on a computer with 4 GB of video RAM (which is fairly low by today's standards). This also means the initialization can fail if other programs are using the GPU memory. One classic example is deep-learning frameworks like pytorch or tensorflow, which allocate all GPU memory by default, leaving nothing for other programs like pyFAI.
Sorry for answering your message with so many questions...
Jerome
FYI, the clinfo and clpeak output on a MacBook M3 Pro:
(pyobjcryst) favre@macfavre2-wifi braggptycho % clinfo
Number of platforms 1
Platform Name Apple
Platform Vendor Apple
Platform Version OpenCL 1.2 (Jul 19 2024 22:07:05)
Platform Profile FULL_PROFILE
Platform Extensions cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event
Platform Name Apple
Number of devices 1
Device Name Apple M3 Pro
Device Vendor Apple
Device Vendor ID 0x1027f00
Device Version OpenCL 1.2
Driver Version 1.2 1.0
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 14
Max clock frequency 1000MHz
Device Partition (core)
Max number of sub-devices 0
Supported partition types None
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 256x256x256
Max work group size 256
Preferred work group size multiple (kernel) 32
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (n/a)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (n/a)
Address bits 64, Little-Endian
Global memory size 12884918272 (12GiB)
Error Correction support No
Max memory allocation 2415919104 (2.25GiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 1 bytes
Alignment of base address 32768 bits (4096 bytes)
Global Memory cache type None
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 268435456 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 256 bytes
Pitch alignment for 2D image buffers 256 pixels
Max 2D image size 16384x16384 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 128
Max number of write image args 8
Local memory type Local
Local memory size 32768 (32KiB)
Max number of constant args 31
Max constant buffer size 1073741824 (1024MiB)
Max size of kernel argument 4096 (4KiB)
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
printf() buffer size 1048576 (1024KiB)
Built-in kernels (n/a)
Device Extensions cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] Success [P0]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name Apple
Device Name Apple M3 Pro
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name Apple
Device Name Apple M3 Pro
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) Invalid device type for platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name Apple
Device Name Apple M3 Pro
ICD loader properties
ICD loader Name Khronos OpenCL ICD Loader
ICD loader Vendor Khronos Group
ICD loader Version 3.0.6
ICD loader Profile OpenCL 3.0
(pyobjcryst) favre@macfavre2-wifi braggptycho % clpeak
Platform: Apple
Device: Apple M3 Pro
Driver version : 1.2 1.0 (Macintosh)
Compute units : 14
Clock frequency : 1000 MHz
Global memory bandwidth (GBPS)
float : 130.77
float2 : 135.06
float4 : 136.57
float8 : 137.66
float16 : 133.59
Single-precision compute (GFLOPS)
float : 2429.10
float2 : 2431.20
float4 : 2451.80
float8 : 2453.77
float16 : 2444.62
No half precision support! Skipped
No double precision support! Skipped
Integer compute (GIOPS)
int : 1232.02
int2 : 1233.54
int4 : 1233.38
int8 : 1233.25
int16 : 1232.63
Integer compute Fast 24bit (GIOPS)
int : 844.44
int2 : 1011.48
int4 : 1012.71
int8 : 1011.63
int16 : 1010.51
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 27.99
enqueueReadBuffer : 27.47
enqueueWriteBuffer non-blocking : 57.47
enqueueReadBuffer non-blocking : 57.05
enqueueMapBuffer(for read) : 754974.75
memcpy from mapped ptr : 33.41
enqueueUnmap(after write) : 159994.64
memcpy to mapped ptr : 43.02
Kernel launch latency : 0.68 us
So basically, it confirms the bug ...
Some tests pass and some don't on the Apple silicon GPU.
Since preproc is used by all the other kernels, it makes sense to investigate there.
Apparently the bug is related to the treatment of double precision floating point values in preproc, which prevents the code from compiling.
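A quick host-side check of whether a device advertises double precision is to look for the cl_khr_fp64 extension in its extension list. The sketch below is only an illustration: the function name supports_fp64 is hypothetical, and the extension string is an abbreviated copy of the "Device Extensions" line from the M3 Pro clinfo output earlier in this thread.

```python
def supports_fp64(extensions: str) -> bool:
    """Return True if the space-separated extension list advertises fp64."""
    return "cl_khr_fp64" in extensions.split()


# Abbreviated copy of the "Device Extensions" line reported by clinfo
# for the Apple M3 Pro (the full list does not contain cl_khr_fp64 either).
APPLE_M3_PRO_EXTENSIONS = (
    "cl_APPLE_SetMemObjectDestructor cl_khr_gl_event "
    "cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics "
    "cl_khr_3d_image_writes cl_khr_depth_images"
)

print(supports_fp64(APPLE_M3_PRO_EXTENSIONS))  # False
```

With pyopencl, the same string should be available as the extensions attribute of a device object, so the check can be run against the live device rather than a pasted log.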
Hi Jerome,
Here are the outputs of clinfo and clpeak.
Number of platforms 1
Platform Name Apple
Platform Vendor Apple
Platform Version OpenCL 1.2 (Nov 2 2024 12:00:13)
Platform Profile FULL_PROFILE
Platform Extensions cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event
Platform Name Apple
Number of devices 1
Device Name Apple M1 Max
Device Vendor Apple
Device Vendor ID 0x1027f00
Device Version OpenCL 1.2
Driver Version 1.2 1.0
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 32
Max clock frequency 1000MHz
Device Partition (core)
Max number of sub-devices 0
Supported partition types None
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 256x256x256
Max work group size 256
Preferred work group size multiple (kernel) 32
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (n/a)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (n/a)
Address bits 64, Little-Endian
Global memory size 51539607552 (48GiB)
Error Correction support No
Max memory allocation 9663676416 (9GiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 1 bytes
Alignment of base address 32768 bits (4096 bytes)
Global Memory cache type None
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 268435456 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 256 bytes
Pitch alignment for 2D image buffers 256 pixels
Max 2D image size 16384x16384 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 128
Max number of write image args 8
Local memory type Local
Local memory size 32768 (32KiB)
Max number of constant args 31
Max constant buffer size 1073741824 (1024MiB)
Max size of kernel argument 4096 (4KiB)
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
printf() buffer size 1048576 (1024KiB)
Built-in kernels (n/a)
Device Extensions cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) Apple
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [P0]
clCreateContext(NULL, ...) [default] Success [P0]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name Apple
Device Name Apple M1 Max
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name Apple
Device Name Apple M1 Max
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) Invalid device type for platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name Apple
Device Name Apple M1 Max
Platform: Apple
Device: Apple M1 Max
Driver version : 1.2 1.0 (Macintosh)
Compute units : 32
Clock frequency : 1000 MHz
Global memory bandwidth (GBPS)
float : 354.84
float2 : 362.64
float4 : 357.37
float8 : 343.09
float16 : 200.69
Single-precision compute (GFLOPS)
float : 4805.67
float2 : 4894.02
float4 : 4920.20
float8 : 4096.86
float16 : 5149.60
No half precision support! Skipped
No double precision support! Skipped
Integer compute (GIOPS)
int : 2638.85
int2 : 2633.48
int4 : 2637.61
int8 : 2623.68
int16 : 2640.54
Integer compute Fast 24bit (GIOPS)
int : 1230.24
int2 : 1278.39
int4 : 1275.81
int8 : 1276.06
int16 : 1266.99
Integer char (8bit) compute (GIOPS)
char : 2639.90
char2 : 2640.65
char4 : 2639.65
char8 : 2637.88
char16 : 2638.85
Integer short (16bit) compute (GIOPS)
short : 2639.52
short2 : 2639.68
short4 : 2639.60
short8 : 2638.01
short16 : 2639.59
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 48.52
enqueueReadBuffer : 46.82
enqueueWriteBuffer non-blocking : 66.28
enqueueReadBuffer non-blocking : 62.25
enqueueMapBuffer(for read) : 1227133.62
memcpy from mapped ptr : 34.09
enqueueUnmap(after write) : 421075.22
memcpy to mapped ptr : 37.56
Kernel launch latency : 2.49 us
As for whether the computer ran out of memory: no. I had about 40 GB of RAM free, so I really don't understand the memory error either.
Hi Dario,
Thanks for the feedback. Indeed, the computer apparently has enough resources.
I managed to reproduce the bug with the help of VinceFN. I believe it is linked to a bug in the Apple OpenCL compiler. No need to report it to Apple: they would just suggest migrating to Metal. That is unlikely to happen, rather the opposite, since the primary platform for pyFAI is Linux, as requested by my employer, not macOS, which is barely accepted at ESRF.
Double precision (aka fp64) is not a first-class citizen on GPUs and was never supported on Apple hardware, but in the Intel CPU era the compiler correctly reported that it was not usable. On Apple Silicon hardware, the host code knows that OpenCL has no support for fp64, but inside the compiler the variable is set as if it were available. pyFAI contains a bit of fp64 code, which is removed if the hardware does not support it.
Because the compiler wrongly claims support, that section is kept, and the compilation fails when it encounters the double precision section... of course, without a clear error message.
This pull request should fix the issue: https://github.com/silx-kit/pyFAI/pull/2341
Could you please try it and confirm that it works? Thanks
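To illustrate the mismatch described above, one possible workaround is to let the host decide whether fp64 may be used and pass that decision to the OpenCL compiler explicitly, instead of trusting a macro the compiler defines itself. The sketch below is not necessarily what the pull request does: the macro name USE_FP64 and the function name fp64_build_options are hypothetical, and the kernel source would have to test that macro for this to have any effect.

```python
def fp64_build_options(device_extensions: str) -> list[str]:
    """Build compiler options stating explicitly whether fp64 may be used.

    USE_FP64 is a hypothetical macro name. The kernel source would test it
    (for example with #if USE_FP64) rather than rely on the compiler-defined
    cl_khr_fp64 macro, which is reported to be unreliable on Apple Silicon.
    """
    has_fp64 = "cl_khr_fp64" in device_extensions.split()
    return ["-D", f"USE_FP64={int(has_fp64)}"]


# Extension lists as a host program would read them from the device
print(fp64_build_options("cl_khr_gl_event cl_khr_depth_images"))
print(fp64_build_options("cl_khr_fp64 cl_khr_gl_event"))
```

With pyopencl, a list like this could be handed to Program.build through its options argument, so the host's view of the hardware always wins over the compiler's.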
I have been trying to update a piece of code I wrote to run on an M1 Max MacBook Pro with 64 GB of RAM to make use of the GPU. In a previous version of pyFAI, this was easily accomplished using the method ('full', 'csr', 'opencl', 'gpu'). Now that method does not work. I went through the tutorial in the pyFAI documentation (https://pyfai.readthedocs.io/en/stable/usage/tutorial/Parallelization/GPU-decompression.html) and the issue remains.
When I run this code, it fails when it uses the method: ('full', 'csr', 'opencl', target).
The warning thrown is: WARNING:pyFAI.azimuthalIntegrator:MemoryError: falling back on default forward implementation
pyFAI can't recover from this, and it seems to stem from not being able to reset the engine. It just gets stuck in an infinite wait loop.
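Since the fallback is only reported as a log message, a script can watch for it with a standard logging handler. The sketch below assumes the logger name pyFAI.azimuthalIntegrator, taken from the warning line above. The class name FallbackDetector is hypothetical, and the handler only records that the warning was emitted; it does not prevent the subsequent hang.

```python
import logging


class FallbackDetector(logging.Handler):
    """Remember whether a MemoryError fallback warning was logged."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.triggered = False

    def emit(self, record):
        # The warning text quoted in this report starts with "MemoryError"
        if "MemoryError" in record.getMessage():
            self.triggered = True


detector = FallbackDetector()
logging.getLogger("pyFAI.azimuthalIntegrator").addHandler(detector)

# Run the integration here, then inspect detector.triggered to find out
# whether the OpenCL method was silently replaced by the default one.
```

The flag is set at the moment the warning is emitted, so it can also be polled from another thread if the main call never returns.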