vincefn / pyvkfft

Python interface to VkFFT
MIT License

VkFFT 1.3 #25

Closed DTolm closed 1 year ago

DTolm commented 1 year ago

Dear @vincefn,

I have made some substantial changes to VkFFT in version 1.3 (https://github.com/DTolm/VkFFT/tree/develop), so there will be a two-month period before it is merged into the main branch, for people to adjust dependent projects. Namely, VkFFT is no longer a single header-only file, but rather a collection of headers. This should not increase the complexity of usage - you still include the vkFFT.h file and it includes the other files. The main advantage is that the code base is now more structured and much easier for other developers to understand.

I have tested the code on some systems with the included benchmark scripts - RTX 2080 (Vulkan, CUDA, OpenCL), MI250 (HIP), A100 (CUDA), UHD 610 (Level Zero) and M1 Pro (Metal) - however, your suite is more thorough in this regard. Also, you might be interested in exploring the new design.

I suppose keeping this issue open for this period can be helpful for discussion.

Best regards, Dmitrii

vincefn commented 1 year ago

Hi @DTolm, thanks for the heads-up, I'll try to adapt pyvkfft in a few days. (PS: I was actually going to mail you because I found a performance regression on the V100 (but not only) for some sizes, e.g. a c2c 512**3 3D transform - almost 2x slower than in version 1.2.21.)

vincefn commented 1 year ago

Dear @DTolm (Cc @picca)

Maybe one quick comment regarding the organisation of the headers: splitting the different parts is a good thing, it will be more usable that way. However I do not know how you intend the headers to be distributed and used. Previously, this just required adding one path and using #include "vkFFT.h". I see that for Debian the approach is similar, given the distributed files (https://packages.debian.org/sid/all/libvkfft-dev/filelist).

Now you are using #include "vkFFT_Structs/vkFFT_Structs.h", which means that you expect a directory organization like:

INCLUDE_DIR/vkFFT.h
INCLUDE_DIR/vkFFT_AppManagement/vkFFT_DeleteApp.h
INCLUDE_DIR/vkFFT_AppManagement/...
INCLUDE_DIR/vkFFT_CodeGen/vkFFT_KernelsLevel0/vkFFT_KernelUtils.h
INCLUDE_DIR/vkFFT_CodeGen/...
INCLUDE_DIR/vkFFT_PlanManagement/...
INCLUDE_DIR/vkFFT_Structs/...

where INCLUDE_DIR would be e.g. /usr/include under Linux.

This is a little unusual, as it requires copying not one directory (vkFFT) with all the headers, but instead both vkFFT.h and the subdirectories vkFFT_AppManagement, vkFFT_CodeGen, vkFFT_PlanManagement and vkFFT_Structs into the distribution's include directory.

This is just a suggestion, but there would be two simpler ways of doing this: 1) do not change your directory structure, but include from one level up.
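
For instance (a sketch, assuming the current file names are kept):

#include "vkFFT/vkFFT_Structs/vkFFT_Structs.h"

with the whole tree under a single directory:

INCLUDE_DIR/vkFFT/vkFFT.h
INCLUDE_DIR/vkFFT/vkFFT_AppManagement/...
INCLUDE_DIR/vkFFT/vkFFT_CodeGen/...
INCLUDE_DIR/vkFFT/vkFFT_PlanManagement/...
INCLUDE_DIR/vkFFT/vkFFT_Structs/...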

This has the advantage of only requiring a single copy or link of the vkFFT include folder.

2) simplify a bit more and use

INCLUDE_DIR/vkFFT.h
INCLUDE_DIR/vkFFT/AppManagement/DeleteApp.h
INCLUDE_DIR/vkFFT/AppManagement/...
INCLUDE_DIR/vkFFT/CodeGen/KernelsLevel0/KernelUtils.h
INCLUDE_DIR/vkFFT/CodeGen/...
INCLUDE_DIR/vkFFT/PlanManagement/...
INCLUDE_DIR/vkFFT/Structs/...

I think the latter is a much more standard way of organising the include directories - and more convenient, with one top include header and one directory.

I've also put @picca in Cc, who maintains the VkFFT Debian package, for further advice.

DTolm commented 1 year ago

Dear @vincefn,

I am happy to change the codebase directory layout to be more in line with other header-only libraries - so probably the second approach is the easiest?

As for the performance regressions - it is likely related to experiments with the dispatch threadblock size. They can improve some systems while making others worse, and it is hard to automatically decide the best size for all systems. Currently, this decision-maker is located on lines 447-773 of the vkFFT_Plan_FFT file and should probably be moved to a separate file as well. I will check what happens with the 512^3 system and try to make some improvements.

Best regards, Dmitrii

vincefn commented 1 year ago

I am happy to change the codebase directory layout to be more in line with other header-only libraries - so probably the second approach is the easiest?

I think this would be best. But please get another opinion before committing - I don't know if @picca has an opinion on this.

As for the performance regressions - it is likely related to experiments with the dispatch threadblock size. They can improve some systems while making others worse, and it is hard to automatically decide the best size for all systems. Currently, this decision-maker is located on lines 447-773 of the vkFFT_Plan_FFT file and should probably be moved to a separate file as well. I will check what happens with the 512^3 system and try to make some improvements.

Could it make sense to make this tunable - changing the threadblock size using an optional parameter? You could imagine using this like FFTW with its FFTW_ESTIMATE and FFTW_MEASURE modes - it would be the task of the calling library (e.g. pyvkfft) to test the performance of different sizes. I don't know how deep this is in the code, and so whether that would be possible.

Maybe it's already tunable through e.g. maxThreadNum - let me know.
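
To illustrate the FFTW_MEASURE-like idea, here is a minimal sketch of such a measure-style search from Python - assuming, hypothetically, that a VkFFT parameter like coalescedMemory could be forwarded as a keyword argument to VkFFTApp (the candidate values are made up too):

import time
import numpy as np
import pyopencl as cl
import pyopencl.array as cla
from pyvkfft.opencl import VkFFTApp

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
a = cla.to_device(queue, np.random.rand(512, 512).astype(np.complex64))

timings = {}
for cm in (32, 64, 128, 256):
    # coalescedMemory as a keyword is hypothetical here - the point is only
    # that the calling library times each candidate plan, FFTW_MEASURE-style
    app = VkFFTApp(a.shape, a.dtype, queue, ndim=2, coalescedMemory=cm)
    app.fft(a)            # warm-up, includes kernel compilation
    queue.finish()
    t0 = time.perf_counter()
    for _ in range(20):
        app.fft(a)        # in-place C2C transform
    queue.finish()
    timings[cm] = time.perf_counter() - t0
print("Fastest coalescedMemory:", min(timings, key=timings.get))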

vincefn commented 1 year ago

Regarding the speed issue, here are some comparison benchmarks:

The A2000 has no issues in 2D (at least up to 1024) but shows a decrease in 3D (in the graphs below, sizes up to 1024 are 2D transforms and up to 660 are 3D - all batched, so the actual arrays transformed are about 1 GB):

[figures: pyvkfft-benchmark-A2000-2D, pyvkfft-benchmark-A2000-3D]

For the Titan V and V100 the decrease is clearer: [figures: pyvkfft-benchmark-TitanV-2D, pyvkfft-benchmark-TitanV-3D]

[figures: pyvkfft-benchmark-V100-2D, pyvkfft-benchmark-V100-3D]

DTolm commented 1 year ago

That's unexpected - I will investigate the reason today and add the update to the 1.3.0 branch. Thanks for finding it!

picca commented 1 year ago

Hello, just a comment about the include files: I prefer the 2nd solution.

You have to decide for your users what the preferred include directive is:

#include <vkFFT.h>

Then you decide where the include files are installed by default (for example with the autotools, it is under the includedir path, which is by default $DESTDIR/usr/include/). But you can also decide to install under a versioned directory for your library,

something like

/usr/include/vkFFT-'X'/
/usr/include/vkFFT-'X'/vkFFT.h
/usr/include/vkFFT-'X'/vkFFT/...

This is the way the gtk libraries are installed. Usually a library comes with a pkg-config file which allows finding the -I{dir} where the include files are expected. This is particularly important if you select the versioned directory structure, because this directory is not part of the default include search paths.

Then using pkg-config we have:

pkg-config --cflags vkFFT
-I/usr/include/vkFFT-X/

pkg-config --libs vkFFT
(nothing - this is a pure header-only library)

The install script should also take care of the pkg-config file generated during the build.
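
For illustration, a minimal sketch of what such a vkFFT.pc file could look like (version 'X' and paths hypothetical, following the versioned layout above):

prefix=/usr
includedir=${prefix}/include/vkFFT-X

Name: VkFFT
Description: Header-only runtime GPU FFT library
Version: X
Cflags: -I${includedir}
Libs: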

I am not a CMake specialist, but I think the situation is identical there.

For now I would keep the 2nd solution proposed by Vincent, since using two versions of the library at the same time is not something expected (I think).

Do not hesitate to ask further questions if I was not clear enough.

Frederic

vincefn commented 1 year ago

Hi @picca - thanks for the feedback. I don't think versioning would be needed for VkFFT, so I guess solution 2) above would be better - and it changes nothing for the end user: #include <vkFFT.h> would still work.

vincefn commented 1 year ago

Hi @DTolm - here are longer 2D benchmarks. The decrease in performance is largely localised in the length<1024 region:

Titan V: [figure: pyvkfft-benchmark-TitanV-2D]

For the V100, I'm surprised at the spread compared to the Titan V: [figure: pyvkfft-benchmark-V100-2D]

The A2000 throughput is remarkably stable over a wide range: [figure: pyvkfft-benchmark-A2000-2D]

DTolm commented 1 year ago

@picca Sounds good, I will switch to solution 2 without versioning in one of the next develop branch updates.

@vincefn I have identified the issue with the 512^3 system - it is again related to distant coalesced memory accesses in the z-axis FFT. I changed the logic for this between 1.2.21 and 1.2.33 and it stopped coalescing as much as possible (which is the right approach for this problem). I will need to rethink the logic once again and maybe write a small autotuner for this.

As for the Titan V results: systems < 1024^2 take < 8 MB, are really dependent on the L2 cache, and can be greatly affected by background tasks. I couldn't verify the discrepancy on an RTX 2080. The Titan V chip is essentially the same as the V100 with some SMs disabled, so I am not sure why there is such a difference (VkFFT surely produces the same binaries for them). I will try to investigate the drop between 1024 and 1536, which also happens when an SM uses between 64 and 96 KB of shared memory for the y axis.

The A2000 has such low bandwidth for its chip that it is just never compute-limited (the Ampere architecture also merged the fp32/int cores, which greatly helps in the case of FFTs).

vincefn commented 1 year ago

Hi @DTolm, I have begun playing around with advanced VkFFT parameters to see if they could easily be used to bake more optimal plans.

There is now a pyvkfft-benchmark script which can be used to test those relatively easily.

Some preliminary interesting results (only using radix 2 & 3):

On my MacBook's M1 with OpenCL, lowering aimThreads to 32 (maybe 64 would be enough; 16 gives similar results) gives improvements in 2D and 3D: [figures: pyvkfft-benchmark-M1-2D-threads, pyvkfft-benchmark-M1-3D-threads]

On a Titan V (using CUDA), increasing coalescedMemory to 128 or 256 (for 3D) gives very nice improvements: [figures: pyvkfft-benchmark-TitanV-2D-coalmem, pyvkfft-benchmark-TitanV-3D-coalmem]

And finally for a V100 (also CUDA), also tweaking coalescedMemory: [figures: pyvkfft-benchmark-V100-2D-coalmem, pyvkfft-benchmark-V100-3D-coalmem]

I like this very much - as you said, finding optimal parameters can be very tricky given the diversity of GPU configurations, so searching for the best options could easily be done by a higher-level library like pyvkfft, if the user chooses to do so (like the FFTW_MEASURE approach). In my case, where I use iterative algorithms, this can make a lot of difference.

DTolm commented 1 year ago

@vincefn This study is also very interesting from the perspective of understanding the GPU. I have added access to the batching parameter - groupedBatch. Batching here means how many FFTs are done per single kernel by a dedicated threadblock. It will try to force the user's value for batching for each of the three axes. It affects how many threads are allocated, the read/write pattern, coalescing, etc. I also attach a simple search routine that looks for an optimal parameter on a discrete grid of batches - it is slow and unoptimized, but still improves the results.

Below are the results for an Nvidia A100: [figure: optimization_A100]

In the text file, you can see the batching parameters that the routine has found. batching_test_A100.txt

sample_1003_benchmark_VkFFT_single_3d_2_512.zip

DTolm commented 1 year ago

As for the increase of coalescedMemory to 256 bytes - it solves the issues of big systems (as they benefit from more batching), but will be detrimental to small systems (as some compute units will have no work). Also, it shortens the single-upload size in VkFFT, which can also increase the number of memory transfers. So there has to be a logical rule for when to batch more.

vincefn commented 1 year ago

Thanks, I have added the groupedBatch parameter to those which can be tweaked. It will be more complicated to tune, as the range of values is large (compared to simpler parameters like aimThreads and coalescedMemory).

Just to understand - does this affect the batched FFTs, or the number of parallel 1D FFTs performed per block? I mean, if the array is 50x512x512 and the transform is 2D, does this affect how the 512 1D transforms are distributed, or the 50 (batch dimension) ones?

Here are the benchmark results tweaking just the coalescedMemory parameter: [figures: pyvkfft-benchmark-NVIDIA_TITAN_V-cuda-2D-coalmem, pyvkfft-benchmark-NVIDIA_TITAN_V-cuda-3D-coalmem, pyvkfft-benchmark-Tesla_V100-SXM2-32GB-cuda-2D-coalmem, pyvkfft-benchmark-Tesla_V100-SXM2-32GB-cuda-3D-coalmem]

DTolm commented 1 year ago

In the 50x512x512 case, for the first-axis 512-length FFT, it determines how many of the 50x512 total FFTs are executed per threadblock (usually 1-16). It is also indirectly affected by coalescedMemory and aimThreads (but forcing groupedBatch bypasses that).
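
As a rough illustration of that accounting (plain Python arithmetic, not VkFFT's actual dispatch logic; the 1-16 range is just the usual one mentioned above):

nz, ny, nx = 50, 512, 512
n_ffts_axis0 = nz * ny   # 25600 1D FFTs of length nx=512 along the first axis
for grouped_batch in (1, 4, 8, 16):
    print(f"{grouped_batch} FFTs per threadblock -> {n_ffts_axis0 // grouped_batch} threadblocks")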

vincefn commented 1 year ago

Ok, so for a single (non-batched) 3D transform of size 512x512x512, the groupedBatch parameter would not be relevant then (it's the size I was working on when I noticed the speed discrepancy). Thanks!

DTolm commented 1 year ago

It will be relevant - by increasing the groupedBatch parameter for the y and z axes you will solve the speed issue.

vincefn commented 1 year ago

OK, it is slow but it works - at least using the CUDA backend. With OpenCL (on an Nvidia card) I easily end up with a CUDA_ERROR_INVALID_VALUE and program abortion, so I can't really test different configurations.

Results do not seem to be very sensitive to the X-axis batch parameter (I guess that makes sense when the axis is not strided).

DTolm commented 1 year ago

I have not tested the other backends with the groupedBatch option, only the CUDA one - I will check what is wrong there. For the X axis it is indeed less noticeable (only for small systems - cubes up to 100), as bigger systems already have good coalescing and thread assignment.

vincefn commented 1 year ago

Dear @DTolm - regarding the CUDA_ERROR_INVALID_VALUE error with OpenCL, you can probably ignore that. I think this only happened with PoCL, which I normally don't test as it has problems handling all the VkFFT kernels.

However I've begun the systematic tests on the current VkFFT develop branch (including https://github.com/DTolm/VkFFT/issues/112) and I have a few calculation errors (accuracy failures in my tests) with non-radix transforms - see for example on an A40 using CUDA or OpenCL: http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-05-19-a40cl/pyvkfft-test.html http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-05-19-a40cu/pyvkfft-test.html

There are also a number of VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM errors for R2C transforms.

Otherwise I've introduced an auto-tuning option in pyvkfft to optimise parameters like aimThreads, coalescedMemory or warpSize, and it seems to work well when preparing the VkFFTApp.

Here's an example in 2D on a V100: [image] - or on an A100 in 3D: [image]

In those cases I just test coalescedMemory at 32, 64 or 128, so it's cheap but effective :-)

DTolm commented 1 year ago

Dear @vincefn

I think I have fixed the C2C/R2C issue by changing how thread indexing works in one of the edge cases.

As for the coalescedMemory tuning - can you share the results as a text file so I can try to generalize the findings? It is good that the code can be made faster by tweaking one number, but runtime compilation is still an issue in some cases, so I am not sure an autotuner should be the default behavior.

Best regards, Dmitrii

vincefn commented 1 year ago

Thanks @DTolm - indeed the tests seem much better:

http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-05-24-a40cu/pyvkfft-test.html (cuda/A40) http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-05-24-a40cl/pyvkfft-test.html (opencl/A40)

Tests are ongoing but the failures for the C2C seem to be gone. However the R2C transforms still give timeout errors for very specific sizes, which I don't understand. Maybe those transforms generate very large kernels which take a long time to compile - surprising, but I can try to increase the timeout from 10 to 20 s to see.

As for the coalescedMemory tuning - can you share the results as a text file so I can try to generalize the findings?

I can modify the benchmark script so that it also exports the tuned parameter values.

It is good that the code can be made faster by tweaking one number, but runtime compilation is still an issue in some cases, so I am not sure an autotuner should be the default behavior.

Sure - this will definitely remain an option. Not all GPUs seem to benefit.

PS: is there a simple formula to determine when the transform uses the Rader vs the Bluestein algorithm (based on the prime decomposition)? I guess that's related to app->configuration.primeSizes, but I'm not sure how to use that.

vincefn commented 1 year ago

Hmm, actually the timeout is after 120s, not 10. So it's not a question of compilation time.

If I try without parallel processing (in this case on an A2000) with --serial, i.e.:

pyvkfft-test --systematic --backend pycuda --gpu a2000 --max-nb-tests 0 --ndim 1 --range 4198 4400 --r2c --bluestein --norm 1 --serial

Then I get a segmentation fault when it reaches the R2C transform with size 4202. That's with cuda driver 530.30.02 (cuda 12.1) and cuda toolkit 11.7.0 (but the error also appears with the A40 and cuda driver 11.7, same toolkit).

DTolm commented 1 year ago

I see - I checked these sizes on an RTX 2080, while you run on an A40, which is Ampere and has more shared memory. These sizes are within the single-upload limit on it and behave differently compared to Turing; I will fix them later today.

As for the tuning results, it would be best to have all the results, to compare the relative gains - thank you!

As for Rader's transform, the rule is: if the sequence is decomposable as a multiplication of primes, with each prime P satisfying that P-1 is decomposable as a radix 2-13 multiplication, or P being 47, 59 or 83, then Rader's algorithm is used (these are the default values and can be tuned). Each prime must also be less than the single-upload size. Otherwise it is Bluestein's algorithm.
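
A rough Python sketch of that rule as stated (ignoring the single-upload size limit and the fact that the defaults can be tuned; the function names are made up):

def is_radix(n, radices=(2, 3, 5, 7, 11, 13)):
    # True if n is a product of the supported radix primes only
    for r in radices:
        while n % r == 0:
            n //= r
    return n == 1

def prime_factors(n):
    f, p = [], 2
    while p * p <= n:
        while n % p == 0:
            f.append(p)
            n //= p
        p += 1
    if n > 1:
        f.append(n)
    return f

def uses_bluestein(n):
    # Bluestein is the fallback: it is avoided only if every prime factor P is
    # a radix prime, satisfies Rader's condition (P-1 radix-decomposable),
    # or is one of the special primes 47, 59, 83.
    return not all(is_radix(p) or is_radix(p - 1) or p in (47, 59, 83)
                   for p in prime_factors(n))

Rader's algorithm then applies when uses_bluestein(n) is False but n itself is not purely radix-decomposable.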

DTolm commented 1 year ago

I have identified the incorrect register calculation issue in C2R and fixed it, but I have only done so by emulating the Ampere architecture, so I have only checked that the code compiles.

vincefn commented 1 year ago

Looks good so far: http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-05-25-a40cu/pyvkfft-test.html http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-05-25-a40cl/pyvkfft-test.html

Just one DCT2 failure of size 4 in OpenCL - but it may be a fluke, I'll retry it separately.

vincefn commented 1 year ago

OK - the cuda tests all passed on the A40, but there are some failures using OpenCL (also on the A40) with VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM.

I re-tested all failures separately with fewer parallel processes, and you can see the log: http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-05-25-a40cl/log-6821099.txt

There are specific failures with the message: ptxas error : Entry function 'VkFFT_main' uses too much shared data

DTolm commented 1 year ago

The R2C issue has been fixed.

vincefn commented 1 year ago

Thanks @DTolm. The OpenCL tests do look OK (still ongoing):

http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-06-01-a40cl/pyvkfft-test.html

Same for cuda: http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-06-01-a40cu/pyvkfft-test.html

vincefn commented 1 year ago

Hi @DTolm - all the previous tests on Nvidia cards (A40, V100, GTX 1080) have passed (some occasional memory errors, but once re-run with lower parallelism all passed).

However I can now test an AMD Vega56 (gfx900) card, and there are some isolated errors with 1D transforms of size 7700 (C2C float32) and 7700 or 15400 (R2C float32).

http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-06-26-vega56cl/pyvkfft-test.html

The test is still ongoing, but I think the 7700 error is significant and not a fluke.

vincefn commented 1 year ago

Besides the errors with sizes 7700/15400, there are errors with non-radix transforms (Rader, I guess) for sizes like 3074, 3076, 3078, ...

The gfx900 card is actually a Radeon Pro WX 9100.

DTolm commented 1 year ago

Hello,

I couldn't reproduce any of the mentioned errors with a Radeon Pro VII on Windows, so it would really help if you could send the generated kernel code (with keepShaderCode = 1) for sizes 7700 and 3074 in C2C, and 7700 and 15400 in R2C. Then we can know with certainty whether it is a driver issue or a code issue.

Best regards, Dmitrii

vincefn commented 1 year ago

Thanks for looking into this!

I have attached the codes for the different transforms.

code.zip

DTolm commented 1 year ago

@vincefn The issue is now fixed - it was related to AMD GPUs apparently allowing only 256 threads per kernel with OpenCL on Linux, which was scaled incorrectly in some cases before. This was combined with the fact that the Windows and Linux AMD OpenCL drivers also report different max shared memory values (32KB and 64KB respectively). Vulkan can use 1024 threads and the full shared memory on both systems, though.

I have also restructured the code according to suggestion 2, but haven't tested whether make install works yet.

vincefn commented 1 year ago

Thanks!

There's a small issue with the file vkFFT/vkFFT_PlanManagement/vkFFT_Plans/vkFFT_Plan_FFT.H - could you change the extension to a lowercase .h? I guess Windows is case-insensitive, but it won't work under Linux.

Otherwise it compiles fine, I'll launch the tests.

DTolm commented 1 year ago

Oh, sorry, my bad - Visual Studio renamed this file for some reason. It should be fixed now.

vincefn commented 1 year ago

Ok, the radix tests passed on the gfx900 AMD card, but it failed for non-radix sizes - this time slightly larger ones, starting at 3999, 4369, etc.

See: http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-01-amdgfx900/pyvkfft-test.html

I need to reboot the server with that card - the AMD driver crashed again. I can send the kernel code if that helps.

The test is running fine so far on the A40 and V100.

vincefn commented 1 year ago

Here is the example code (c2c two sizes and r2c, all out-of-place): code.zip

DTolm commented 1 year ago

In some Rader cases, there was no check of the maximum number of threads needed against how many threads the device has - for example, 17*257 would need 257 threads, as the scaler that makes each thread do multiple FFTs doesn't work with Rader's subprimes for now.

I added the check - it will split the dispatch in two, or disable Rader's algorithm in the other cases.

Thank you for pointing this out!

vincefn commented 1 year ago

All tests have passed so far (some memory/timeout errors, but apparently due to excessive parallelism):

http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-04-a40cl/pyvkfft-test.html http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-04-a40cu/pyvkfft-test.html http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-04-v100cl/pyvkfft-test.html http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-04-v100cu/pyvkfft-test.html http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-04-a100cl/pyvkfft-test.html http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-04-a100cu/pyvkfft-test.html http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-04-gtx1080ti_cu/pyvkfft-test.html http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-04-amdgfx900/pyvkfft-test.html

The gfx900 test stopped after the first series of non-radix tests (not sure why); I have re-started it (it should finish tonight): http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-05-amdgfx900/pyvkfft-test.html

It's starting to look really good. Do you plan to merge the develop branch and make a new release soon?

DTolm commented 1 year ago

Yes, I believe the release should be out soon (and it aligns perfectly with the grace period given after the creation of the develop branch). I am still not sure how to approach the new autotuning options directly from C, but this can be done after the release.

vincefn commented 1 year ago

Looks like the inverse R2C non-radix out-of-place transforms (I have not yet run the in-place ones) still have some issues - probably only in one value, as the N2 norm looks OK but the N_inf norm does not - for sizes like 2890, 2926, 3060 on the AMD gfx900:

http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-05-amdgfx900/pyvkfft-test.html

DTolm commented 1 year ago

There was a missing check in an optimization where the C2R data pre-processing could read directly into registers instead of doing a shared memory read/write - it didn't check whether the number of participating threads changes for the first radix stage. I have added the check, so these systems should work now.

vincefn commented 1 year ago

There's some issue with this last change - I get a floating point exception when trying an R2C inplace transform of size 34, e.g.:

from pyvkfft.opencl import VkFFTApp
import pyopencl as cl
import pyopencl.array as cla
import numpy as np

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

n = 34
# R2C plan; keepShaderCode=1 prints the generated kernel source
app = VkFFTApp((n,), np.float32, queue, ndim=1, inplace=False, r2c=True, keepShaderCode=1)

a = cla.to_device(queue, np.random.uniform(0, 1, n).astype(np.float32))
# Out-of-place transform into the n//2+1 complex Hermitian coefficients
app.fft(a, cla.empty(queue, n // 2 + 1, dtype=np.complex64))

This seems to fail on all Nvidia cards and also the AMD one, but the Apple M1 runs fine.

DTolm commented 1 year ago

Oh, my bad - the check was a little bit faulty. I have hopefully fixed it now.

vincefn commented 1 year ago

This looks rather good (some tests are still ongoing), now also with an Apple M1:

http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-06-a100cl/pyvkfft-test.html http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-06-a100cu/pyvkfft-test.html http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-06-a40cl/pyvkfft-test.html http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-06-a40cu/pyvkfft-test.html http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-06-v100cl/pyvkfft-test.html http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-06-v100cu/pyvkfft-test.html http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-06-gtx1080ti_cu/pyvkfft-test.html http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-06-amdgfx900/pyvkfft-test.html http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-06b-apple-m1/pyvkfft-test.html

Let me know when you can merge and tag this for a new release - I'd like to make a new pyvkfft release soon.

I am still not sure how to approach the new autotuning options directly from C, but this can be done after the release.

I tend to think this kind of functionality is best handled in a packaging layer - like pyvkfft - clearly separating the features. I would not want to use the auto-tuning in a C API, if only because I need to keep control of the allocations going on.

DTolm commented 1 year ago

this kind of functionality is best handled in a packaging layer

There are many people who use the C API directly, and what I was trying to say is that they would also like access to improved performance through kernel tuning. I am not yet sure how to provide them with such functionality (also because of the reasons you mentioned, like control of the allocations).

As for the release, I will do limited tests with all the APIs and then make a release in a week if all goes smoothly.

vincefn commented 1 year ago

It seems I found a very small corner case that doesn't work: on the AMD gfx900, the float64 DCT4 fails for 2D transforms of sizes (255, 255) and (799, 799), as well as the 3D size (255, 255, 255)...

See http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-06-amdgfx900/pyvkfft-test.html

That's really a corner case - I don't attach a lot of importance to DCTs, so it's not a priority.

Incidentally, I'm having a lot of trouble finishing the test suite on the gfx900 card. Once the test suite reaches the non-radix tests, the card never runs for more than 2-3 hours before the GPU crashes completely (rocm-smi crashes) and a reboot is needed... Maybe an issue with the driver version.

DTolm commented 1 year ago

@vincefn I couldn't reproduce the DCT errors with the same setup I used for the R2C errors. May I ask you once again to send the generated code, so I can verify that we run the same program?

As for the crashes, I also have no idea what could have happened. OpenCL has not been a priority for vendors for quite a while, so there is a non-zero chance of driver issues appearing.

vincefn commented 1 year ago

Attached is the code for the 255x255 DCT4 float64. Interestingly, the actual calculation seems to hang... I haven't tried the others.

dct4_255_255.zip