icl-utk-edu / heffte

BSD 3-Clause "New" or "Revised" License

Adding CI scripts #45

Closed G-Ragghianti closed 2 weeks ago

mkstoyanov commented 5 months ago

Thanks for setting this up, let me know if you run into any issues.

G-Ragghianti commented 5 months ago

One question so far: the intel-mkl package in spack has been deprecated in favor of the oneapi version (intel-oneapi-mkl). We were previously using intel-mkl for the MKL heffte backend. Is it sufficient to use just the ONEAPI backend and deprecate the MKL backend, or do they serve different purposes?

mkstoyanov commented 5 months ago

> One question so far: the intel-mkl package in spack has been deprecated in favor of the oneapi version (intel-oneapi-mkl). We were previously using intel-mkl for the MKL heffte backend. Is it sufficient to use just the ONEAPI backend and deprecate the MKL backend, or do they serve different purposes?

The MKL backend is CPU only and doesn't require the Intel SYCL compiler. This is needed for many platforms regardless of the spack deprecation.

However, we don't need separate tests for all possible backend combinations (simply not practical). It is OK to just test MKL + oneAPI in spack, especially since the stand-alone MKL package has been deprecated. We can even deprecate the standalone +mkl variant in spack (but not in C++).
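For reference, the two backends are chosen at configure time. A sketch of the distinction, assuming heFFTe's usual `Heffte_ENABLE_*` CMake options:

```shell
# CPU-only MKL backend: uses MKL's FFT routines, no SYCL compiler needed
cmake -DHeffte_ENABLE_MKL=ON ..

# oneAPI backend: GPU-capable, requires the Intel SYCL compiler (icpx)
cmake -DCMAKE_CXX_COMPILER=icpx -DHeffte_ENABLE_ONEAPI=ON ..
```

This is why the MKL backend survives the spack deprecation: it serves platforms where the SYCL toolchain is unavailable.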

G-Ragghianti commented 5 months ago

Maybe the +mkl variant in the spack package should be changed to use intel-oneapi-mkl instead of intel-mkl?

mkstoyanov commented 5 months ago

> Maybe the +mkl variant in the spack package should be changed to use intel-oneapi-mkl instead of intel-mkl?

Another possibility; I haven't tested it, but it should work.
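A hypothetical sketch of what that change could look like in heFFTe's spack `package.py` (the variant and dependency names here are assumed, not taken from the actual recipe):

```python
# Hypothetical excerpt from heffte's package.py: keep the existing
# +mkl variant but point it at the oneAPI MKL distribution instead
# of the deprecated intel-mkl package.
variant("mkl", default=False, description="Use the MKL CPU backend")

# was: depends_on("intel-mkl", when="+mkl")
depends_on("intel-oneapi-mkl", when="+mkl")
```

Since intel-oneapi-mkl provides the same FFT interfaces, the C++ side of the MKL backend should not need any changes.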

G-Ragghianti commented 5 months ago

I have most of the checks passing now. Do you think there is a good chance that we can get the two failing checks to pass, or would you be OK with implementing the CI the way it is now?

mkstoyanov commented 5 months ago

Generally speaking, so long as most of the tests are working, the CI is good.

The Intel failure looks like a GPU problem; it is complaining about missing fp64 capabilities. Is it detecting the correct device?

The ROCm failure is also puzzling: it fails to launch a basic scaling kernel, which does just that, scales a vector by a fixed number.

Both of those could be an issue with the environment selecting the correct device and/or permissions. But this is just a guess; I'm not familiar with the environment.

G-Ragghianti commented 5 months ago

For the intel gpu, I think it is actually missing the fp64 capability. The GPU is an Arc770 (consumer card). We had to disable the double precision tests in slate for this. Is that a possibility here?

mkstoyanov commented 5 months ago

> For the intel gpu, I think it is actually missing the fp64 capability. The GPU is an Arc770 (consumer card). We had to disable the double precision tests in slate for this. Is that a possibility here?

Unfortunately, heFFTe doesn't have that option at the moment. I could add it in time, but it would be a significant intrusion into the code. It's probably best to either disable the test for now, or not use the GPU and just run SYCL on the CPU (fallback mode). That still tests the GPU kernels, and it works with all precisions.
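The CPU fallback can typically be forced through the SYCL runtime's device-selection environment variables; which one applies depends on the oneAPI version installed on the runner (check `sycl-ls` to confirm the device names):

```shell
# Force the SYCL runtime onto the CPU fallback device
export ONEAPI_DEVICE_SELECTOR=opencl:cpu   # newer oneAPI runtimes
# export SYCL_DEVICE_FILTER=cpu            # older oneAPI runtimes
```

Set in the CI job environment, this would let the oneAPI tests run in full double precision without touching the heFFTe code.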

G-Ragghianti commented 5 months ago

Just FYI: I'm keeping this PR as a draft until we finalize the slurm queue policies at our site. The new CI system runs the jobs through our queues, and I need to make sure the CI jobs don't interfere with normal users.

mkstoyanov commented 5 months ago

> Just FYI: I'm keeping this PR as a draft until we finalize the slurm queue policies at our site. The new CI system runs the jobs through our queues, and I need to make sure the CI jobs don't interfere with normal users.

No worries, I won't merge anything until you say it's ready (at which point you can merge it yourself).

ax3l commented 3 weeks ago

Hi, this is a great PR to add CI - would you like to update it and maybe merge it partially already? :)

(Our interest: We want to start depending on heFFTe in WarpX & ImpactX mainline, and having CI in a dependency is one of our quality criteria :))

mkstoyanov commented 3 weeks ago

The PR is @G-Ragghianti's work as it is connected to the machines at IC. Given the lack of post-ECP funding, I don't know what the status is here or if this is even feasible at the moment.

What I am doing now is running a series of automated scripts through docker, the same as a CI would, except they are not automatically triggered from GitHub. Automatic triggering has always been a challenge due to security and authentication.

The free GitHub runners are a bit deficient: no GPU support of any kind, and heFFTe needs at least 12 MPI ranks to get a non-symmetric data layout of boxes. I can try to set up one or two free GitHub tests; would that be sufficient, or do I need to find some alternatives?
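To illustrate why 12 ranks matter: heFFTe decomposes the domain over a processor grid, and a square rank count like 4 produces a symmetric grid where every box has the same shape, which can mask bugs in the reshape logic. A minimal sketch (the helper below is illustrative, not heFFTe's actual grid-selection code):

```python
def balanced_grid(nranks):
    """Split nranks into a 2D processor grid (rows, cols), rows <= cols,
    choosing the factor pair closest to a square."""
    best = (1, nranks)
    for rows in range(1, int(nranks ** 0.5) + 1):
        if nranks % rows == 0:
            best = (rows, nranks // rows)
    return best

# 4 ranks -> (2, 2): a square grid, all sub-boxes identical (symmetric)
print(balanced_grid(4))
# 12 ranks -> (3, 4): a rectangular grid, so boxes differ in shape along
# the two grid directions, exercising the non-symmetric code paths
print(balanced_grid(12))
```

With fewer than 12 ranks the closest-to-square factorizations tend to be square or degenerate (1×N), so the asymmetric reshape paths never run.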

G-Ragghianti commented 3 weeks ago

We intend to run these on our site-hosted runners. They are currently running as dedicated runners, but this PR was created specifically to use an on-demand runner system that we have yet to fully deploy. I can convert this PR to use the dedicated runners that we currently have. We intend to support this despite the end of ECP funding.

mkstoyanov commented 3 weeks ago

@G-Ragghianti it would be of great help to support this, even if the tests don't span the entire range of supported GPUs and build configurations. I don't know if @ax3l has a time-frame in mind.

G-Ragghianti commented 3 weeks ago

I've updated the PR to use the self-hosted runners that we currently have. There is still a small problem with the FFTW module not being compatible with our OpenMPI module. I will fix this, and then it can be merged.

ax3l commented 2 weeks ago

Yay! All green :)