lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

current develop does not compile with ROCm 5.2.3 (LUMI-G default) #1432

Closed by kostrzewa 4 months ago

kostrzewa commented 5 months ago

It seems that the preparations for ROCm 6 have broken compilation with our current production stack based on ROCm 5.2.3 on LUMI-G (at least for me). Note that 5.2.3 is the default on the machine and the only "officially supported" version as far as I can tell.

https://github.com/lattice/quda/blob/273d4fe8dca06fbc52b209ab7ee27bdf83d6c4bd/lib/targets/hip/malloc.cpp#L531

/users/bakostrz/code/quda-develop-273d4fe/lib/targets/hip/malloc.cpp:531:18: error: no member named 'type' in 'hipPointerAttribute_t'
    switch (attr.type) {
            ~~~~ ^
/users/bakostrz/code/quda-develop-273d4fe/lib/targets/hip/malloc.cpp:539:57: error: no member named 'type' in 'hipPointerAttribute_t'
    default: errorQuda("Unknown memory type %d\n", attr.type); return QUDA_INVALID_FIELD_LOCATION;
                                                   ~~~~ ^
/users/bakostrz/code/quda-develop-273d4fe/lib/../include/util_quda.h:76:30: note: expanded from macro 'errorQuda'
    fprintf(getOutputFile(), __VA_ARGS__);                                                                             \
                             ^~~~~~~~~~~
/users/bakostrz/code/quda-develop-273d4fe/lib/targets/hip/malloc.cpp:539:57: error: no member named 'type' in 'hipPointerAttribute_t'
    default: errorQuda("Unknown memory type %d\n", attr.type); return QUDA_INVALID_FIELD_LOCATION;
                                                   ~~~~ ^
/users/bakostrz/code/quda-develop-273d4fe/lib/../include/util_quda.h:77:74: note: expanded from macro 'errorQuda'
    errorQuda_(__PRETTY_FUNCTION__, quda::file_name(__FILE__), __LINE__, __VA_ARGS__);                                 \
                                                                         ^~~~~~~~~~~
3 errors generated when compiling for gfx90a.
make[2]: *** [lib/CMakeFiles/quda_cpp.dir/build.make:1070: lib/CMakeFiles/quda_cpp.dir/targets/hip/malloc.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:1039: lib/CMakeFiles/quda_cpp.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

While other ROCm versions are available on the machine, they are all marked "experimental".

It is quite unfortunate that there has been a rename between 5.2.x and 5.y.x of the hipMemoryType member of hipPointerAttribute_t from memoryType to type. There seem to be a couple of intermediate versions which have both in the form of a union.
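A minimal sketch of one way to tolerate both field names without hard-coding a ROCm version check. It uses mock structs, so it compiles without the HIP headers; `MockMemoryType`, `MockAttrOld`, `MockAttrNew`, `has_type_member` and `memory_type_of` are hypothetical names for illustration and are not part of HIP or QUDA. It is not necessarily how QUDA handles the rename.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical stand-ins for hipMemoryType and the two layouts of
// hipPointerAttribute_t; the real definitions come from the HIP headers.
enum class MockMemoryType { Host, Device };
struct MockAttrOld { MockMemoryType memoryType; }; // field name seen with ROCm 5.2.x
struct MockAttrNew { MockMemoryType type; };       // field name after the rename

// Detection idiom: true if T has a data member named `type`.
template <typename T, typename = void>
struct has_type_member : std::false_type {};
template <typename T>
struct has_type_member<T, std::void_t<decltype(std::declval<T &>().type)>> : std::true_type {};

// Return the memory type regardless of which field name the struct provides.
template <typename Attr>
MockMemoryType memory_type_of(const Attr &attr)
{
  if constexpr (has_type_member<Attr>::value) {
    return attr.type;       // new name (also chosen when both names exist)
  } else {
    return attr.memoryType; // old name
  }
}

// Compile-time checks of the detection itself.
static_assert(!has_type_member<MockAttrOld>::value, "old layout has no `type`");
static_assert(has_type_member<MockAttrNew>::value, "new layout has `type`");

// Small runtime demonstration using the mock layouts.
inline void demo_field_detection()
{
  assert(memory_type_of(MockAttrOld{MockMemoryType::Device}) == MockMemoryType::Device);
  assert(memory_type_of(MockAttrNew{MockMemoryType::Host}) == MockMemoryType::Host);
}
```

In real code the template argument would be `hipPointerAttribute_t` itself, so the branch is selected by whatever the installed headers declare. This requires C++17 for `if constexpr` and `std::void_t`.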

@dmcdougall do you think it might be possible to support both 5.2.x and later versions at least for a while?

stevengottlieb commented 5 months ago

I ran into the same problem on Crusher. I hope this will be fixed soon.

Steve

kostrzewa commented 5 months ago

@stevengottlieb I've found that, at least on LUMI-G, the ROCm-5.6.1-based stack (which the LUMI admins provide in addition to the official software stack by HPE) seems to work as long as one disables P2P. Luckily GPU-aware MPI works at least, which is what matters most for QUDA-HIP.

I don't see anything in the Crusher docs to indicate that HPE/ORNL provide anything beyond ROCm 5.4.0 on Crusher, but I don't have access to the machine, so I can't check whether there is an unadvertised module somewhere.

kostrzewa commented 5 months ago

Addendum to the comment above: this is with cray-mpich/8.1.27.

kostrzewa commented 4 months ago

@dmcdougall Any news on getting rocm 5.2.3 support back into QUDA? Between the changes here and the delays on the HPE side (I guess) in making a newer official software stack available we are stuck without an offloaded fermion force on LUMI-G.

kostrzewa commented 4 months ago

> I've found that, at least on LUMI-G, the ROCm-5.6.1-based stack (which the LUMI admins provide in addition to the official software stack by HPE) seems to work as long as one disables P2P. Luckily GPU-aware MPI works at least, which is what matters most for QUDA-HIP.

Unfortunately, there are so many issues with this unofficial ROCm 5.6.1 stack that it is unusable.

dmcdougall commented 4 months ago

I'm so sorry for the delayed response here. I didn't see this until the most recent ping. My sincere apologies.

ROCm 5.2.3 is extremely old. QUDA is an extremely difficult application for compilers to handle and AMD have addressed several internal compiler errors, codegen bugs, and double-free bugs in the KFD since 5.2.3. My suggestion here would be to raise a polite request to the LUMI-G system administrators to update to the latest ROCm stack. There are legitimate and very important software bug fixes that have happened since ROCm 5.2. This approach also helps all the other LUMI-G users, and not just the ones running into problems with QUDA.

If you're having issues with 5.6, I wonder if you're mixing a ROCm 5.6 userland with the ROCm 5.2 driver. This is not supported at all, and not guaranteed to work. You can typically be pretty successful with a userland version that is at most two versions (in either direction) against a given driver version.
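The rule of thumb above can be sketched as a small check. This assumes that "version" means a ROCm minor release and that releases are counted along the ordered list below, including across the 5.7 to 6.0 boundary; both are assumptions of this sketch, not statements from AMD. `kReleases`, `release_index` and `within_rule_of_thumb` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string_view>

// Ordered ROCm minor releases; an assumption for this sketch, extend as needed.
constexpr std::array<std::string_view, 9> kReleases
  = {"5.0", "5.1", "5.2", "5.3", "5.4", "5.5", "5.6", "5.7", "6.0"};

// Position of a release in kReleases, or -1 if it is not listed.
inline int release_index(std::string_view release)
{
  for (std::size_t i = 0; i < kReleases.size(); ++i)
    if (kReleases[i] == release) return static_cast<int>(i);
  return -1;
}

// True if userland and driver are at most `window` releases apart.
inline bool within_rule_of_thumb(std::string_view userland, std::string_view driver, int window = 2)
{
  const int u = release_index(userland);
  const int d = release_index(driver);
  if (u < 0 || d < 0) return false; // unknown release: make no claim of compatibility
  return std::abs(u - d) <= window;
}

// The case discussed here: a 5.6 userland on a 5.2 driver is four releases apart.
inline void demo_version_window()
{
  assert(!within_rule_of_thumb("5.6", "5.2"));
  assert(within_rule_of_thumb("5.4", "5.2"));
}
```

A result of `true` only means the pair falls inside the informal window; it is not a guarantee that the combination is supported.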

kostrzewa commented 4 months ago

Thanks a lot for getting back to me on this, it is much appreciated.

> My suggestion here would be to raise a polite request to the LUMI-G system administrators to update to the latest ROCm stack. There are legitimate and very important software bug fixes that have happened since ROCm 5.2. This approach also helps all the other LUMI-G users, and not just the ones running into problems with QUDA.

We've been trying this for at least half a year now. This is also not just a problem on LUMI-G but also on Crusher AFAIK, see the list of rocm versions available there: https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html#determining-the-compatibility-of-cray-mpich-and-rocm

> If you're having issues with 5.6, I wonder if you're mixing a ROCm 5.6 userland with the ROCm 5.2 driver. This is not supported at all

Of course that's exactly what's happening on LUMI-G: the only version of rocm officially supported by HPE on the machine is 5.2.3 and that's also the driver version. The LUMI admins have provided a frankenversion of 5.6.1 in the hope of helping users who have encountered issues with older versions but they do state explicitly that this is not officially supported.

So for us rocm 5.2.3 is the only version which actually works on LUMI-G and it's very unfortunate that the current QUDA develop head commit does not contain a workaround to still work with that.

dmcdougall commented 4 months ago

I think we're talking about two different things.

The situation on Crusher is different. Crusher is running the latest GPU driver:

[damon@crusher029.crusher ~]$ rocm-smi --showdriverversion

============================ ROCm System Management Interface ============================
============================== Version of System Component ===============================
Driver version: 6.3.6
==========================================================================================
================================== End of ROCm SMI Log ===================================

This driver version is from ROCm 6.0.

Crusher also has deployments of ROCms up to 6.0.0:

[damon@login2.crusher ~]$ ml avail rocm

----------------------------------------------------------------------------------------------------------------------------- /sw/crusher/modulefiles -----------------------------------------------------------------------------------------------------------------------------
   papi/7.0.1.0_rocm5.3    rocm/4.2.0    rocm/4.3.0    rocm/4.5.0    rocm/4.5.2    rocm/5.0.0    rocm/5.0.2    rocm/5.1.0    rocm/5.2.0    rocm/5.3.0 (D)    rocm/5.4.0    rocm/5.4.3    rocm/5.5.1    rocm/5.6.0    rocm/5.7.0    rocm/5.7.1    rocm/6.0.0

There are older versions there, too. But the latest stack is available, and it is compatible with the CPE that is deployed on Crusher.

The page you're referring to hasn't been updated in over a year. The information on that page was (and still is) correct, but there are newer versions available with associated CPE compatibility requirements that are not listed on that page. I can let the OLCF folks know about this page and help them update it. Thanks for bringing that page to my attention.

It's been a while since I've built QUDA on Crusher and Frontier, but I've worked with Balint Joo to address both correctness and performance-related issues with QUDA in 5.5 and 5.6, so if my memory serves me correctly, QUDA did successfully build with 5.6. There was also work I did to prepare QUDA (#1415 and #1418) for the ROCm 6 release, which contained some breaking changes (namely a header-file re-org and the removal of the memoryType member from the hipPointerAttribute_t type). The header-file re-org breaking changes were documented in ROCm 5.5, and users were warned at compile time whenever they pointed to the old header file locations until ROCm 6.0, when the old header locations were removed. Additionally, there was a period of three ROCm releases where users were given the union to prepare for the upcoming breaking change. The union existed in ROCm 5.5, 5.6, and 5.7. This was documented in ROCm 5.6. The union was removed in 6.0, breaking compatibility in a major release.

Of course, users that were on versions older than ROCm 5.5 never saw the header file warnings, and never had the opportunity to gracefully move to the new type field name in those interim ROCm releases.
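The timeline described above can be written down as a small lookup. This sketch only encodes what is stated in this thread (old name before 5.5, both names via the union in 5.5 to 5.7, new name only from 6.0); `FieldNames` and `field_names_for` are hypothetical names and not part of HIP.

```cpp
#include <cassert>

// Which field names hipPointerAttribute_t offers, per the timeline in this thread.
struct FieldNames {
  bool memoryType; // old name
  bool type;       // new name
};

constexpr FieldNames field_names_for(int major, int minor)
{
  const bool before_union = major < 5 || (major == 5 && minor < 5);
  const bool after_union = major >= 6;
  // The old name exists until the union is removed;
  // the new name exists once the union is introduced.
  return FieldNames{!after_union, !before_union};
}

static_assert(field_names_for(5, 2).memoryType && !field_names_for(5, 2).type, "5.2: old name only");
static_assert(field_names_for(5, 6).memoryType && field_names_for(5, 6).type, "5.6: both names");
static_assert(!field_names_for(6, 0).memoryType && field_names_for(6, 0).type, "6.0: new name only");
```

A preprocessor guard keyed on the HIP version macros (`HIP_VERSION_MAJOR`, `HIP_VERSION_MINOR`) would follow the same table: use `memoryType` before 5.5 and `type` from 5.5 onwards.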

The situation on LUMI-G is different because both the driver and the userland are almost two years old. These need to be updated. ROCms newer than 5.4 aren't guaranteed to work with a driver from 5.2. Updating the software stack is critical.

With all of this said, I have tried to address your concern in #1445. I wish you success with ROCm 5.2, but I will re-iterate that you are working with a compiler that is almost two years old, and there are bugs that QUDA triggered in the ROCm compiler that have been addressed since ROCm 5.2.

I'm sorry that I can't be more helpful.

kostrzewa commented 4 months ago

> The situation on Crusher is different. Crusher is running the latest GPU driver:

I'm happy to hear that. I was just referring to the page because @stevengottlieb mentioned above that he was having similar problems. I guess these are then really different issues.

> The situation on LUMI-G is different because both the driver and the userland are almost two years old. These need to be updated. ROCms newer than 5.4 aren't guaranteed to work with a driver from 5.2. Updating the software stack is critical.

I hope that the LUMI admins / HPE will get around to it.

> With all of this said, I have tried to address your concern in https://github.com/lattice/quda/pull/1445. I wish you success with ROCm 5.2, but I will re-iterate that you are working with a compiler that is almost two years old, and there are bugs that QUDA triggered in the ROCm compiler that have been addressed since ROCm 5.2.

Thanks a lot, I will test this out as soon as possible. I really hope the LUMI admins / HPE will upgrade the driver and software stack this year but in the meantime your workaround in #1445 should make us able to compile current QUDA versions.

It appears that we have not yet hit any of the issues with ROCm 5.2.3 that you describe, but I will keep your warning in mind and stress again with the admins how crucial an upgrade of the software stack would be.

kostrzewa commented 4 months ago

> There are older versions there, too. But the latest stack is available, and it is compatible with the CPE that is deployed on Crusher.

This is very valuable information, thanks. I'll try to discuss again with the LUMI admins.

kostrzewa commented 4 months ago

Thanks for #1445 !

dmcdougall commented 4 months ago

You're welcome.