lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

`comm_init_common` (optionally) explicitly relies on the existence of `CUDA_VISIBLE_DEVICES`, ignores `HIP_VISIBLE_DEVICES` #1354

Closed kostrzewa closed 1 year ago

kostrzewa commented 1 year ago

While trying to debug a comms problem that we have using the HIP backend in tmLQCD+QUDA, I noticed that the topology string explicitly makes use of CUDA_VISIBLE_DEVICES (if defined), while the corresponding HIP environment variable is ignored. It's inconsequential for the case that I'm debugging, but I thought I'd post it here as an issue nonetheless.

https://github.com/lattice/quda/blob/7d88b615989418ef8664ab2633796f5d9fc5d5e8/include/communicator_quda.h#L562-L587

hummingtree commented 1 year ago

I think we need a target specific function to get the visible devices.
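A minimal sketch of what such a target-specific lookup might look like, assuming a hypothetical `Target` enum and function names that are not part of QUDA's actual API. It only illustrates choosing the environment variable by backend instead of hard-coding `CUDA_VISIBLE_DEVICES`:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical backend tag, for illustration only; QUDA's real target
// abstraction is not shown in this thread.
enum class Target { CUDA, HIP };

// Name of the environment variable that restricts visible devices for a backend.
inline const char *visible_devices_var(Target target)
{
  switch (target) {
  case Target::CUDA: return "CUDA_VISIBLE_DEVICES";
  case Target::HIP: return "HIP_VISIBLE_DEVICES";
  }
  return "";
}

// Value of that variable, or an empty string if it is not set.
inline std::string visible_devices(Target target)
{
  const char *value = std::getenv(visible_devices_var(target));
  return value ? std::string(value) : std::string();
}
```

With something along these lines, the code that builds the topology string could call `visible_devices(...)` for whichever backend the library was built for, rather than reading the CUDA variable directly.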

maddyscientist commented 1 year ago

Seems like an easy add. @kostrzewa fancy having a go at adding this target abstracted functionality?

Tagging @bjoo and @kostrzewa for visibility. @jcosborn is there any equivalent functionality for SYCL?

maddyscientist commented 1 year ago

(@kostrzewa: though I see now that @hummingtree just volunteered himself, so feel free to ignore my request...)

bjoo commented 1 year ago

Hi all, since I never bound devices via those variables explicitly and just used the regular init, I have not come across this yet. One potential issue is that there are two such variables, HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES. It would be worth implementing a fix. I may want to ask our red-team friends for advice… Best, B
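One possible way to cope with having two variables, sketched here under the assumption that the goal is only to tell apart processes whose visibility settings differ: fold every variable that is set into a single key. The helper name `visibility_key` is hypothetical, and how the two variables actually interact is a question for the ROCm documentation, not something this sketch settles.

```cpp
#include <cstdlib>
#include <initializer_list>
#include <string>

// Hypothetical helper: appends "NAME=value;" for every listed variable that is
// set, so two keys compare equal only if all of those variables agree.
inline std::string visibility_key(std::initializer_list<const char *> names)
{
  std::string key;
  for (const char *name : names) {
    if (const char *value = std::getenv(name)) {
      key += name;
      key += '=';
      key += value;
      key += ';';
    }
  }
  return key;
}
```

For the HIP case this would be called as `visibility_key({"HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES"})`; unset variables contribute nothing to the key.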

-- Balint Joo, Oak Ridge Leadership Computing Facility, Oak Ridge National Laboratory P.O. Box 2008, 1 Bethel Valley Road, Oak Ridge, TN 37831, USA email: joob AT ornl.gov. Tel: +1-757-912-0566 (cell, remote)
