lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Add `get_visible_devices_env()` #1356

Closed by hummingtree 1 year ago

hummingtree commented 1 year ago

Fix #1354.

hummingtree commented 1 year ago

@bjoo, @kostrzewa, @jcosborn can you please check if this is applicable to the different targets. I have added the function for CUDA ...

jxy commented 1 year ago

What's the main purpose of device_order_env here? Is it only used to get device_list.str(), in order to set topology_string? If so, perhaps it is more generic to have a std::string device::get_device_list_string(void)?

I'm suggesting this because the simple parser here for comma-separated integers may be too restrictive.
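A minimal sketch of what such an accessor might look like (a hypothetical implementation, not the code in this PR): return the raw environment value as an opaque string rather than a parsed list of integers, so that target-specific formats pass through unchanged. The choice of environment variable here is only an example.

```cpp
#include <cstdlib>
#include <string>

namespace device
{
  // Hypothetical sketch: hand back the device-visibility environment value as
  // an opaque string instead of parsing it, so formats other than plain
  // comma-separated integers (e.g. device.subdevice selectors) survive.
  std::string get_device_list_string(void)
  {
    const char *env = std::getenv("CUDA_VISIBLE_DEVICES"); // target-specific variable
    return env ? std::string(env) : std::string();
  }
}
```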

hummingtree commented 1 year ago

> What's the main purpose of device_order_env here? Is it only used to get device_list.str(), in order to set topology_string? If so, perhaps it is more generic to have a std::string device::get_device_list_string(void)?
>
> I'm suggesting this because the simple parser here for comma-separated integers may be too restrictive.

@jxy Do we know if we have a target that is not using the comma separated list?

jcosborn commented 1 year ago

The main issue is that devices may not always be specified by a single integer, for example when there are devices and subdevices. Returning the final string as @jxy suggested seems to be the simplest general solution.

jxy commented 1 year ago

There can be more than one environment variable. Devices can have subdevices. References from Intel:

- Level Zero: https://spec.oneapi.io/level-zero/latest/core/PROG.html#environment-variables
- SYCL: https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md
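For illustration only, here is the kind of value (assumed from the linked Intel documents, not taken from this PR) that a strict comma-separated-integer parser would reject, sketched as C++ string constants:

```cpp
// Illustration only: assumed example visibility strings in the style of the
// Intel documents linked above. They can address subdevices or carry a
// backend prefix, so a parser that only accepts "0,1,2" would reject them,
// whereas passing the raw string through preserves the information.
const char *example_level_zero_mask = "0.0,0.1";     // device 0, subdevices 0 and 1
const char *example_sycl_selector   = "level_zero:0"; // backend-qualified device selector
```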

hummingtree commented 1 year ago

@jxy @jcosborn Please take a look at the last commit :)

jxy commented 1 year ago

If the code always broadcasts the string from rank 0, it would be simpler and less error-prone to remove the rank argument from the function and call the function on rank 0 in the main code.
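A generic sketch of that pattern (plain MPI, not QUDA's own communication wrappers): build the string on rank 0 only, then broadcast its length and contents so every rank ends up with the same value and the builder never needs a rank argument.

```cpp
#include <mpi.h>
#include <algorithm>
#include <string>
#include <vector>

// Sketch only: broadcast a string assembled on rank 0 to all ranks.
std::string broadcast_string_from_rank0(const std::string &on_rank0, MPI_Comm comm)
{
  int rank;
  MPI_Comm_rank(comm, &rank);

  // Broadcast the length first so non-root ranks can size their buffers.
  int len = (rank == 0) ? static_cast<int>(on_rank0.size()) : 0;
  MPI_Bcast(&len, 1, MPI_INT, 0, comm);

  std::vector<char> buf(len);
  if (rank == 0) std::copy(on_rank0.begin(), on_rank0.end(), buf.begin());
  MPI_Bcast(buf.data(), len, MPI_CHAR, 0, comm);

  return std::string(buf.begin(), buf.end());
}
```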

hummingtree commented 1 year ago

> If the code always broadcasts the string from rank 0, it would be simpler and less error-prone to remove the rank argument from the function and call the function on rank 0 in the main code.

Good point. I have implemented that in https://github.com/lattice/quda/pull/1356/commits/0d228c43b1aade48181cd0e0f83ddeee5b3483fa.

james-simone commented 1 year ago

Is there still a message needing CUDA_VISIBLE_DEVICE ==> CUDA_VISIBLE_DEVICES? errorQuda("Invalid CUDA_VISIBLE_DEVICE ordinal %d"

hummingtree commented 1 year ago

> Is there still a message needing CUDA_VISIBLE_DEVICE ==> CUDA_VISIBLE_DEVICES? errorQuda("Invalid CUDA_VISIBLE_DEVICE ordinal %d"

It should be fixed with the last commit on this branch, as there should be no remaining occurrences of CUDA_VISIBLE_DEVICE.

dmcdougall commented 1 year ago

I have a few comments and questions.

ROCR_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES are two different methods of controlling device visibility, and they control visibility at different software layers. MPI does not target the hip runtime; it targets the ROCr runtime. The hip runtime inherits device visibility (and ordering) from the ROCr runtime. Some vendor OpenMP runtimes target the hip runtime, and others target only the ROCr runtime. So the situation is unfortunately a little complicated.

Why does QUDA need to parse any of these variables? Applications usually target the hip runtime. Is there a reason that QUDA can't simply query device visibility by probing hipGetDeviceCount? I suppose you can't query GPU-CPU topology with these, so perhaps that's the motivation behind the need to parse HIP_VISIBLE_DEVICES? Am I understanding this correctly? If so, I see no logic here to understand which CPUs are closest to the GPUs provided by the user in the list of visible devices. Is the expectation that the user does this implicitly by specifying the order of GPU devices in HIP_VISIBLE_DEVICES "correctly"?

If slurm is configured with GPU GRES features, using any of them to control visibility and placement (at least on Frontier, Crusher, Spock, and similarly configured systems) makes slurm set the ROCR_VISIBLE_DEVICES variable. If you want to parse HIP_VISIBLE_DEVICES, bear in mind you may have to spend some extra effort educating your users on how to launch QUDA jobs with this feature, and this burden is placed on all clients of QUDA as well, such as Chroma and MILC. You might consider consuming ROCR_VISIBLE_DEVICES instead, and then the user has to "do less" when they launch with slurm and use any of the GPU GRES binding options. This is probably what I would do.

I'm not a slurm expert, but I believe it's also possible to configure slurm GPU GRES features to be driven by cgroups instead. In this case, it's possible that ROCR_VISIBLE_DEVICES isn't set at all, and processes are simply put in a cgroup with the "correct" device visible. If there are slurm experts here and any of this is incorrect, please correct me. What do you expect QUDA to do in this case?

What happens if HIP_VISIBLE_DEVICES is set to the empty string?

maddyscientist commented 1 year ago

@dmcdougall to answer your question, we don't do anything internally in QUDA with CUDA_VISIBLE_DEVICES, and with this PR similarly for HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES; we simply encode the value of these environment variables into the tuneKey of the stencil communication policies. We do so because the optimum communication strategy can vary as the logical ordering of the GPUs changes, and this ensures we will retune if these values are changed.

@hummingtree I think the conclusion is that we should probably check whether both of these variables are defined. It sounds like ROCR_VISIBLE_DEVICES is the more common one, so we should check for that first, and if that's not defined, check HIP_VISIBLE_DEVICES.
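A minimal sketch of that fallback (hypothetical helper name, not the exact code in this PR): prefer ROCR_VISIBLE_DEVICES when it is set, and only otherwise fall back to HIP_VISIBLE_DEVICES.

```cpp
#include <cstdlib>
#include <string>

// Sketch only: return the visibility string that slurm or the user most likely
// set. ROCR_VISIBLE_DEVICES is checked first, then HIP_VISIBLE_DEVICES.
std::string get_hip_visible_devices_env()
{
  if (const char *rocr = std::getenv("ROCR_VISIBLE_DEVICES")) return rocr;
  if (const char *hip = std::getenv("HIP_VISIBLE_DEVICES")) return hip;
  return ""; // neither variable is defined
}
```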

dmcdougall commented 1 year ago

@maddyscientist I don't understand. Is this information that is saved to disk outside of a full run? The idea is that you do this performance tuning work prior to a big run, and the big run then subsequently reads the stored performance tuning?

One of my points was that the topology can change and it won't be detectable from consuming (CUDA|HIP)_VISIBLE_DEVICES. Maybe this is a rare situation and therefore it isn't a big concern.

maddyscientist commented 1 year ago

> @maddyscientist I don't understand. Is this information that is saved to disk outside of a full run? The idea is that you do this performance tuning work prior to a big run, and the big run then subsequently reads the stored performance tuning?
>
> One of my points was that the topology can change and it won't be detectable from consuming (CUDA|HIP)_VISIBLE_DEVICES. Maybe this is a rare situation and therefore it isn't a big concern.

We encode the information in the string of dslash (stencil) policies in tunecache.tsv. This ensures that if the ordering changes, e.g., the user is experimenting with the optimal ordering for their application, we rerun the autotuning and generate a new policy for that config.

While this might not matter on systems that are fully p2p connected, it can make a big difference on systems that have a large asymmetry in their intra-node GPU p2p bandwidth. For example, on a dual-socket node with 2 GPUs per socket, using only PCIe, the GPUs on the same socket will likely have more PCIe bandwidth between them than pairs of GPUs on different sockets. How we order the GPUs with respect to the logical LQCD topology can make a big difference to the optimal communication approach taken. So it's best to encode this (re)ordering into the tune key if it is done through changing this environment variable.
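An illustration of the mechanism being described (hypothetical names, not QUDA's actual tuneKey layout): folding the visible-devices string into the policy's tuning key means a changed ordering produces a different key, which forces a retune instead of silently reusing a stale policy from tunecache.tsv.

```cpp
#include <string>

// Sketch only: append the device-visibility string to the auxiliary part of a
// tuning key, so entries recorded under one GPU ordering are not reused when
// the ordering (and hence the optimal comms policy) changes.
std::string make_policy_aux_string(const std::string &base_aux, const std::string &visible_devices)
{
  return base_aux + ",vis=" + (visible_devices.empty() ? "default" : visible_devices);
}
```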

dmcdougall commented 1 year ago

Ok, that makes sense. Thanks.

My remaining comment that you may miss a topology change still stands. It's up to you whether or not this is something important enough to worry about. A more robust approach may be to probe which devices each process sees with (cuda|hip)GetDeviceCount, and then probe which physical devices those are with (cuda|hip)DeviceGetPCIBusId.
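A sketch of that probing approach on the HIP side (error handling mostly elided): enumerate the devices this process can see and record their PCI bus IDs, which identifies the physical devices independently of how the visibility variables were spelled.

```cpp
#include <hip/hip_runtime.h>
#include <string>
#include <vector>

// Sketch only: discover which physical devices this process can see by
// querying the runtime rather than parsing environment variables.
std::vector<std::string> visible_device_bus_ids()
{
  int count = 0;
  (void)hipGetDeviceCount(&count);

  std::vector<std::string> bus_ids;
  for (int dev = 0; dev < count; dev++) {
    char bus_id[32] = {};
    if (hipDeviceGetPCIBusId(bus_id, (int)sizeof(bus_id), dev) == hipSuccess) bus_ids.push_back(bus_id);
  }
  return bus_ids;
}
```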

maddyscientist commented 1 year ago

> My remaining comment that you may miss a topology change still stands. It's up to you whether or not this is something important enough to worry about. A more robust approach may be to probe which devices each process sees with (cuda|hip)GetDeviceCount, and then probe which physical devices those are with (cuda|hip)DeviceGetPCIBusId.

For sure it can be made more robust to cover every use case. We do what we do because these are the types of things that capture our typical experiments. If we need to extend / expand as you suggest, this would be possible.

dmcdougall commented 1 year ago

> We do what we do because these are the types of things that capture our typical experiments.

Makes sense.

> If we need to extend / expand as you suggest, this would be possible.

Sounds good.

> I think the conclusion is that we should probably check whether both of these variables are defined. It sounds like ROCR_VISIBLE_DEVICES is the more common one, so we should check for that first, and if that's not defined, check HIP_VISIBLE_DEVICES.

Agreed.

hummingtree commented 1 year ago

> @dmcdougall to answer your question, we don't do anything internally in QUDA with CUDA_VISIBLE_DEVICES, and with this PR similarly for HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES; we simply encode the value of these environment variables into the tuneKey of the stencil communication policies. We do so because the optimum communication strategy can vary as the logical ordering of the GPUs changes, and this ensures we will retune if these values are changed.
>
> @hummingtree I think the conclusion is that we should probably check whether both of these variables are defined. It sounds like ROCR_VISIBLE_DEVICES is the more common one, so we should check for that first, and if that's not defined, check HIP_VISIBLE_DEVICES.

Done.