ca-taylor opened this issue 2 years ago
Thanks for the report.
I can probably fix this by simply using the available device if CUDA_VISIBLE_DEVICES only has one device marked as available.
A little more info. When requesting --gpus-per-node=a100:8, we see the following in the logs and the job runs without issue.
```
remd_000.log:c1106a-s11.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_001.log:c1106a-s17.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_002.log:c1106a-s23.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_003.log:c1107a-s29.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_004.log:c1109a-s17.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_005.log:c1109a-s29.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_006.log:c1106a-s11.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_007.log:c1106a-s17.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_008.log:c1106a-s23.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_009.log:c1107a-s29.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_010.log:c1109a-s17.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_011.log:c1109a-s29.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_012.log:c1106a-s11.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_013.log:c1106a-s17.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_014.log:c1106a-s23.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_015.log:c1107a-s29.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_016.log:c1109a-s17.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_017.log:c1109a-s29.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_018.log:c1106a-s11.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_019.log:c1106a-s17.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_020.log:c1106a-s23.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_021.log:c1107a-s29.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_022.log:c1109a-s17.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_023.log:c1109a-s29.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_024.log:c1106a-s11.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_025.log:c1106a-s17.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_026.log:c1106a-s23.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_027.log:c1107a-s29.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_028.log:c1109a-s17.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_029.log:c1109a-s29.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_030.log:c1106a-s11.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_031.log:c1106a-s17.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_032.log:c1106a-s23.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_033.log:c1107a-s29.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_034.log:c1109a-s17.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_035.log:c1109a-s29.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_036.log:c1106a-s11.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_037.log:c1106a-s17.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_038.log:c1106a-s23.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_039.log:c1107a-s29.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_040.log:c1109a-s17.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_041.log:c1109a-s29.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_042.log:c1106a-s11.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_043.log:c1106a-s17.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_044.log:c1106a-s23.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_045.log:c1107a-s29.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_046.log:c1109a-s17.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
remd_047.log:c1109a-s29.ufhpc found cuda devices: 0,1,2,3,4,5,6,7
```
When requesting --gpus-per-task=1, the logs look like the following and we hit the runtime exception described previously.
```
remd_000.log:c0909a-s23.ufhpc found cuda devices: 0
remd_001.log:c0910a-s11.ufhpc found cuda devices: 0
remd_002.log:c1001a-s11.ufhpc found cuda devices: 0
remd_003.log:c1009a-s11.ufhpc found cuda devices: 0
remd_004.log:c1009a-s17.ufhpc found cuda devices: 0
remd_005.log:c1009a-s35.ufhpc found cuda devices: 0
remd_006.log:c0909a-s23.ufhpc found cuda devices: 1
remd_007.log:c0910a-s11.ufhpc found cuda devices: 1
remd_008.log:c1001a-s11.ufhpc found cuda devices: 1
remd_009.log:c1009a-s11.ufhpc found cuda devices: 1
remd_010.log:c1009a-s17.ufhpc found cuda devices: 1
remd_011.log:c1009a-s35.ufhpc found cuda devices: 1
remd_012.log:c0909a-s23.ufhpc found cuda devices: 2
remd_013.log:c0910a-s11.ufhpc found cuda devices: 2
remd_014.log:c1001a-s11.ufhpc found cuda devices: 2
remd_015.log:c1009a-s11.ufhpc found cuda devices: 2
remd_016.log:c1009a-s17.ufhpc found cuda devices: 2
remd_017.log:c1009a-s35.ufhpc found cuda devices: 2
remd_018.log:c0909a-s23.ufhpc found cuda devices: 3
remd_019.log:c0910a-s11.ufhpc found cuda devices: 3
remd_020.log:c1001a-s11.ufhpc found cuda devices: 3
remd_021.log:c1009a-s11.ufhpc found cuda devices: 3
remd_022.log:c1009a-s17.ufhpc found cuda devices: 3
remd_023.log:c1009a-s35.ufhpc found cuda devices: 3
remd_024.log:c0909a-s23.ufhpc found cuda devices: 4
remd_025.log:c0910a-s11.ufhpc found cuda devices: 4
remd_026.log:c1001a-s11.ufhpc found cuda devices: 4
remd_027.log:c1009a-s11.ufhpc found cuda devices: 4
remd_028.log:c1009a-s17.ufhpc found cuda devices: 4
remd_029.log:c1009a-s35.ufhpc found cuda devices: 4
remd_030.log:c0909a-s23.ufhpc found cuda devices: 5
remd_031.log:c0910a-s11.ufhpc found cuda devices: 5
remd_032.log:c1001a-s11.ufhpc found cuda devices: 5
remd_033.log:c1009a-s11.ufhpc found cuda devices: 5
remd_034.log:c1009a-s17.ufhpc found cuda devices: 5
remd_035.log:c1009a-s35.ufhpc found cuda devices: 5
remd_036.log:c0909a-s23.ufhpc found cuda devices: 6
remd_037.log:c0910a-s11.ufhpc found cuda devices: 6
remd_038.log:c1001a-s11.ufhpc found cuda devices: 6
remd_039.log:c1009a-s11.ufhpc found cuda devices: 6
remd_040.log:c1009a-s17.ufhpc found cuda devices: 6
remd_041.log:c1009a-s35.ufhpc found cuda devices: 6
remd_042.log:c0909a-s23.ufhpc found cuda devices: 7
remd_043.log:c0910a-s11.ufhpc found cuda devices: 7
remd_044.log:c1001a-s11.ufhpc found cuda devices: 7
remd_045.log:c1009a-s11.ufhpc found cuda devices: 7
remd_046.log:c1009a-s17.ufhpc found cuda devices: 7
remd_047.log:c1009a-s35.ufhpc found cuda devices: 7
```
In each case above, we are running 48 MPI ranks (slurm tasks) across 6 nodes with 8 GPUs per node.
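For reference, here is a minimal sketch of the kind of per-rank diagnostic that produces lines like those above (assuming mpi4py is available; the actual logging in our scripts differs in the details):

```python
# Sketch of a per-rank diagnostic, assuming mpi4py is installed.
# Each MPI rank reports the hostname and the CUDA devices it can see.
import os
import socket

from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
hostname = socket.gethostname()

# Depending on how the GPUs were requested, SLURM may set this per node
# (all ranks on a host see the same list) or per rank (each sees one index).
devices = os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>")
print(f"rank {rank:03d}: {hostname} found cuda devices: {devices}")
```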
> Thanks for the report.
>
> I can probably fix this by simply using the available device if CUDA_VISIBLE_DEVICES only has one device marked as available.
I've tested the following naive/simple patch with both "--gpus-per-node=8" and "--gpus-per-task=1". It seems to work as intended in both cases.
```diff
--- comm.py.orig	2022-01-20 13:27:10.943950885 -0500
+++ comm.py	2022-01-25 08:26:46.094861471 -0500
@@ -325,6 +325,12 @@
             ]
             if not visible_devices:
                 raise RuntimeError("No cuda devices available")
+            else:
+                if len(visible_devices) == 1:
+                    logger.info("negotiate_device_id: visible_devices contains a single device: %d", visible_devices[0])
+                    device_id = 0
+                    logger.info("hostname: %s, device_id: %d", hostname, device_id)
+                    return device_id
         except KeyError:
             logger.info("%s CUDA_VISIBLE_DEVICES is not set.", hostname)
             visible_devices = None
```
Unfortunately, while that might work for this specific system, I don't think that will work in the general case.
Imagine that a bunch of nodes have multiple devices available, but one node only has access to a single device. The node with the single device will return early from the function, while the remaining nodes will be stalled waiting for that node to complete the mpi gather command. This quick fix only works if every node has exactly one device available.
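To make that concrete, here is a minimal illustration (assuming mpi4py; this is not MELD's actual code) of why a rank that returns early before the collective call stalls everyone else:

```python
# Illustration only (assumes mpi4py): a rank that skips the collective
# leaves every other rank blocked inside the gather.
import os

from mpi4py import MPI

comm = MPI.COMM_WORLD
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")

if len(visible) == 1:
    # The single-device rank takes the shortcut and never calls gather...
    device_id = 0
else:
    # ...so the ranks that do reach this collective wait forever for it,
    # and the job hangs instead of assigning devices.
    gathered = comm.gather((MPI.Get_processor_name(), visible), root=0)
```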
Do you know of anywhere this change to slurm is documented? We haven't seen this on any of our systems, but we might not be running the latest versions. We might need to implement special handling for newer versions of slurm. If it is per-rank, then this entire function is largely redundant: the only reason this function exists is to make sure each rank is using a unique gpu. If slurm now handles this correctly, we can just check that there is one visible device and simply use it, without needing to do any mpi communication. I just would like to know a bit more about what changed with recent versions before going that route.
> Unfortunately, while that might work for this specific system, I don't think that will work in the general case.
Yes, you are correct.
> Imagine that a bunch of nodes have multiple devices available, but one node only has access to a single device. The node with the single device will return early from the function, while the remaining nodes will be stalled waiting for that node to complete the mpi gather command. This quick fix only works if every node has exactly one device available.
Again, yes, I can see the problem. I wanted to solve it within the context of the scatter/gather code but was having a little bit of difficulty understanding it. I'll spend a little more time on it and see if I can get my arms around it.
> Do you know of anywhere this change to slurm is documented?
No, I do not. Unfortunately, SchedMD does not document these sorts of things in much, if any, detail.
> We haven't seen this on any of our systems, but we might not be running the latest versions.
The problem arose when we upgraded from 20.11.8 to 21.08.4.
> We might need to implement special handling for newer versions of slurm. If it is per-rank, then this entire function is largely redundant: the only reason this function exists is to make sure each rank is using a unique gpu. If slurm now handles this correctly, we can just check that there is one visible device and simply use it, without needing to do any mpi communication. I just would like to know a bit more about what changed with recent versions before going that route.
I don't think that you do want to go that route, for the reasons you stated. I think it will work generally in the case of "--gpus-per-task", but there are a variety of ways (too many now) of asking for GPUs in a resource request, and I think that the assignment of CUDA_VISIBLE_DEVICES will continue to depend on HOW the GPUs are requested (unfortunately).
As I said, I'll continue to look at it as I have time, and maybe I'll figure out what needs to change in the scatter/gather algorithm to make it work. It seems like it should work anyway, but we hit the runtime exception, so something is not quite right.
After looking at the code some more and putting in some additional debugging for my own understanding, I think that my original assessment is correct. The check that raises the runtime exception seems to rest on two assumptions:
1. That if CUDA_VISIBLE_DEVICES is set on the leader (rank 0), then it will be set in all the ranks. For SLURM, at least, this is a safe assumption.
2. That, if set, CUDA_VISIBLE_DEVICES will be the same for each rank (i.e. task) on a given node.
Assumption (2) is the problem. If CUDA_VISIBLE_DEVICES is set on a per host basis, then the assumption is correct and we do not hit the runtime exception. This is the case when the GPUs are requested using the "--gpus-per-node=N" sbatch directive.
However, if CUDA_VISIBLE_DEVICES is set on a per rank basis, then the CUDA_VISIBLE_DEVICES string for each rank on a given host will, in fact, be different, so assumption (2) no longer holds and we hit the runtime exception referenced earlier. This is the case under SLURM 21.08.4 when GPUs are requested using "--gpus-per-task=N".
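To illustrate, here is a rough sketch of that kind of per-host consistency check (not the exact MELD code): with per-node assignment it passes, with per-rank assignment it raises.

```python
# Sketch of a per-host consistency check (not the exact MELD code).
from collections import defaultdict

def check_hosts(gathered):
    """gathered: (hostname, visible_devices) reported by every rank."""
    by_host = defaultdict(list)
    for hostname, devices in gathered:
        by_host[hostname].append(devices)
    for hostname, device_lists in by_host.items():
        if any(devices != device_lists[0] for devices in device_lists):
            raise RuntimeError(f"GPU devices for host {hostname} do not match")

# --gpus-per-node: every rank on the host reports the same list -> passes.
check_hosts([("c1106a-s11", [0, 1, 2, 3]), ("c1106a-s11", [0, 1, 2, 3])])

# --gpus-per-task: each rank reports its own single device -> raises.
try:
    check_hosts([("c0909a-s23", [0]), ("c0909a-s23", [1])])
except RuntimeError as exc:
    print(exc)
```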
Solving this in a completely general way seems problematic since you need to know whether CUDA_VISIBLE_DEVICES was set on a per host or per rank basis. You could do this under SLURM with the SLURM_GPUS_PER_TASK and/or SLURM_GPUS_PER_NODE environment variables but that, of course, is specific to SLURM and I doubt you want that in your code.
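For concreteness, that SLURM-specific detection might look something like the following sketch (the names are illustrative only, not an actual change to comm.py):

```python
# Sketch only: detect the per-rank regime via SLURM's environment and
# short-circuit device selection; otherwise defer to the existing
# scatter/gather negotiation.
import os
from typing import List, Optional

def assigned_per_rank() -> bool:
    # SLURM exports SLURM_GPUS_PER_TASK when --gpus-per-task was used,
    # which is the case where CUDA_VISIBLE_DEVICES is set per rank.
    return "SLURM_GPUS_PER_TASK" in os.environ

def local_device_id(visible_devices: List[int]) -> Optional[int]:
    if assigned_per_rank() and len(visible_devices) == 1:
        # Each process sees exactly one device; relative to its own
        # CUDA_VISIBLE_DEVICES it is always index 0.
        return 0
    # None means: fall back to the existing per-host negotiation.
    return None
```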
Also, things get a little bit turned on their heads because, as I showed in my initial patch, when --gpus-per-task=1, we have a case where each MPI rank (i.e. slurm task, i.e. unix process) will simply see device 0 and can safely use it. So building a list and scattering it is kind of a waste of effort.
Hopefully, all that makes sense. It is not a solution but at least I think I understand the problem better.
Yeah, I think you've articulated the issue.
The design was based around CUDA_VISIBLE_DEVICES being set per-node, which was the case for SGE (at least when we initially developed this; we haven't used SGE in forever) and for SLURM.
I don't see a general way to handle this. It doesn't seem as though MELD can automatically determine which regime it is running under. So it might be necessary to specify that in some way when creating the communicator. On the other hand, I'm not opposed to adding some special handling using those environment variables. MELD is run almost exclusively on systems using SLURM, at least currently.
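Purely as an illustration of the first option, the regime could be an explicit constructor flag; the names below are invented for this sketch and are not MELD's current API:

```python
# Hypothetical sketch only: "gpus_assigned_per_rank" is an invented flag
# name, and this class is not the actual MELD communicator implementation.
class Communicator:
    def __init__(self, gpus_assigned_per_rank: bool = False):
        # When True, trust that the scheduler pinned exactly one GPU to
        # each rank and skip the per-host device negotiation entirely.
        self._gpus_assigned_per_rank = gpus_assigned_per_rank

    def negotiate_device_id(self) -> int:
        if self._gpus_assigned_per_rank:
            # Each rank's CUDA_VISIBLE_DEVICES holds a single entry, so
            # the device is always index 0 within this process.
            return 0
        # Otherwise the existing scatter/gather negotiation would run here.
        raise NotImplementedError("per-host negotiation omitted from this sketch")
```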
I've made some changes to negotiate_device_id() in comm.py that seem to handle at least the following correctly: --gpus-per-task=1, --gpus-per-node=N, and --gres=gpu:N.
The code relies on the "SLURM_GPUS_PER_TASK" environment variable to avoid throwing the "GPU devices for host do not match" runtime error. Let me know if you are interested in it and I can get you a patch or a pull request or whatever.
If you prefer to spin your own, that's fine. We'll just run the code here until we have an upstream fix.
A PR would be great.
We upgraded to SLURM 21.08.4 this week and promptly ran into the "GPU devices for host do not match" runtime error with jobs that previously ran without issue.

When requesting N tasks and --gpus-per-task=1, it seems that SLURM now sets CUDA_VISIBLE_DEVICES on a per-rank basis rather than a per-node basis. Thus each MPI rank now sees a single GPU index in CUDA_VISIBLE_DEVICES.
If we change the resource request to specify ntasks-per-node and gpus-per-node, CUDA_VISIBLE_DEVICES contains an index for each GPU assigned on the node and MELD runs as expected without hitting the above exception.
I'm not saying that this is a bug in MELD per se, but the assumption that CUDA_VISIBLE_DEVICES is going to be set on a per-node basis can no longer be relied upon. That is to say, the device-matching check will fail and raise a runtime exception for a perfectly valid SLURM resource request.
Hopefully, that makes sense. Let me know if you have any questions or if I'm not clear or misunderstanding something.