Closed parthraut closed 3 months ago
The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name | Status | Preview | Comments | Updated (UTC)
---|---|---|---|---
zeus | ❌ Failed (Inspect) | | | Apr 1, 2024 1:48am
Could you rebase to master? There was a small change in docs/requirements.txt.
Thanks a lot for your work! I'll look at the code soon.
Responses to your questions:

Checking that the device names (from `nvmlDeviceGetName`) are equal is enough.

```python
class PerseusOptimizer(Callback):
    """Perseus optimizer."""

    def __init__(
        self,
        rank: int,
        dp_rank: int,
        pp_rank: int,
        tp_rank: int,
        device_id: int,
        dp_degree: int,
        pp_degree: int,
        tp_degree: int,
        world_size: int,
        server_url: str,
        job_metadata: str | None = None,
    ) -> None:
        """Initialize the Perseus optimizer.

        Assumptions:
            - `torch.distributed` has been initialized.
            - `torch.cuda.set_device` has been called with `device_id`.
                This is needed to broadcast the job ID to all ranks.

        The master process (rank 0) will register the job with the Perseus
        server and retrieve the job ID of this job. Then, each rank will
        report itself to the Perseus server with the job ID.

        Args:
            rank: Global rank of the current process.
            dp_rank: Rank in the data parallel group.
            pp_rank: Rank in the pipeline parallel group.
            tp_rank: Rank in the tensor parallel group.
            device_id: CUDA device ID that the current process manages.
            dp_degree: Size of the data parallel group.
            pp_degree: Size of the pipeline parallel group.
            tp_degree: Size of the tensor parallel group.
            world_size: Total number of ranks that participate in training.
            server_url: URL of the Perseus server.
            job_metadata: An optional arbitrary string that describes the job. This will
                be appended to the job ID if given. Typically for logging purposes.
        """
```
Here, is `device_id` reindexed to respect `CUDA_VISIBLE_DEVICES`? For example, if there are four GPUs on the system and `CUDA_VISIBLE_DEVICES="0,2"`, does accessing the GPU whose system-wide index is 2 require passing 1 (reindexed) or 2? `ZeusMonitor` reindexes, but from the comment above I wasn't sure.

^ After doing some research, I assume it's reindexed. Finished optimizer.py with this assumption.
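For reference, the reindexing I assumed can be sketched like this (a hypothetical helper, not Zeus code; it only handles a plain comma-separated list of integers in `CUDA_VISIBLE_DEVICES`, not UUID entries):

```python
import os

def to_nvml_index(cuda_index: int) -> int:
    """Map a CUDA device ordinal (as seen by PyTorch) to the system-wide
    NVML index, respecting CUDA_VISIBLE_DEVICES. Illustrative helper only;
    assumes the variable holds a plain comma-separated list of integers."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return cuda_index  # no masking: CUDA ordinals and NVML indices coincide
    mapping = [int(i) for i in visible.split(",") if i.strip()]
    return mapping[cuda_index]

os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"
print(to_nvml_index(1))  # → 2: CUDA device 1 is NVML device 2
```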
Correct, 1 is the index that will map to NVML device 2.

The test is failing because you imported `torch` in `gpu.py`. We actually avoid that in Zeus because we want people to be able to use Zeus without having PyTorch installed -- someone might be using JAX, someone might just want to measure GPU energy and they're not even doing deep learning, etc. That's why we do this:

https://github.com/ml-energy/zeus/blob/f9275786fd98920b3bf98485c78f9e15b2304364/zeus/util/framework.py#L29-L42
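The linked snippet boils down to a lazy availability check along these lines (a sketch of the pattern only; the actual `framework.py` caches the imported module rather than just a boolean):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def torch_is_available() -> bool:
    """Return whether PyTorch is importable, without making it a hard
    dependency of the package. Sketch of the lazy-import pattern; the
    real framework.py caches the imported module itself."""
    try:
        import torch  # noqa: F401  # only imported if the user has it installed
    except ImportError:
        return False
    return True
```

Call sites then guard torch-specific code with `if torch_is_available():` instead of importing `torch` at module top level.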
Use the "Add to batch" feature
Apparently this is what's causing the doc build failures. Don't worry about the docs failing; I think it'll be resolved automatically soon.
Ready for review!
Completed tasks:
- ✅ Removed cuda_sync and setDevice from gpu and put them back where they belong
- ✅ Changed gpu/get_gpu to be torch_is_available()-like to make it scalable
- ✅ Fixed setPersistenceMode; added the TODO comment back where the SYS_ADMIN check for setting persistence mode was
- ✅ Included the ability to turn off persistence mode
- ✅ Fixed setPowerManagementLimit, which originally broke API consistency; added a resetPowerManagementLimit function
- ✅ Fixed the homogeneity check: just convert to a set and check whether its length is larger than 1
- ✅ Declared self.gpus as an abstract property to ensure it exists
- ✅ Fixed the names of exceptions
- ✅ Fall back to ZeusGPUUnknownError

BIG fixes:
- ✅ Moved the lru_cache stuff to the GPU class
- ✅ Fixed tests and removed gpu_indicies

STYLE fixes:
- ✅ Module-level docstrings in zeus/__init__.py and zeus/device/__init__.py
- ✅ Fixed docstrings
- ✅ Removed all private attributes from docstrings
- ✅ Made sure the signatures of __init__ all match
- ✅ __init__ should come first
- ✅ Mention the unit for each method

Questions and notes:
- ❓ It looks like it's failing on GitHub Actions, but it works perfectly locally. I'm not sure what exactly the issue is.
- ❓ What do you think about the names of the exceptions?
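For reference, the set-based homogeneity check from the list above can be sketched as (the function name is illustrative):

```python
def ensure_homogeneous(gpu_names: list[str]) -> None:
    """Raise if the visible GPUs are not all the same model.
    Converting to a set and checking whether its length is larger than 1
    implements the check described above; the name is illustrative."""
    if len(set(gpu_names)) > 1:
        raise ValueError(f"Heterogeneous GPUs found: {sorted(set(gpu_names))}")

ensure_homogeneous(["NVIDIA A40", "NVIDIA A40"])  # homogeneous: no error
```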
Exception names look good! But is there a need to export all of the exceptions in `zeus/devices/__init__.py`? I think just the core API like `get_gpu` should be exported and other things kept inside `zeus.device.gpu`.
> Exception names look good! But is there a need to export all of the exceptions in `zeus/devices/__init__.py`? I think just the core API like `get_gpu` should be exported and other things kept inside `zeus.device.gpu`.
So initially I was only exporting the exceptions used in the source code. But you said not to selectively export some of them, so I added all of them.
PR is mostly done; a few notes:

`getTotalEnergyConsumption` triggers the "should be lowercase" naming warning. We decided on camel case to match pynvml methods, so what should be done to resolve this?
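Assuming the warning comes from a naming lint (for example pep8-naming's N802 rule or pylint's invalid-name check), one option is a targeted suppression rather than renaming, since the camel case is deliberate:

```python
class GPU:
    """Minimal sketch showing a per-line lint suppression for deliberately
    camelCase methods that mirror pynvml naming. The rule code in the noqa
    comment depends on the linter in use (N802 is pep8-naming's)."""

    def getTotalEnergyConsumption(self) -> int:  # noqa: N802
        """Return total energy consumption (placeholder value)."""
        return 0
```

Alternatively, the rule could be disabled for the whole module in the linter configuration, which avoids repeating the comment on every pynvml-style method.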