ml-energy / zeus

Deep Learning Energy Measurement and Optimization
https://ml.energy/zeus
Apache License 2.0

Abstracting away the gpu #46

Closed parthraut closed 3 months ago

parthraut commented 3 months ago

PR is mostly done, few notes:

  1. ensure_homogeneous has not been implemented; I was a little confused about what it should check. Should it verify that all GPUs are the same kind, e.g., the same model and GPU architecture?
  2. Some methods of the AMD GPU class have not been implemented, and neither has its exception handling. I focused on NVIDIA GPUs for this PR, but extending this should not be difficult if needed.
  3. Style: the lint check is failing because many names have uppercase letters: "N802 Function name getTotalEnergyConsumption should be lowercase". We decided on camelCase to match the pynvml methods, so what should be done to resolve this?
vercel[bot] commented 3 months ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name: zeus
Status: ❌ Failed (Inspect)
Updated (UTC): Apr 1, 2024 1:48am
jaywonchung commented 3 months ago

Could you rebase onto master? There was a small change in docs/requirements.txt.

jaywonchung commented 3 months ago

Thanks a lot for your work! I'll look at the code soon.

Responses to your questions:

  1. I think ensuring that the device names (from nvmlDeviceGetName) are all equal is enough; see the sketch after this list.
  2. Sounds good!
  3. See the bottom of pyproject.toml. There you can disable specific Ruff error codes for entire files.
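
For concreteness, here is a minimal sketch of the homogeneity check described in answer 1, assuming pynvml is installed; the function name and the choice to raise ValueError are illustrative, not the exact Zeus API.

import pynvml


def ensure_homogeneous() -> None:
    """Raise if the visible NVIDIA GPUs are not all the same model."""
    pynvml.nvmlInit()
    try:
        # Collect the model name of every GPU and compare them as a set.
        names = {
            pynvml.nvmlDeviceGetName(pynvml.nvmlDeviceGetHandleByIndex(i))
            for i in range(pynvml.nvmlDeviceGetCount())
        }
        if len(names) > 1:
            raise ValueError(f"Heterogeneous GPUs detected: {sorted(names)}")
    finally:
        pynvml.nvmlShutdown()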
parthraut commented 3 months ago

class PerseusOptimizer(Callback):
    """Perseus optimizer."""

    def __init__(
        self,
        rank: int,
        dp_rank: int,
        pp_rank: int,
        tp_rank: int,
        device_id: int,
        dp_degree: int,
        pp_degree: int,
        tp_degree: int,
        world_size: int,
        server_url: str,
        job_metadata: str | None = None,
    ) -> None:
        """Initialize the Perseus optimizer.

        Assumptions:
            - `torch.distributed` has been initialized.
            - `torch.cuda.set_device` has been called with `device_id`.
                This is needed to broadcast the job ID to all ranks.

        The master process (rank 0) will register the job with the Perseus
        server and retrieve the job ID of this job. Then, each rank will
        report itself to the Perseus server with the job ID.

        Args:
            rank: Global rank of the current process.
            dp_rank: Rank in the data parallel group.
            pp_rank: Rank in the pipeline parallel group.
            tp_rank: Rank in the tensor parallel group.
            **device_id: CUDA device ID that the current process manages.**
            dp_degree: Size of the data parallel group.
            pp_degree: Size of the pipeline parallel group.
            tp_degree: Size of the tensor parallel group.
            world_size: Total number of ranks that participate in training.
            server_url: URL of the Perseus server.
            job_metadata: An optional arbitrary string that describes the job. This will
                be appended to the job ID if given. Typically for logging purposes.
        """

Here, is device_id reindexed to respect CUDA_VISIBLE_DEVICES?

For example, if there are four GPUs on the system and CUDA_VISIBLE_DEVICES="0,2", then to access the GPU whose physical index is 2, does it reindex device_id to 1 or keep it as 2? ZeusMonitor reindexes, but from the comment above I wasn't sure.

^ After doing some research, I assume it's reindexed. I finished optimizer.py with this assumption.
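
To make the assumed mapping concrete, here is a small hypothetical illustration (the values are made up, and the environment variable must be set before CUDA is initialized):

import os

# Hypothetical setup: four physical GPUs, of which only 0 and 2 are exposed to CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"  # must be set before CUDA is initialized

# CUDA then enumerates two devices, 0 and 1, in the order listed above.
visible = [int(i) for i in os.environ["CUDA_VISIBLE_DEVICES"].split(",")]
cuda_to_nvml = dict(enumerate(visible))
print(cuda_to_nvml)  # {0: 0, 1: 2} -> passing device_id=1 targets NVML device 2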

jaywonchung commented 3 months ago

Correct, 1 is the index that will map to NVML device 2. The test is failing because you imported torch in gpu.py. We actually avoid that in Zeus because we want people to be able to use Zeus without having PyTorch installed; someone might be using JAX, or someone might just want to measure GPU energy without doing deep learning at all. That's why we do this: https://github.com/ml-energy/zeus/blob/f9275786fd98920b3bf98485c78f9e15b2304364/zeus/util/framework.py#L29-L42
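
The linked framework.py defers the torch import along these lines (a rough sketch, not the exact code; whether it also checks CUDA availability is an assumption):

from functools import lru_cache


@lru_cache(maxsize=1)
def torch_is_available() -> bool:
    """Check whether PyTorch is importable, without requiring it at module import time."""
    try:
        import torch  # deferred so that Zeus itself never hard-depends on PyTorch

        return torch.cuda.is_available()
    except ImportError:
        return False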

jaywonchung commented 3 months ago

Use the "Add to batch" feature

jaywonchung commented 3 months ago

Apparently this is what's causing the doc build failures. Don't worry about the docs failing; I think it'll be resolved automatically soon.

parthraut commented 3 months ago

Ready for review!

Completed tasks:

✅ Removed cuda_sync and setDevice from the gpu module and put them back where they belong
✅ Changed gpu/get_gpu to be torch_is_available()-like to make it scalable (a sketch of this pattern is at the end of this comment)
✅ Fixed setPersistenceMode and added the TODO comment back where the SYS_ADMIN check for setting persistence mode was
✅ Included the ability to turn off persistence mode
✅ Fixed setPowerManagementLimit, which originally broke API consistency; added a resetPowerManagementLimit function
✅ Fixed ensure_homogeneous: convert the device names to a set and check whether its length is larger than 1
✅ Declared self.gpus as an abstract property to ensure it exists
✅ Fixed the names of the exceptions
✅ Fall back to ZeusGPUUnknownError

Big fixes:
✅ Moved the lru_cache logic into the GPU class
✅ Fixed tests and removed gpu_indices

Style fixes:
✅ Added module-level docstrings to zeus/__init__.py and zeus/device/__init__.py
✅ Fixed docstrings
✅ Removed all private attributes from docstrings
✅ Made sure the __init__ signatures all match
✅ __init__ comes first
✅ Mentioned the unit for each method

Questions and notes:
❓ It looks like it is failing on GitHub Actions, but it works perfectly locally. I'm not sure what exactly the issue is.
❓ What do you think about the names of the exceptions?
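
A minimal, self-contained sketch of the torch_is_available()-like cached entry point referenced in the list above; every name in it (get_gpu, GPUs, NVIDIAGPUs, AMDGPUs, nvml_is_available) is an illustrative assumption rather than the exact Zeus API.

from functools import lru_cache


class GPUs:
    """Abstract base for a collection of GPUs."""


class NVIDIAGPUs(GPUs):
    """NVIDIA implementation, backed by pynvml in the real code."""


class AMDGPUs(GPUs):
    """AMD implementation, backed by amdsmi in the real code."""


def nvml_is_available() -> bool:
    """Return True if pynvml can be imported and initialized."""
    try:
        import pynvml

        pynvml.nvmlInit()
        return True
    except Exception:
        return False


@lru_cache(maxsize=1)
def get_gpu() -> GPUs:
    """Instantiate the GPU abstraction once; later calls return the cached object."""
    return NVIDIAGPUs() if nvml_is_available() else AMDGPUs()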

jaywonchung commented 3 months ago

Exception names look good! But is there a need to export all of the exceptions in zeus/device/__init__.py? I think just the core API like get_gpu should be exported and other things kept inside zeus.device.gpu.

parthraut commented 3 months ago

> Exception names look good! But is there a need to export all of the exceptions in zeus/device/__init__.py? I think just the core API like get_gpu should be exported and other things kept inside zeus.device.gpu.

So initially I was only exporting the exceptions used in the source code. But you said not to selectively export some of them, so I added all of them.
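
For concreteness, the narrower export surface suggested above might look roughly like this in zeus/device/__init__.py (the exact module path and entry-point name are assumptions):

"""Device abstraction layer for Zeus."""

from zeus.device.gpu import get_gpu  # assumed location of the core entry point

# Only the core API is re-exported; exceptions stay importable from zeus.device.gpu.
__all__ = ["get_gpu"]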