saforem2 / ezpz

Train across all your devices, ezpz 🍋
https://saforem2.github.io/ezpz/
MIT License

Fix python interface when no scheduler present #6

Closed: saforem2 closed this 8 months ago

saforem2 commented 8 months ago

This fixes the Python interface when running locally, i.e. without a job scheduler (PBS, Slurm, ...), e.g. on my MacBook:

$ mpirun -np 2 python3 -m ezpz framework=pytorch backend=DDP use_wandb=true
[2024-01-22 13:10:57][INFO][dist:257] - DistInfo={
    "DEVICE": "mps",
    "DEVICE_ID": "mps:0",
    "DISTRIBUTED_BACKEND": "gloo",
    "GPUS_PER_NODE": 12,
    "HOSTFILE": "/Users/samforeman/projects/saforem2/ezpz/src/ezpz/outputs/runs/pytorch/DDP/2024-01-22/13-10-57/hostfile",
    "HOSTNAME": "localhost",
    "HOSTS": "['localhost']",
    "LOCAL_RANK": 0,
    "MACHINE": "localhost",
    "NGPUS": 12,
    "NODE_ID": 0,
    "NUM_NODES": 1,
    "RANK": 0,
    "SCHEDULER": "LOCAL",
    "WORLD_SIZE_IN_USE": 2,
    "WORLD_SIZE_TOTAL": 12
}
[2024-01-22 13:10:57][INFO][dist:642] - [0/2] Using device='mps' with backend='DDP' + 'gloo' for distributed training.
[2024-01-22 13:10:57][INFO][dist:313] - [device='mps'][rank=0/1][local_rank=0/11][node=0/0]
[2024-01-22 13:10:57][WARNING][dist:314] - Using [2 / 12] available "mps" devices !!
[2024-01-22 13:10:57][INFO][dist:789] - Setting up wandb from rank: 0
[2024-01-22 13:10:57][INFO][dist:790] - Using: WB PROJECT: ezpz
wandb: Currently logged in as: saforem2 (l2hmc-qcd). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.16.2 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.16.0
wandb: Run data is saved locally in /Users/samforeman/projects/saforem2/ezpz/src/ezpz/outputs/runs/pytorch/DDP/2024-01-22/13-10-57/wandb/run-20240122_131058-8jc7852e
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run major-frost-230
wandb: ⭐️ View project at https://wandb.ai/l2hmc-qcd/ezpz
wandb: 🚀 View run at https://wandb.ai/l2hmc-qcd/ezpz/runs/8jc7852e
[2024-01-22 13:10:59][INFO][dist:820] - W&B RUN: [major-frost-230](https://wandb.ai/l2hmc-qcd/ezpz/runs/8jc7852e)
[2024-01-22 13:10:59][INFO][dist:848] - Running on machine='localhost'
[2024-01-22 13:10:59][INFO][__main__:86] - config=TrainConfig(use_wandb=True, wandb_project_name='ezpz')
[2024-01-22 13:10:59][INFO][__main__:87] - Output dir: /Users/samforeman/projects/saforem2/ezpz/src/ezpz/outputs/runs/pytorch/DDP/2024-01-22/13-10-57
[2024-01-22 13:10:59][WARNING][__main__:97] - Startup time: 0.2394669170025736
[2024-01-22 13:10:59][WARNING][__main__:99] - 🚀 [major-frost-230](https://wandb.ai/l2hmc-qcd/ezpz/runs/8jc7852e)
[2024-01-22 13:10:59][INFO][dist:123] - `main` took: dt=1.8236s
wandb: WARNING No program path found, not creating job artifact. See https://docs.wandb.ai/guides/launch/create-job
wandb:
wandb: Run history:
wandb: startup_time ▁
wandb:  timeit/main ▁
wandb:
wandb: Run summary:
wandb: startup_time 0.23947
wandb:  timeit/main 1.82361
wandb:
wandb: 🚀 View run major-frost-230 at: https://wandb.ai/l2hmc-qcd/ezpz/runs/8jc7852e
wandb: Synced 5 W&B file(s), 0 media file(s), 436 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240122_131058-8jc7852e/logs
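
For reference, the `"SCHEDULER": "LOCAL"` entry in the output above is the fallback case this PR is about. A minimal sketch of that kind of check is below. It assumes detection is done via the `PBS_JOBID` and `SLURM_JOB_ID` environment variables that PBS and Slurm set inside a job. The helper name `detect_scheduler` is hypothetical, used only for illustration, and is not taken from the ezpz source:

```python
import os
from typing import Mapping, Optional


def detect_scheduler(env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper (not ezpz's actual API).

    Returns "PBS" or "SLURM" when the corresponding job-ID variable is set,
    and falls back to "LOCAL" when no scheduler is detected (e.g. a laptop).
    """
    # Default to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    if env.get("PBS_JOBID"):
        return "PBS"
    if env.get("SLURM_JOB_ID"):
        return "SLURM"
    # Assumption: anything that is not PBS or Slurm is treated as a local run.
    return "LOCAL"


if __name__ == "__main__":
    # The result depends on where this is run.
    print(detect_scheduler())
```

Taking the environment as an optional argument keeps the check easy to exercise without a scheduler, e.g. `detect_scheduler({})` for the local case.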