Closed saforem2 closed 8 months ago
This fixes things when running locally, i.e. without a scheduler (PBS, Slurm, ...), e.g. on my MacBook:
$ mpirun -np 2 python3 -m ezpz framework=pytorch backend=DDP use_wandb=true
[2024-01-22 13:10:57][INFO][dist:257] - DistInfo={
    "DEVICE": "mps",
    "DEVICE_ID": "mps:0",
    "DISTRIBUTED_BACKEND": "gloo",
    "GPUS_PER_NODE": 12,
    "HOSTFILE": "/Users/samforeman/projects/saforem2/ezpz/src/ezpz/outputs/runs/pytorch/DDP/2024-01-22/13-10-57/hostfile",
    "HOSTNAME": "localhost",
    "HOSTS": "['localhost']",
    "LOCAL_RANK": 0,
    "MACHINE": "localhost",
    "NGPUS": 12,
    "NODE_ID": 0,
    "NUM_NODES": 1,
    "RANK": 0,
    "SCHEDULER": "LOCAL",
    "WORLD_SIZE_IN_USE": 2,
    "WORLD_SIZE_TOTAL": 12
}
[2024-01-22 13:10:57][INFO][dist:642] - [0/2] Using device='mps' with backend='DDP' + 'gloo' for distributed training.
[2024-01-22 13:10:57][INFO][dist:313] - [device='mps'][rank=0/1][local_rank=0/11][node=0/0]
[2024-01-22 13:10:57][WARNING][dist:314] - Using [2 / 12] available "mps" devices !!
[2024-01-22 13:10:57][INFO][dist:789] - Setting up wandb from rank: 0
[2024-01-22 13:10:57][INFO][dist:790] - Using: WB PROJECT: ezpz
wandb: Currently logged in as: saforem2 (l2hmc-qcd). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.16.2 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.16.0
wandb: Run data is saved locally in /Users/samforeman/projects/saforem2/ezpz/src/ezpz/outputs/runs/pytorch/DDP/2024-01-22/13-10-57/wandb/run-20240122_131058-8jc7852e
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run major-frost-230
wandb: ⭐️ View project at https://wandb.ai/l2hmc-qcd/ezpz
wandb: 🚀 View run at https://wandb.ai/l2hmc-qcd/ezpz/runs/8jc7852e
[2024-01-22 13:10:59][INFO][dist:820] - W&B RUN: [major-frost-230](https://wandb.ai/l2hmc-qcd/ezpz/runs/8jc7852e)
[2024-01-22 13:10:59][INFO][dist:848] - Running on machine='localhost'
[2024-01-22 13:10:59][INFO][__main__:86] - config=TrainConfig(use_wandb=True, wandb_project_name='ezpz')
[2024-01-22 13:10:59][INFO][__main__:87] - Output dir: /Users/samforeman/projects/saforem2/ezpz/src/ezpz/outputs/runs/pytorch/DDP/2024-01-22/13-10-57
[2024-01-22 13:10:59][WARNING][__main__:97] - Startup time: 0.2394669170025736
[2024-01-22 13:10:59][WARNING][__main__:99] - 🚀 [major-frost-230](https://wandb.ai/l2hmc-qcd/ezpz/runs/8jc7852e)
[2024-01-22 13:10:59][INFO][dist:123] - `main` took: dt=1.8236s
wandb: WARNING No program path found, not creating job artifact. See https://docs.wandb.ai/guides/launch/create-job
wandb:
wandb: Run history:
wandb: startup_time ▁
wandb: timeit/main ▁
wandb:
wandb: Run summary:
wandb: startup_time 0.23947
wandb: timeit/main 1.82361
wandb:
wandb: 🚀 View run major-frost-230 at: https://wandb.ai/l2hmc-qcd/ezpz/runs/8jc7852e
wandb: Synced 5 W&B file(s), 0 media file(s), 436 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240122_131058-8jc7852e/logs
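For context, the `"SCHEDULER": "LOCAL"` entry in the log is the no-scheduler fallback. Here is a minimal sketch of how such a fallback could be detected, assuming the scheduler is inferred from the job environment variables `PBS_JOBID` and `SLURM_JOB_ID`. The `detect_scheduler` helper is hypothetical and is not necessarily how ezpz does it.

```python
import os
from typing import Mapping, Optional


def detect_scheduler(env: Optional[Mapping[str, str]] = None) -> str:
    """Guess the active scheduler from environment variables.

    Hypothetical helper for illustration only (not ezpz's actual code).
    Assumes PBS jobs export PBS_JOBID and Slurm jobs export SLURM_JOB_ID.
    """
    env = os.environ if env is None else env
    if "PBS_JOBID" in env:
        return "PBS"
    if "SLURM_JOB_ID" in env:
        return "SLURM"
    # Neither variable is set: assume a plain local run (e.g. a laptop).
    return "LOCAL"
```

Passing the environment in as an argument keeps the check easy to exercise without an actual job allocation.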
Slurm support: this should be pretty straightforward, I just need to look back up my PBS <--> SLURM environment variable conversions 😂
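As a rough starting point, a few of the commonly cited PBS <--> SLURM equivalents can be kept in a small lookup table. This is only a sketch: the `PBS_TO_SLURM` table and the `slurm_env_as_pbs` helper are hypothetical names, and the mapping is a partial list rather than the conversions actually used in ezpz.

```python
from typing import Dict, Mapping

# Partial, illustrative mapping of PBS variables to their Slurm counterparts.
PBS_TO_SLURM: Dict[str, str] = {
    "PBS_JOBID": "SLURM_JOB_ID",
    "PBS_JOBNAME": "SLURM_JOB_NAME",
    "PBS_O_WORKDIR": "SLURM_SUBMIT_DIR",
    "PBS_O_HOST": "SLURM_SUBMIT_HOST",
    "PBS_QUEUE": "SLURM_JOB_PARTITION",
}


def slurm_env_as_pbs(env: Mapping[str, str]) -> Dict[str, str]:
    """Return PBS-named variables filled in from the Slurm values in `env`.

    Hypothetical helper: Slurm variables that are not set are skipped.
    """
    return {
        pbs_name: env[slurm_name]
        for pbs_name, slurm_name in PBS_TO_SLURM.items()
        if slurm_name in env
    }
```

The node list is deliberately left out of the table: `PBS_NODEFILE` is a path to a file with one host per line, whereas `SLURM_JOB_NODELIST` is a compact host expression that has to be expanded first (e.g. with `scontrol show hostnames`), so it is not a one-to-one rename.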