saforem2 / ezpz

Train across all your devices, ezpz 🍋
https://saforem2.github.io/ezpz/
MIT License

Merge `ezpz-profile` into `main` #12

Closed saforem2 closed 3 months ago

saforem2 commented 3 months ago

This adds `ezpz/profile.py`, which provides a context manager for ez profiling with pyinstrument.

example:

# test.py
import os

from ezpz.profile import get_context_manager

# Assumption: the launcher exports this process's rank as RANK; default to 0.
RANK = int(os.environ.get("RANK", 0))


def main():
    print("Hello!")


if __name__ == '__main__':
    # NOTE:
    # 1. if `rank` is passed to `get_context_manager`:
    #        - the profiler is ONLY instantiated if rank == 0;
    #          otherwise, a contextlib.nullcontext() instance is returned.
    # 2. if `strict=True`:
    #        - only run if "PYINSTRUMENT_PROFILER=1" in environment
    cm = get_context_manager(rank=RANK, strict=True)
    # Wrap the call to main() from outside, so main() does not call itself.
    with cm:
        main()

then run with:

PYINSTRUMENT_PROFILER=1 python test.py

or, to use the existing ezpz/test_dist.py,

PYINSTRUMENT_PROFILER=1 launch python3 -m ezpz.test_dist
profiler output:

```txt
  _     ._   __/__   _ _  _  _ _/_   Recorded: 12:10:12  Samples:  462
 /_//_/// /_\ / //_// / //_'/ //     Duration: 2.583     CPU time: 2.837
/   _/                      v4.6.2

Program: /lus/gecko/projects/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/ezpz/src/ezpz/test_dist.py

2.583 ezpz/test_dist.py:1
`- 2.581 main  ezpz/test_dist.py:150
   |- 0.768 DistributedDataParallel._wrapped_call_impl  torch/nn/modules/module.py:1514
   |     [6 frames hidden]  torch
   |        0.757 Network._call_impl  torch/nn/modules/module.py:1520
   |        `- 0.741 Network.forward  ezpz/test_dist.py:116
   |           `- 0.741 Sequential._wrapped_call_impl  torch/nn/modules/module.py:1514
   |                 [7 frames hidden]  torch,
   |                    0.740 linear
   |- 0.652 DistributedDataParallel.__init__  torch/nn/parallel/distributed.py:630
   |     [6 frames hidden]  torch,
   |        0.603 PyCapsule._verify_params_across_processes
   |- 0.615 Tensor.backward  torch/_tensor.py:433
   |     [3 frames hidden]  torch,
   |        0.594 _EngineBase.run_backward
   |- 0.181 tplot_dict  ezpz/test_dist.py:124
   |  |- 0.110 show  plotext/_core.py:292
   |  |     [6 frames hidden]  plotext
   |  `- 0.037 plotext/__init__.py:1
   |        [2 frames hidden]  plotext
   |- 0.102 Logger.info  logging/__init__.py:1436
   |     [7 frames hidden]  logging, rich
   |        0.080 RichHandler.render  rich/logging.py:199
   |        `- 0.063 FluidLogRender.__call__  ezpz/log/handler.py:79
   |           `- 0.039 Text.__add__  rich/text.py:178
   |                 [2 frames hidden]  rich
   |- 0.056 wrapper  torch/optim/optimizer.py:356
   |     [5 frames hidden]  torch
   |- 0.051 Tensor.item
   |- 0.047 calc_loss  ezpz/test_dist.py:120
   |- 0.032 Tensor.to
   `- 0.028 _VariableFunctionsClass.rand
```
traceback:

```bash
# [🌌][12:09:47 PM][foremans@x4711c1s2b0n0][…/Megatron-DeepSpeed][🌱 main][$][aurora_nre_models_frameworks-2024.1]
(aurora_nre_models_frameworks-2024.1) $ NUMEXPR_MAX_THREADS=16 PYINSTRUMENT_PROFILER=1 launch python3 -m ezpz.test_dist
Connected to tcp://x4711c1s2b0n0.hostmgmt2711.cm.aurora.alcf.anl.gov:7919
Found executable /gecko/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.1/bin/python3
Launching application 00fdff5d-6c02-4df4-9949-6b346d84b66e
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=1/23][local_rank=1/11][node=1/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=8/23][local_rank=8/11][node=0/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=12/23][local_rank=0/11][node=0/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=2/23][local_rank=2/11][node=0/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=14/23][local_rank=2/11][node=0/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=15/23][local_rank=3/11][node=1/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=16/23][local_rank=4/11][node=0/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=3/23][local_rank=3/11][node=1/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=17/23][local_rank=5/11][node=1/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=4/23][local_rank=4/11][node=0/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=18/23][local_rank=6/11][node=0/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=22/23][local_rank=10/11][node=0/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=5/23][local_rank=5/11][node=1/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=13/23][local_rank=1/11][node=1/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=6/23][local_rank=6/11][node=0/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=19/23][local_rank=7/11][node=1/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=7/23][local_rank=7/11][node=1/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=20/23][local_rank=8/11][node=0/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=21/23][local_rank=9/11][node=1/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=9/23][local_rank=9/11][node=1/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=23/23][local_rank=11/11][node=1/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=10/23][local_rank=10/11][node=0/1]
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=11/23][local_rank=11/11][node=1/1]
[2024-06-21 12:10:09][INFO][dist:257] - DistInfo={
    "DEVICE": "xpu",
    "DEVICE_ID": "xpu:0",
    "DISTRIBUTED_BACKEND": "ccl",
    "GPUS_PER_NODE": 12,
    "HOSTFILE": "/var/spool/pbs/aux/684084.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov",
    "HOSTNAME": "x4711c1s2b0n0.hostmgmt2711.cm.aurora.alcf.anl.gov",
    "HOSTS": "['x4711c1s2b0n0', 'x4711c1s3b0n0']",
    "LOCAL_RANK": 0,
    "MACHINE": "Aurora",
    "NGPUS": 24,
    "NODE_ID": 0,
    "NUM_NODES": 2,
    "RANK": 0,
    "SCHEDULER": "PBS",
    "WORLD_SIZE_IN_USE": 24,
    "WORLD_SIZE_TOTAL": 24
}
[2024-06-21 12:10:09][INFO][dist:671] - Using oneccl_bindings from: /gecko/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.1/lib/python3.9/site-packages/oneccl_bindings_for_pytorch/__init__.py
[2024-06-21 12:10:09][INFO][dist:673] - Using ipex from: /opt/aurora/24.086.0/frameworks/aurora_nre_models_frameworks-2024.1/lib/python3.9/site-packages/intel_extension_for_pytorch/__init__.py
[2024-06-21 12:10:09][INFO][dist:674] - [0/24] Using device='xpu' with backend='DDP' + 'ccl' for distributed training.
[2024-06-21 12:10:09][INFO][dist:308] - [device='xpu'][rank=0/23][local_rank=0/11][node=0/1]
[2024-06-21 12:10:09][WARNING][dist:314] - Using [24 / 24] available "xpu" devices !!
[2024-06-21 12:10:09][INFO][dist:820] - Setting up wandb from rank: 0
[2024-06-21 12:10:09][INFO][dist:821] - Using: WB PROJECT: ezpz.test_dist
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: foremans (aurora_gpt). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.2
wandb: Run data is saved locally in /lus/gecko/projects/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/wandb/run-20240621_121010-88frtmhf
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run fanciful-sky-91
wandb: View project at https://wandb.ai/aurora_gpt/ezpz.test_dist
wandb: View run at https://wandb.ai/aurora_gpt/ezpz.test_dist/runs/88frtmhf
[2024-06-21 12:10:12][INFO][dist:851] - W&B RUN: [fanciful-sky-91](https://wandb.ai/aurora_gpt/ezpz.test_dist/runs/88frtmhf)
[2024-06-21 12:10:12][INFO][dist:884] - Running on machine='Aurora'
[2024-06-21 12:10:12][INFO][test_dist:161] - model=Network(
  (layers): Sequential(
    (0): Linear(in_features=128, out_features=1024, bias=True)
    (1): Linear(in_features=1024, out_features=512, bias=True)
    (2): Linear(in_features=512, out_features=256, bias=True)
    (3): Linear(in_features=256, out_features=128, bias=True)
    (4): Linear(in_features=128, out_features=128, bias=True)
  )
)
2024:06:21-12:10:12:(89301) |CCL_WARN| MPI was initialized externally, CCL-MPI specific environment is ignored
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=1, loss=2004.6644, dt=0.0316, dtf=0.0019842, dtb=0.0295805
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=2, loss=1451.3967, dt=0.0331, dtf=0.000990431, dtb=0.0321032
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=3, loss=1110.7493, dt=0.0032, dtf=0.000625944, dtb=0.00255556
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=4, loss=960.1414, dt=0.0034, dtf=0.0007132, dtb=0.00267326
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=5, loss=871.8055, dt=0.0033, dtf=0.000667956, dtb=0.00264322
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=6, loss=811.2060, dt=0.0031, dtf=0.000724842, dtb=0.00239383
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=7, loss=740.5243, dt=0.0031, dtf=0.000695271, dtb=0.00242035
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=8, loss=733.9873, dt=0.0032, dtf=0.000667894, dtb=0.00255756
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=9, loss=725.9017, dt=0.0032, dtf=0.000733878, dtb=0.00242794
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=10, loss=715.6084, dt=0.0031, dtf=0.000693919, dtb=0.00245252
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=11, loss=703.1045, dt=0.0031, dtf=0.000656381, dtb=0.00241897
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=12, loss=701.2744, dt=0.0030, dtf=0.00064129, dtb=0.00238017
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=13, loss=673.5835, dt=0.0030, dtf=0.000624069, dtb=0.00236501
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=14, loss=678.6807, dt=0.0030, dtf=0.000617655, dtb=0.00233964
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=15, loss=659.2550, dt=0.0030, dtf=0.000631331, dtb=0.00235357
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=16, loss=647.4363, dt=0.0041, dtf=0.00061722, dtb=0.00345485
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=17, loss=648.1597, dt=0.0031, dtf=0.000627205, dtb=0.00243802
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=18, loss=642.3028, dt=0.0041, dtf=0.000649736, dtb=0.00342747
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=19, loss=634.6727, dt=0.0030, dtf=0.00067944, dtb=0.0023197
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=20, loss=647.7621, dt=0.0030, dtf=0.000627795, dtb=0.00240372
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=21, loss=714.9183, dt=0.0030, dtf=0.000630177, dtb=0.00239472
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=22, loss=681.5583, dt=0.0035, dtf=0.00108759, dtb=0.00236911
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=23, loss=618.4785, dt=0.0030, dtf=0.000624657, dtb=0.00235128
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=24, loss=683.2552, dt=0.0030, dtf=0.000609562, dtb=0.0023567
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=25, loss=594.4351, dt=0.0030, dtf=0.000636287, dtb=0.00236924
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=26, loss=640.7553, dt=0.0030, dtf=0.00063579, dtb=0.00234961
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=27, loss=599.3002, dt=0.0031, dtf=0.000641096, dtb=0.00242388
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=28, loss=631.6345, dt=0.0030, dtf=0.000638334, dtb=0.00235881
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=29, loss=597.1453, dt=0.0030, dtf=0.000646109, dtb=0.00236585
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=30, loss=602.5551, dt=0.0030, dtf=0.00065652, dtb=0.00237217
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=31, loss=607.9491, dt=0.0030, dtf=0.000627367, dtb=0.0023948
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=32, loss=583.3282, dt=0.0030, dtf=0.000598328, dtb=0.00236966
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=33, loss=601.2754, dt=0.0030, dtf=0.000644719, dtb=0.00233818
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=34, loss=572.3402, dt=0.0167, dtf=0.00063443, dtb=0.0160382
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=35, loss=569.4097, dt=0.0226, dtf=0.000607673, dtb=0.0219853
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=36, loss=583.9858, dt=0.0032, dtf=0.000619026, dtb=0.00262442
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=37, loss=556.6193, dt=0.0029, dtf=0.000596718, dtb=0.00225504
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=38, loss=566.3671, dt=0.0030, dtf=0.00061672, dtb=0.00236669
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=39, loss=549.7924, dt=0.0031, dtf=0.000657198, dtb=0.00242559
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=40, loss=543.8108, dt=0.0030, dtf=0.000601213, dtb=0.00241252
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=41, loss=560.7616, dt=0.0030, dtf=0.000598999, dtb=0.00238951
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=42, loss=533.6393, dt=0.0030, dtf=0.000632787, dtb=0.00236811
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=43, loss=539.0911, dt=0.0031, dtf=0.000630492, dtb=0.00245194
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=44, loss=547.3221, dt=0.0030, dtf=0.000610587, dtb=0.00243129
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=45, loss=525.8107, dt=0.0030, dtf=0.000615355, dtb=0.00239578
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=46, loss=527.2558, dt=0.0031, dtf=0.000627453, dtb=0.00244288
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=47, loss=528.7238, dt=0.0030, dtf=0.000603536, dtb=0.00236802
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=48, loss=513.9382, dt=0.0199, dtf=0.000593784, dtb=0.0192791
[2024-06-21 12:10:14][INFO][test_dist:232] - iter=49, loss=516.4860, dt=0.0038, dtf=0.000651341, dtb=0.00319233
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=50, loss=508.9355, dt=0.0276, dtf=0.000647237, dtb=0.0269818
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=51, loss=517.7874, dt=0.0035, dtf=0.000596273, dtb=0.00288301
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=52, loss=515.1092, dt=0.0032, dtf=0.000629139, dtb=0.00254777
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=53, loss=500.3201, dt=0.0032, dtf=0.00065106, dtb=0.00251022
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=54, loss=497.5414, dt=0.0032, dtf=0.000616243, dtb=0.00259856
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=55, loss=492.1354, dt=0.0032, dtf=0.000632951, dtb=0.00254589
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=56, loss=496.2526, dt=0.0031, dtf=0.00062147, dtb=0.00247057
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=57, loss=484.1946, dt=0.0031, dtf=0.000630126, dtb=0.00246136
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=58, loss=482.0441, dt=0.0030, dtf=0.000623369, dtb=0.00236744
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=59, loss=480.4559, dt=0.0030, dtf=0.000628713, dtb=0.00239398
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=60, loss=483.4744, dt=0.0030, dtf=0.0006242, dtb=0.00240625
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=61, loss=479.2868, dt=0.0038, dtf=0.000636492, dtb=0.00320391
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=62, loss=466.0627, dt=0.0040, dtf=0.000621026, dtb=0.00338207
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=63, loss=462.7208, dt=0.0031, dtf=0.000694227, dtb=0.0023627
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=64, loss=465.3651, dt=0.0030, dtf=0.00064679, dtb=0.00235774
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=65, loss=460.9333, dt=0.0030, dtf=0.000621029, dtb=0.0024114
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=66, loss=458.8359, dt=0.0030, dtf=0.000627817, dtb=0.00237667
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=67, loss=446.4143, dt=0.0029, dtf=0.000598158, dtb=0.00233594
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=68, loss=449.1019, dt=0.0029, dtf=0.000623896, dtb=0.00231658
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=69, loss=446.2694, dt=0.0030, dtf=0.000657572, dtb=0.00234896
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=70, loss=442.3687, dt=0.0030, dtf=0.000637657, dtb=0.00237849
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=71, loss=434.9398, dt=0.0030, dtf=0.000643431, dtb=0.00236885
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=72, loss=429.9624, dt=0.0029, dtf=0.000614362, dtb=0.00231131
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=73, loss=438.6733, dt=0.0030, dtf=0.000627421, dtb=0.00232736
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=74, loss=427.8250, dt=0.0030, dtf=0.000662674, dtb=0.00229828
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=75, loss=427.4401, dt=0.0030, dtf=0.000620318, dtb=0.00237275
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=76, loss=420.4407, dt=0.0030, dtf=0.000655161, dtb=0.00234482
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=77, loss=416.7180, dt=0.0032, dtf=0.000661927, dtb=0.00256547
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=78, loss=404.3629, dt=0.0040, dtf=0.000632358, dtb=0.0034024
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=79, loss=420.7403, dt=0.0030, dtf=0.000618139, dtb=0.00236582
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=80, loss=408.6548, dt=0.0030, dtf=0.000600819, dtb=0.00236629
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=81, loss=406.2084, dt=0.0030, dtf=0.000631967, dtb=0.0023829
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=82, loss=405.6879, dt=0.0030, dtf=0.000657659, dtb=0.00237876
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=83, loss=400.6847, dt=0.0030, dtf=0.000653046, dtb=0.00233331
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=84, loss=388.3703, dt=0.0030, dtf=0.000640595, dtb=0.00234525
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=85, loss=389.8710, dt=0.0030, dtf=0.000629979, dtb=0.0023508
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=86, loss=387.8036, dt=0.0030, dtf=0.00063042, dtb=0.00235516
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=87, loss=376.2745, dt=0.0030, dtf=0.000652216, dtb=0.00237224
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=88, loss=381.0520, dt=0.0029, dtf=0.000610267, dtb=0.00233583
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=89, loss=390.6617, dt=0.0030, dtf=0.000663042, dtb=0.0023021
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=90, loss=386.1241, dt=0.0030, dtf=0.000652819, dtb=0.00239643
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=91, loss=357.6912, dt=0.0030, dtf=0.000633117, dtb=0.00237358
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=92, loss=376.0060, dt=0.0037, dtf=0.000637883, dtb=0.0030281
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=93, loss=365.3871, dt=0.0030, dtf=0.000614615, dtb=0.00237839
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=94, loss=367.1737, dt=0.0031, dtf=0.000631401, dtb=0.00249881
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=95, loss=370.1765, dt=0.0030, dtf=0.000615096, dtb=0.0024112
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=96, loss=363.5524, dt=0.0030, dtf=0.000643134, dtb=0.00240172
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=97, loss=355.1588, dt=0.0030, dtf=0.000642463, dtb=0.00240671
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=98, loss=352.2323, dt=0.0030, dtf=0.000608163, dtb=0.00236698
[2024-06-21 12:10:15][INFO][test_dist:232] - iter=99, loss=344.9492, dt=0.0338, dtf=0.0172624, dtb=0.0165525
[text plot: train/dt vs. iter (1.0 to 99.0), y-axis 0.0029 to 0.0338, stamped 2024-06-21-121015; layout not recoverable]
[2024-06-21 12:10:15][INFO][test_dist:144] - Appending plot to: /lus/gecko/projects/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/test-dist-plots/train/dt.txt
text saved in /lus/gecko/projects/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/test-dist-plots/train/dt.txt
[text plot: train/dtf vs. iter (1.0 to 99.0), y-axis 0.0006 to 0.0173, stamped 2024-06-21-121015; layout not recoverable]
[2024-06-21 12:10:15][INFO][test_dist:144] - Appending plot to: /lus/gecko/projects/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/test-dist-plots/train/dtf.txt
text saved in /lus/gecko/projects/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/test-dist-plots/train/dtf.txt
[text plot: train/dtb vs. iter (1.0 to 99.0), y-axis 0.0023 to 0.0321, stamped 2024-06-21-121015; layout not recoverable]
[2024-06-21 12:10:15][INFO][test_dist:144] - Appending plot to: /lus/gecko/projects/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/test-dist-plots/train/dtb.txt
text saved in /lus/gecko/projects/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/test-dist-plots/train/dtb.txt
[text plot: train/loss vs. iter (1.0 to 99.0), y-axis 344.9 to 2004.7, stamped 2024-06-21-121015; layout not recoverable]
[2024-06-21 12:10:15][INFO][test_dist:144] - Appending plot to: /lus/gecko/projects/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/test-dist-plots/train/loss.txt
text saved in /lus/gecko/projects/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/test-dist-plots/train/loss.txt
[text plot: train/iter vs. iter (1.0 to 99.0), y-axis 1.0 to 99.0, stamped 2024-06-21-121015; layout not recoverable]
[2024-06-21 12:10:15][INFO][test_dist:144] - Appending plot to: /lus/gecko/projects/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/test-dist-plots/train/iter.txt
text saved in /lus/gecko/projects/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/test-dist-plots/train/iter.txt
[pyinstrument profile tree printed here; identical to the "profiler output" block above]
[2024-06-21 12:10:15][INFO][profile:115] - Saving pyinstrument profile output to: /lus/gecko/projects/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed
[2024-06-21 12:10:15][INFO][profile:123] - PyInstrument profile saved (as html) to: pyinstrument-profile-2024-06-21-121015.html
[2024-06-21 12:10:15][INFO][profile:131] - PyInstrument profile saved (as text) to: pyinstrument-profile-2024-06-21-121015.txt
[2024-06-21 12:10:15][INFO][profile:143] - Finished with pyinstrument profiler. Took: 2.58299s
wandb: - 0.505 MB of 0.505 MB uploaded
wandb:
wandb:
wandb: Run summary:
wandb: timers/ezpz.setup_torch 0.31212
wandb: timers/imports 6e-05
wandb: timers/init_to_first_step 4.12052
wandb: timers/runtime 6.17023
wandb: train/dt 0.03381
wandb: train/dtb 0.01655
wandb: train/dtf 0.01726
wandb: train/iter 99
wandb: train/loss 344.94916
wandb:
wandb: View run fanciful-sky-91 at: https://wandb.ai/aurora_gpt/ezpz.test_dist/runs/88frtmhf
wandb: View project at: https://wandb.ai/aurora_gpt/ezpz.test_dist
wandb: Synced 7 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240621_121010-88frtmhf/logs
Application 00fdff5d resources: utime=148s stime=119s maxrss=1706260KB inblock=66650 oublock=1896 minflt=8281849 majflt=14624 nvcsw=292958 nivcsw=33749
```
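
For reviewers: the two gating rules from the NOTE comments in the example (profile on rank 0 only; with `strict=True`, require `PYINSTRUMENT_PROFILER=1`) can be sketched with the standard library alone. This is not the code in `ezpz/profile.py`. The names `sketch_context_manager` and `TimerContext` are hypothetical, a wall-clock timer stands in for the pyinstrument profiler, and matching the exact value `"1"` is an assumption about how the variable is checked.

```python
import contextlib
import os
import time


class TimerContext:
    """Hypothetical stand-in for a profiler: times the enclosed block."""

    def __init__(self):
        self.elapsed = None

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self._start
        return False  # never swallow exceptions from the profiled block


def sketch_context_manager(rank=None, strict=True, env=None):
    """Illustrates the gating rules only; NOT ezpz's implementation."""
    env = os.environ if env is None else env
    # Rule 1: when a rank is given, only rank 0 gets a live profiler.
    if rank is not None and rank != 0:
        return contextlib.nullcontext()
    # Rule 2 (assumed check): strict mode requires PYINSTRUMENT_PROFILER=1.
    if strict and env.get("PYINSTRUMENT_PROFILER") != "1":
        return contextlib.nullcontext()
    return TimerContext()


if __name__ == "__main__":
    cm = sketch_context_manager(rank=0, strict=True)
    with cm:
        sum(range(100_000))
    # A nullcontext has no `elapsed` attribute, hence the getattr default.
    print("elapsed:", getattr(cm, "elapsed", "profiling disabled"))
```

Returning `contextlib.nullcontext()` on the disabled paths lets callers keep a single `with cm:` block on every rank, so the profiled code needs no branching of its own.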