Open rohan-varma opened 2 months ago
The profiler is hard to use for distributed training because it gets enabled on all ranks: every rank writes to the same trace file, so the file is overwritten and there is no way to tell which rank a given profile came from.

Torch profiler was added as an optional component in https://github.com/pytorch/torchtune/pull/627, and we showcase how to use it in the lora_finetune_single_device.py recipe, which won't have this issue since there is only a single process. To address the distributed case, we have 2 options.

For now, we can probably enable the profiler only on rank 0. We might lose some information, such as the ability to detect stragglers on non-zero ranks, but I'm not particularly concerned about the straggler issue for single-node use cases.

cc: @kartikayk
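A minimal sketch of the rank-0-only idea, for discussion. The helper names (`rank0_profiler_ctx`, `trace_path`) and reading the rank from the `RANK` environment variable are my assumptions for illustration, not torchtune's actual API; it also sketches the alternative of tagging trace files per rank so they don't overwrite each other:

```python
import os
from contextlib import nullcontext


def get_rank() -> int:
    # Assumption: read the rank from the RANK env var (set by torchrun)
    # instead of torch.distributed.get_rank(), to keep this sketch
    # free of a torch dependency.
    return int(os.environ.get("RANK", "0"))


def rank0_profiler_ctx(make_profiler):
    # Return a real profiler context on rank 0 and a no-op context
    # elsewhere. `make_profiler` would be something like
    # `lambda: torch.profiler.profile(...)` in the actual recipe.
    return make_profiler() if get_rank() == 0 else nullcontext()


def trace_path(out_dir: str) -> str:
    # Alternative approach: keep profiling on every rank but tag each
    # trace file with the rank so the files are distinguishable.
    return os.path.join(out_dir, f"trace_rank{get_rank()}.json")
```

A recipe could then wrap its training loop in `with rank0_profiler_ctx(...)`, and non-zero ranks would simply run unprofiled.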