pytorch / torchtune

A Native-PyTorch Library for LLM Fine-tuning
BSD 3-Clause "New" or "Revised" License

Enable profiler only on rank 0 #885

Open rohan-varma opened 2 months ago

rohan-varma commented 2 months ago

The profiler is a little hard to use for distributed training since it gets enabled on all ranks. This results in each rank overwriting the same trace file, so it's unclear which rank a given profile came from.

For now, we can probably enable the profiler only on rank 0. We might lose some information, such as the ability to detect stragglers on non-zero ranks, but I'm not particularly concerned about straggler issues for single-node use cases.
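For reference, here's a minimal sketch of that idea, assuming `torch.distributed` is initialized by the recipe before profiling starts. The `maybe_profile` helper and the trace directory are hypothetical names for illustration, not torchtune APIs:

```python
# Sketch: enable torch.profiler only on rank 0 of a distributed job.
# Assumptions: torch.distributed is already initialized by the recipe;
# maybe_profile and trace_dir are hypothetical, not torchtune APIs.
import contextlib

import torch.distributed as dist
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler


def maybe_profile(trace_dir: str = "./profiler_traces"):
    """Return a real profiler on rank 0 and a no-op context manager elsewhere."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank != 0:
        # Non-zero ranks run unprofiled; `with` yields None here.
        return contextlib.nullcontext()
    return profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        on_trace_ready=tensorboard_trace_handler(trace_dir),
    )


# Usage inside a training loop:
#   with maybe_profile() as prof:
#       for batch in dataloader:
#           train_step(batch)
#           if prof is not None:
#               prof.step()  # advance the profiler schedule on rank 0 only
```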

SLR722 commented 1 month ago

The torch profiler was added as an optional component in https://github.com/pytorch/torchtune/pull/627, and we showcase how to use it in the lora_finetune_single_device.py recipe, which doesn't have this issue. To address this, we have two options:

  1. Add an additional showcase in one of the distributed recipes
  2. Move the torch profiler showcase to a distributed recipe, if we think it has more showcase value there (see the sketch below)
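For what it's worth, if the showcase does move to a distributed recipe, one way to sidestep the trace-overwrite problem without dropping non-zero ranks entirely is the `worker_name` argument of `torch.profiler.tensorboard_trace_handler`, which gives each rank its own trace file. A rough sketch, with the directory name as an example only:

```python
# Sketch: per-rank trace files instead of gating on rank 0.
# Assumption: torch.distributed is initialized; the directory is an example.
import torch.distributed as dist
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler

rank = dist.get_rank() if dist.is_initialized() else 0

# worker_name makes each rank write a distinct trace file under the same
# directory, so traces are not overwritten and the rank stays identifiable.
profiler = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    on_trace_ready=tensorboard_trace_handler(
        "./profiler_traces", worker_name=f"rank{rank}"
    ),
)
```

This keeps straggler information from all ranks, at the cost of profiling overhead on every rank.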

cc: @kartikayk