pytorch / kineto

A CPU+GPU profiling library that provides access to timeline traces and hardware performance counters.

No RANK specified in the TRACE FILE #974

Open Ilex00para opened 3 months ago

Ilex00para commented 3 months ago

I tried to use `hta.trace_analysis.TraceAnalysis` with trace files (JSON) created by the PyTorch profiler, as recommended in the documentation:

```python
import torch

with torch.profiler.profile(
    # start after 1 step, warm up for 1 step, record 3 steps, repeat once
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
    # save the trace for TensorBoard via a trace handler object
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
    record_shapes=True,
    with_flops=True,
    with_modules=True,  # the torch.profiler.profile keyword for module call stacks is with_modules
    profile_memory=True,
    with_stack=True,
) as prof:
    for step in range(6):
        torch.randn(128, 128) @ torch.randn(128, 128)  # placeholder workload
        prof.step()  # advance the profiler schedule
```
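Afterwards I point HTA at the resulting trace directory, roughly like this (the directory matches the `tensorboard_trace_handler` output above):

```python
from hta.trace_analysis import TraceAnalysis

# Load the JSON trace(s) written by the profiler run above
analyzer = TraceAnalysis(trace_dir="./log/resnet18")
```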

I get an error that no rank is specified in the trace file, even though I don't use distributed processes:

```
2024-08-08 14:03:38,854 - hta - trace_file.py:L61 - ERROR - If the trace file does not have the rank specified in it, then add the following snippet key to the json files to use HTA; "distributedInfo": {"rank": 0}. If there are multiple traces files, then each file should have a unique rank value.
2024-08-08 14:03:38,855 - hta - trace_file.py:L92 - WARNING - There is no item in the rank to trace file map.
2024-08-08 14:03:38,855 - hta - trace.py:L535 - INFO - ranks=[]
2024-08-08 14:03:38,856 - hta - trace.py:L541 - ERROR - The list of ranks to be parsed is empty.
```
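The message suggests adding a `"distributedInfo"` key as a workaround; if I understand it correctly, that would mean patching each trace file along these lines (the file name here is a placeholder):

```python
import json

trace_path = "./log/resnet18/worker.pt.trace.json"  # placeholder file name

with open(trace_path) as f:
    trace = json.load(f)

# Key requested by the HTA error message; with multiple trace files,
# each one would need a distinct rank value.
trace.setdefault("distributedInfo", {"rank": 0})

with open(trace_path, "w") as f:
    json.dump(trace, f)
```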

What could be the problem? I saw that you are working on support for tracing distributed workloads; is this related?

sraikund16 commented 2 months ago

What type of workload were you running? If it was PT2, it may have been caused by a profiling bug introduced in https://github.com/pytorch/pytorch/pull/134893. Please pull the latest and try again.