tenstorrent / tt-metal

:metal: TT-NN operator library and TT-Metalium low-level kernel programming model.
https://docs.tenstorrent.com/ttnn/latest/index.html
Apache License 2.0

Investigate what slowed down device perf by almost 2x for certain tests #9467

Closed: tt-rkim closed this issue 5 months ago

tt-rkim commented 5 months ago

Before libc++: https://github.com/tenstorrent/tt-metal/actions/runs/9521692385

After libc++: https://github.com/tenstorrent/tt-metal/actions/runs/9530903488

Looks like a lot of tests got slower... almost 2x in some cases.

We need to investigate if libc++ is the culprit. These two runs should be 1 commit apart.
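To narrow down which tests regressed between the two runs, a small comparison script could flag anything that slowed down past a threshold. This is a sketch, not part of tt-metal; the test names and durations below are made up for illustration (real data would come from the Actions run artifacts linked above):

```python
# Hypothetical helper for spotting per-test slowdowns between two CI runs.
# All test names and durations here are illustrative, not real measurements.

def find_slowdowns(before, after, threshold=1.5):
    """Return {test: ratio} for tests whose duration grew by >= threshold x."""
    slow = {}
    for test, t_before in before.items():
        t_after = after.get(test)
        if t_after is not None and t_before > 0 and t_after / t_before >= threshold:
            slow[test] = round(t_after / t_before, 2)
    return slow

# Durations in seconds (made up):
before = {"test_bert_perf": 60.0, "test_resnet_perf": 45.0}
after = {"test_bert_perf": 110.0, "test_resnet_perf": 46.0}

print(find_slowdowns(before, after))  # only the ~2x test is flagged
```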

cc: @mo-tenstorrent @yan-zaretskiy @TT-billteng @vtangTT

yan-zaretskiy commented 5 months ago

@tt-rkim Just to clarify — do you mean the timeout issue above for the GS device perf job? I don't see any slowdowns on the N300 WH job. Both took 17 mins and 50ish sec to run and the perf report is very close, prob within the measuring error margin.

tt-rkim commented 5 months ago

Yes, just the GS device perf job. Which is super weird. Still, not sure what would cause such a difference... not sure if we see this with all the models. So perhaps certain ops cause a slowdown. For example, we don't run BERT perf on WH.

The re-run on the commit before libc++ passed. Running again

mo-tenstorrent commented 5 months ago

Trying to reproduce locally as well.

The post-commit profiler job also got slower. @tt-rkim, can you think of any other CI job that runs pytest on GS?

mo-tenstorrent commented 5 months ago

Ok, so tt_metal ReadFromDevice went from ~2-3s per call to ~17-18s.
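A per-call figure like this can also be confirmed locally with a simple wall-clock harness, independent of the profiler. This is only a sketch; `read_from_device` below is a stand-in stub, not the real tt_metal API:

```python
# Minimal timing harness; read_from_device is a dummy stand-in for the
# real device read, used here only to show the measurement pattern.
import statistics
import time

def time_calls(fn, n=5):
    """Return the median wall-clock duration of n calls to fn, in seconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

def read_from_device():
    # Stub: simulate a fixed-cost read so the harness has something to time.
    time.sleep(0.01)

median = time_calls(read_from_device)
print(f"median ReadFromDevice time: {median:.3f}s")
```

Running the same harness against the before/after commits would show the per-call delta directly.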

Before:

[Screenshot: 2024-06-17 10:27 AM]

After:

[Screenshot: 2024-06-17 10:26 AM]

The device perf regression test calls it to pull profiling data from DRAM, which is why most other GS tests are fine.

I would say we up the timeout on Device Perf CI for now and start a separate ticket on the ReadFromDevice regression.

tt-rkim commented 5 months ago

Sounds good, will increase timeout

Did you see what the underlying call times are like between the two?

mo-tenstorrent commented 5 months ago

I was trying BERT, and the run went from ~1 min to ~1 min 30 s. So all of the time increase is coming from ReadFromDevice.
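The numbers above roughly add up, assuming the run issues about two such reads (the read count is my assumption, not something measured in the thread):

```python
# Back-of-the-envelope check; reads_per_run is an assumed count.
per_call_before = 2.5   # seconds, midpoint of the ~2-3s figure above
per_call_after = 17.5   # seconds, midpoint of the ~17-18s figure above
reads_per_run = 2       # assumed number of ReadFromDevice calls per run

extra = (per_call_after - per_call_before) * reads_per_run
print(extra)  # 30.0 seconds, consistent with the ~1min -> ~1min30s delta
```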

tt-rkim commented 5 months ago

Is that tracy? Is it showing anything underneath?

mo-tenstorrent commented 5 months ago

Oh I see, no, no further child calls are recorded.

mo-tenstorrent commented 5 months ago

A bit more info on this:

[Screenshot: 2024-06-17 3:08 PM]

So the slowdown is coming directly from the umd read_from_device calls.

tt-rkim commented 5 months ago

@mo-tenstorrent we can close?

mo-tenstorrent commented 5 months ago

Yes, #9516 fixed this