tenstorrent / tt-metal

TT-NN operator library and TT-Metalium low-level kernel programming model.
Apache License 2.0

Llama e2e burndown #8264

Open · cglagovichTT opened this issue 4 months ago

cglagovichTT commented 4 months ago

Track issues which will improve Llama e2e perf

Runtime-specific

wait_for_tensor_data_populated in worker threads has high variance and can take as long as hundreds of microseconds. It appears that these waits occur when worker threads have back-to-back workloads (the queue is filled faster than it is popped).

In each worker, create_device_tensor, compute_program_hash, set_runtime_args, and HWCommandQueue_enqueue_program each take 5-20 microseconds.


In the main thread, LaunchOp is usually > 20 microseconds. This introduces staggering between the workers.


Op-specific

Certain ops are especially slow to set runtime args. See the spreadsheet for an ordered list; the top offenders are shown below:

[screenshot: top offenders by set-runtime-args time]
cglagovichTT commented 4 months ago

Baseline: c2b70fa6fb63d07b70152b71a6aca581239f3d5e (origin/model-team/ttnn-llama)

2-layer demo with:

pytest -svv models/experimental/llama2_70b/demo/demo.py::test_LlamaModel_demo[wormhole_b0-True-greedy-tt-70b-T3000-2L-decode_only]

Measured the first cached execution of the model forward call:

11.49 ms

cglagovichTT commented 4 months ago

Owned multi-device tensor storage: https://github.com/tenstorrent/tt-metal/commit/094842cb859a5b5147e4ef8e310c99fcb96cf07f

9 ms

cglagovichTT commented 4 months ago

AllGather is one of the slowest ops to launch since ttnn converts the multi-device tensor to a list of per-device tensors and back again. This slows the launch and creates a ton of intermediate tensors that are immediately garbage collected, which is also slow.
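For context, the host-side pattern being described looks roughly like the following. This is a simplified, hypothetical C++ sketch (MultiDeviceTensor, DeviceTensor, and launch_all_gather_on_shard are placeholders, not the real ttnn/tt-metal types): exploding the aggregate into a temporary list of per-device tensors and re-assembling it afterwards materializes N short-lived objects per launch, whereas an op that consumes the aggregate directly avoids them.

```cpp
#include <utility>
#include <vector>

// Placeholder types, not the real ttnn classes.
struct DeviceTensor { int device_id; };
struct MultiDeviceTensor { std::vector<DeviceTensor> shards; };

// Stand-in for launching the per-device all-gather work.
DeviceTensor launch_all_gather_on_shard(const DeviceTensor& shard) { return shard; }

// Pattern described above: explode the multi-device tensor into a temporary
// list, run per-device, then rebuild. The temporaries die immediately.
MultiDeviceTensor all_gather_via_tensor_list(const MultiDeviceTensor& input) {
    std::vector<DeviceTensor> per_device(input.shards.begin(), input.shards.end());  // N copies
    std::vector<DeviceTensor> outputs;
    outputs.reserve(per_device.size());
    for (const DeviceTensor& t : per_device) {
        outputs.push_back(launch_all_gather_on_shard(t));
    }
    return MultiDeviceTensor{std::move(outputs)};  // re-aggregate
}

// Desired shape of the fix: operate on the aggregate directly, so no
// intermediate per-device tensor list is created on every launch.
MultiDeviceTensor all_gather_multidevice(const MultiDeviceTensor& input) {
    MultiDeviceTensor output;
    output.shards.reserve(input.shards.size());
    for (const DeviceTensor& shard : input.shards) {
        output.shards.push_back(launch_all_gather_on_shard(shard));
    }
    return output;
}
```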

davorchap commented 4 months ago

> AllGather is one of the slowest ops to launch since ttnn converts the multi-device tensor to a list of per-device tensors and back again. This slows the launch and creates a ton of intermediate tensors that are immediately garbage collected, which is also slow.

@cfjchu what's the proper way of doing this?

cglagovichTT commented 4 months ago

Python overhead exists but may not be our biggest issue right now. For example, it takes 25 microseconds for the model to call into a tt_lib function due to conditionals and function call overhead.


However, worker threads are usually constantly busy and bottleneck e2e perf before Python does. Here's an example of a worker that is running behind the main thread:

[tracy screenshot]
cfjchu commented 4 months ago

> AllGather is one of the slowest ops to launch since ttnn converts the multi-device tensor to a list of per-device tensors and back again. This slows the launch and creates a ton of intermediate tensors that are immediately garbage collected, which is also slow.
>
> @cfjchu what's the proper way of doing this?

@cglagovichTT and I did a review of the profile. We're using ttnn.all_gather(..), and that needs to be upgraded to work directly with multi-device tensors and ported to C++. I'm focusing on this right now.

davorchap commented 4 months ago

> Python overhead exists but may not be our biggest issue right now. For example, it takes 25 microseconds for the model to call into a tt_lib function due to conditionals and function call overhead.
>
> However, worker threads are usually constantly busy and bottleneck e2e perf before Python does. Here's an example of a worker that is running behind the main thread.

25 us to call from the model into worker ttlib is an eternity on x86.

This could be the bottleneck once we fix issues within the workers.

Is it conditionals / the model call stack, or pybind overhead? We have the option to try nanobind.

cglagovichTT commented 3 months ago

> This could be the bottleneck once we fix issues within the workers.
>
> Is it conditionals / the model call stack, or pybind overhead? We have the option to try nanobind.

Those 25 us include a conditional and two method calls, and that's before we even get to the C++ impl, so it doesn't include pybind.

cglagovichTT commented 3 months ago

fyi @uaydonat

cglagovichTT commented 3 months ago

With Joseph's AllGather commit 98b4675d68e124e88a47949a120eeec7e112cc41 we see improved e2e perf, now at 4.5 tokens/s/u. Tracy shows that worker threads are much less busy, with lots of gaps between ops. This must be because AllGather finishes sooner, so workers finish sooner and the worker queue is usually empty when new work is pushed.

cglagovichTT commented 3 months ago

Tensor deallocate seems to be a good next target. According to tracy statistics, it accounts for 1.5 ms of a single-layer forward (3.7 ms), roughly 40%. It's unclear how accurate the stats are, but tensor deallocate is very frequent and occurs after most ops.


uaydonat commented 3 months ago

@yieldthought mentioned deallocates as the main bottleneck for Mixtral e2e perf as well.

cglagovichTT commented 3 months ago

Potential push_work optimization: define the lambda outside of the loop over target devices so that all arguments except target_device are copied/bound once, then bind target_device to the lambda inside the loop (see the sketch below). @tt-asaigal
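A minimal C++ sketch of the suggested pattern, assuming a push_work-style worker queue (Device, push_work, launch_on_all, and heavy_args here are placeholders, not the actual tt-metal signatures): the expensive captures are bound once outside the device loop, and only target_device is bound per iteration.

```cpp
#include <functional>
#include <memory>
#include <vector>

struct Device { int id; };  // placeholder

// Stand-in for the asynchronous per-device worker queue; here it just runs inline.
void push_work(Device* /*device*/, std::function<void()> work) { work(); }

void launch_on_all(const std::vector<Device*>& target_devices,
                   std::vector<int> heavy_args) {
    // Bind the expensive-to-copy state once, behind shared ownership, so each
    // per-device closure copies a shared_ptr instead of the full payload.
    auto shared_args = std::make_shared<std::vector<int>>(std::move(heavy_args));
    auto run_op = [shared_args](Device* target_device) {
        // ... launch the op on `target_device` using `*shared_args` ...
        (void)target_device;
    };

    for (Device* target_device : target_devices) {
        // Only target_device differs per iteration; everything else was bound above.
        push_work(target_device, [run_op, target_device] { run_op(target_device); });
    }
}
```

Defining the lambda inside the loop would re-copy every captured argument once per device; with this shape the loop only pays for the per-device pointer and a refcount bump.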

cglagovichTT commented 3 months ago

I see that some ops have a long delay between their invocation in Python and the beginning of LaunchOp. This sheet lists the top offenders. Here's an example in tracy:

[tracy screenshot]

cfjchu commented 3 months ago

> I see that some ops have a long delay between their invocation in Python and the beginning of LaunchOp. This sheet lists the top offenders.

fyi @bbradelTT @TT-BrianLiu on the tt_lib.operations.primary.matmul_1d and tt_lib.operations.primary.matmul

TT-BrianLiu commented 3 months ago

I am surprised that primary.matmul is longer than primary.matmul_1d. What's the call count for each?

As an aside, once everything is cached, these overheads should disappear and won't really matter either way.

cglagovichTT commented 3 months ago

We use primary.matmul probably 2x per layer and primary.matmul_1d 5x per layer, but this tracy profile was collected on a cached run.

TT-BrianLiu commented 3 months ago

And how is primary.matmul being used? Directly with fully specified program config?

cglagovichTT commented 3 months ago

One with program config, one without.

cglagovichTT commented 3 months ago

Update: e2e perf is at 5.8 t/s/u after main thread optimizations, thanks Aditya!

The main thread launches a layer every 2 ms or so. Worker threads take 2.5 ms to launch a layer.

LaunchOp is still on the order of 20-30 us. Worker threads finish an op in 30-40 us.

cglagovichTT commented 3 months ago

@TT-BrianLiu something strange about the MM ops is that they LaunchOp within the worker threads, whereas for other ops the worker threads just PopWork. Any clue why that is?

TT-BrianLiu commented 3 months ago

Let's take this offline.

cglagovichTT commented 3 months ago

Tracking main thread optimizations:

- 1b76fd04df441390eed075dfbd3b91a194c02229
- 1787a5c43b8e218b2edf899f0f275733f3affaab
- 9f09804f74bad5e4c87d3c5561a19e7bd0b3b634

cglagovichTT commented 3 months ago

I haven't had much luck with speeding up assemble_runtime_args_commands. Through a few optimizations, I was able to get the full HWCommandQueue_enqueue_program time from 20 us to 16 us on average. But there are a lot of unordered_map lookups into processor_to_cmd_mapping within the inner loop that updates unique runtime args, which is likely what's causing the remaining latency.

[tracy screenshot]

The long bar up front is getting physical coords from all logical coords at once. This contributed a large amount of the speedup, since originally we got physical coords from within the loop. The small bars on the right are the actual update_cmd calls, which are just memcpys (which I moved into a second for loop).
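For illustration, the restructured loop described above roughly has this shape. This is a simplified sketch, not the actual dispatch code: CoreCoord, Cmd, and coord_key are stand-ins, and it assumes the logical-to-physical translation has already been done in one batch call and that each command payload is large enough for its args. It shows the remaining per-core processor_to_cmd_mapping lookup in the first loop and the plain-memcpy update_cmd work deferred to a second loop.

```cpp
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

struct CoreCoord { int x; int y; };             // stand-in
struct Cmd { std::vector<uint32_t> payload; };  // stand-in for a command buffer slot

// Pack a physical coordinate into the map key used below (illustrative only).
static uint64_t coord_key(const CoreCoord& c) {
    return (uint64_t(uint32_t(c.x)) << 32) | uint64_t(uint32_t(c.y));
}

void update_unique_runtime_args(
    const std::vector<CoreCoord>& physical_cores,                  // batch-translated up front
    const std::vector<std::vector<uint32_t>>& unique_args_per_core,
    std::unordered_map<uint64_t, Cmd>& processor_to_cmd_mapping) {

    struct PendingCopy { uint32_t* dst; const uint32_t* src; size_t n; };
    std::vector<PendingCopy> pending;
    pending.reserve(physical_cores.size());

    // First loop: one map lookup per core resolves where its args land.
    // These lookups are what still dominate the remaining latency.
    for (size_t i = 0; i < physical_cores.size(); ++i) {
        Cmd& cmd = processor_to_cmd_mapping.at(coord_key(physical_cores[i]));
        pending.push_back({cmd.payload.data(),
                           unique_args_per_core[i].data(),
                           unique_args_per_core[i].size()});
    }

    // Second loop: the actual update_cmd work is just contiguous memcpys.
    for (const PendingCopy& c : pending) {
        std::memcpy(c.dst, c.src, c.n * sizeof(uint32_t));
    }
}
```

The point of the two-loop split is that the tail of small memcpys becomes tight and cache-friendly, while the map lookups are isolated in the first loop where they can be attacked separately.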

cglagovichTT commented 3 months ago

Stats on worker PopWork show there is a lot more work to be done.

tt-asaigal commented 3 months ago

Main thread optimizations are merged.

cglagovichTT commented 3 months ago

Update on worker thread latency contributions.

All statistics presented are summed over 8 worker threads and 2 layers. To get the amount of time each function contributes to one layer launch, divide by 16. The contribution of each function to PopWork can be calculated by dividing the total times.

- PopWork: 24.9 ms
- CreateShards: 5.48 ms
- Cache_hit_set_runtime_args: 4.26 ms
- EnqueueProgramImpl: 4.38 ms
- OpPostProcess: 1.12 ms
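Back-of-the-envelope, applying the divide-by-16 rule above gives roughly these per-layer contributions, and dividing by the PopWork total gives each function's share of PopWork:

- PopWork: 24.9 ms / 16 ≈ 1.56 ms per layer
- CreateShards: 5.48 ms / 16 ≈ 0.34 ms per layer (≈ 22% of PopWork)
- Cache_hit_set_runtime_args: 4.26 ms / 16 ≈ 0.27 ms per layer (≈ 17% of PopWork)
- EnqueueProgramImpl: 4.38 ms / 16 ≈ 0.27 ms per layer (≈ 18% of PopWork)
- OpPostProcess: 1.12 ms / 16 ≈ 0.07 ms per layer (≈ 4.5% of PopWork)

So the four functions listed account for roughly 60% of PopWork, leaving about 40% elsewhere.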

cglagovichTT commented 3 months ago

Tracy overhead might be ~2x in the main thread.

Python timing statements on the main thread measure how long it takes to launch 80L without blocking for results.

- Llama python: average latency 0.0401 seconds -> 25 t/s/u
- Llama cpp: average latency 0.0308 seconds -> 32.5 t/s/u

In tracy, the main thread takes ~1.1 ms to launch a python layer and ~0.7 ms to launch a cpp layer, which would give us 88 ms and 56 ms for the full 80L in python and cpp.

Tracy overhead when profiling the python function made it seem that a cpp impl was necessary, but these timing statements show that python is fast enough for our 20 t/s/u goal.
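To make the arithmetic behind the ~2x estimate explicit (my own check on the numbers above): 0.0401 s per 80-layer launch corresponds to 1 / 0.0401 ≈ 25 t/s/u and 0.0308 s to ≈ 32.5 t/s/u, while the tracy-derived figures of 80 × 1.1 ms = 88 ms and 80 × 0.7 ms = 56 ms would correspond to only ~11 and ~18 t/s/u. The ratios 88 / 40.1 ≈ 2.2 and 56 / 30.8 ≈ 1.8 are where the ~2x main-thread tracy overhead estimate comes from.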