tenstorrent / tt-metal


LLMs end2end perf regression #9479

Closed mtairum closed 1 day ago

mtairum commented 1 week ago

Updated main post with information from the first few comments

Commit that introduced the issue

-> The performance issues were introduced in commit https://github.com/tenstorrent/tt-metal/commit/87378f73d1e1c55e2f9c3d4d58676bf1521b79fe, which updates the device pool logic.

Issue

5 to 6x end-to-end performance degradation on LLMs. Tested on Mixtral and Falcon40b. Waiting for Llama numbers.

This issue is affecting only end-to-end performance. Device performance (tested on Mixtral) remains the same.

I've generated Tracy trace files for both the bad and the good performance runs (good commit: c1c3f6b). They are available in gdrive here

In the Tracy traces there is a larger gap between a device execution/iteration and WorkerPop (see figures in the comments below). Specifically, the good-perf commit shows gaps of 85us to 2ms across all 8 devices, versus 10ms to 45ms on the bad one.

Reproduction

Requires flag: WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml

Other flags that I have set up in my env:

export TT_METAL_ASYNC_DEVICE_QUEUE=1
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
export TTNN_CONFIG_OVERRIDES='{"enable_fast_runtime_mode": true}'
# Bad perf commit
git checkout 87378f73d1e1c55e2f9c3d4d58676bf1521b79fe
git cherry-pick 14dd9a6a4dafc6973a821bf5bdfe5cc6fe87918c  # Small fix required

# Good perf commit is the one before: c1c3f6b5716c29baa4244485d2ec47a1c6250ce4

# Falcon40b test is the easiest to replicate
pytest models/demos/t3000/falcon40b/tests/test_perf_falcon.py::test_perf_bare_metal[falcon_40b-prefill_seq2048_bfp8-8chips]

For Mixtral reproduction you will need the weights on your machine. Follow the Mixtral README for setup instructions (or copy the weights from a machine that already has them).

For example, once you have the repacked weights you just have to point these flags at the corresponding folders:

export MIXTRAL_CKPT_DIR=<repacked_output_dir>
export MIXTRAL_TOKENIZER_PATH=<path_to_tokenizer_dir>
export MIXTRAL_CACHE_PATH=<weights_cache_dir>

To run the Mixtral perf test:

pytest -svv models/demos/t3000/mixtral8x7b/tests/test_mixtral_perf.py::test_mixtral_model_perf[wormhole_b0-True-2048-150-0.025]

Mixtral-only TODO after perf goes back to normal

The current Mixtral perf test measures a 1-layer (1L) model. Per-layer time does not scale linearly with layer count, so we get a more accurate perf estimate when running 32L.

Once the perf is fixed, I'll update the perf test with tighter expected values to avoid this issue in the future. The current target should be 158.4ms e2e perf for 32L and a 2048 KV cache length.
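
As a rough illustration only (not the repo's actual test code), a tightened check could look like the sketch below; the 158.4ms target comes from this issue, while the 15% margin and the function name are assumptions:

# Hypothetical sketch of a tightened e2e assertion for the 32L Mixtral perf test.
# The 158.4 ms target is from this issue; the margin and names are placeholders.
EXPECTED_E2E_MS = 158.4   # 32L, 2048 KV cache length
MARGIN = 1.15             # assumed headroom before the test fails

def check_e2e_perf(measured_e2e_ms: float) -> None:
    limit = EXPECTED_E2E_MS * MARGIN
    assert measured_e2e_ms <= limit, (
        f"Mixtral 32L e2e perf regressed: {measured_e2e_ms:.1f} ms > {limit:.1f} ms allowed"
    )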

mtairum commented 1 week ago

The commit that introduced the perf regression is: https://github.com/tenstorrent/tt-metal/commit/87378f73d1e1c55e2f9c3d4d58676bf1521b79fe

To successfully run a Mixtral test with the above commit, one also needs to cherry-pick: https://github.com/tenstorrent/tt-metal/commit/14dd9a6a4dafc6973a821bf5bdfe5cc6fe87918c

With the above, Mixtral 32L perf decreases from ~160ms to ~850ms

Looking into this...

mtairum commented 1 week ago

FYI @aliuTT

aliuTT commented 1 week ago

What machine are you running on, and which devices are you using? I did add more cleanup and bug-fix logic in that commit: basically resetting device runtime objects and deallocating more buffers, which helps general runtime stability. Not sure if those are included in the e2e runtime.

mtairum commented 1 week ago

This was tested on T3002 (sjc-lab-t3002) using all 8 devices with program cache enabled.

Our e2e numbers above capture the model run plus getting the output back to host. We do 1 warmup iteration and then measure a single iteration:

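  # Measured region: device execution plus conversion/readback of the output to host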
  profiler.start(f"model_run_for_inference_{i}")
  tt_out = tt_model(decode_input, start_pos, current_pos, attn_mask, rot_mat)

  # Convert ttnn tensor to torch tensor
  tt_output_torch = (
      ttnn.to_torch(tt_out, mesh_composer=ConcatMeshToTensor(tt_model.device_mesh, dim=0))[0]
      .squeeze(1)
      .view(batch, seqlen, -1)
      .detach()
      .float()
  )
  profiler.end(f"model_run_for_inference_{i}")

Also tested this on our demo code (with the profiler and Tracy disabled) and the perf reported there is similar (around 850ms e2e) when printed to the terminal using Python time. So that rules out any slowdown due to the profiler.
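
For reference, a minimal sketch of that plain-Python-time cross-check, assuming a run_iteration callable that wraps one decode step including the to_torch() readback shown above (the name is a placeholder, not the demo's actual code):

import time

def time_one_iteration(run_iteration) -> float:
    run_iteration()                                # warmup: compile + fill program cache
    start = time.perf_counter()
    run_iteration()                                # measured: device run + output back to host
    return (time.perf_counter() - start) * 1e3     # e2e time in milliseconds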

I've run Tracy for 1L on both the previous commit (good_perf) and this commit (bad_perf). The Tracy files are shared in gdrive: https://drive.google.com/drive/folders/1s9Crrs3St1qw96om5VEbJsvjc5qZNpgm?usp=sharing

I noticed that in the good_perf trace the gaps between running the iteration on device and PopWork are much shorter than in the bad_perf trace.

On good_perf: [good_perf Tracy screenshot]

On bad_perf: [bad_perf Tracy screenshot]

The to op on the main thread takes a similar amount of time in both: 13ms on bad and 14.26ms on good.

mtairum commented 1 week ago

I'm bumping this issue to P0 since I've confirmed it's affecting other models. @aliuTT, if you can prioritise looking into this issue we would appreciate it, to avoid newer commits compounding the perf impact.

The Falcon40b prefill perf test shows a similar perf degradation from the same commit (87378f73d1e1c55e2f9c3d4d58676bf1521b79fe) onwards.

Bad perf commit 87378f73d1e1c55e2f9c3d4d58676bf1521b79fe

Good perf commit c1c3f6b571

FYI @johanna-rock-tt

--

Additionally, I checked device perf with the Mixtral perf test and it isn't affected by this. All op timings matched almost perfectly one to one.

Only end2end perf is affected by this commit.

aliuTT commented 1 week ago

I wasn't able to repro the passing case because my machine had old FW. Finally installed the latest, so hopefully I can repro the issue now :(

aliuTT commented 1 week ago

Unfortunately I still can't repro on the passing commit with the new FW. I'm doing:

git checkout c1c3f6b571
CONFIG=Release ./build_metal.sh
pytest "models/demos/t3000/falcon40b/tests/test_perf_falcon.py::test_perf_bare_metal[falcon_40b-prefill_seq2048_bfp8-8chips]"

This sequence is hanging for me. Are there other env variables to set?

mtairum commented 4 days ago

@aliuTT Where is it hanging?

The warmup run should take approximately 5 min (timings below).

I'm running with the following flags:

export TT_METAL_ASYNC_DEVICE_QUEUE=1
export TTNN_CONFIG_OVERRIDES='{"enable_fast_runtime_mode": true}'
# This is necessary
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml

# These shouldn't be necessary
export ENABLE_TRACY=1
export ENABLE_PROFILER=1

I've tried to reproduce your steps on sjc-lab-3002. Seeing the error: RuntimeError: TT_FATAL @ ../tt_metal/impl/allocator/allocator.cpp:105: num_shards.value() <= num_compute_banks

-> The eth dispatch flag is required: WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest "models/demos/t3000/falcon40b/tests/test_perf_falcon.py::test_perf_bare_metal[falcon_40b-prefill_seq2048_bfp8-8chips]" works fine.

aliuTT commented 4 days ago

Ah, I wasn't using eth dispatch; it wasn't mentioned in the original repro steps. I'll try this again.

aliuTT commented 4 days ago

I added these env flags:

export TT_METAL_ASYNC_DEVICE_QUEUE=1
export TTNN_CONFIG_OVERRIDES='{"enable_fast_runtime_mode": true}'
# This is necessary
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml

WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest "models/demos/t3000/falcon40b/tests/test_perf_falcon.py::test_perf_bare_metal[falcon_40b-prefill_seq2048_bfp8-8chips]" is hanging here:

-instruct/transformer.word_embeddings.weight_6_8_dtype_BFLOAT16_layout_ROW_MAJOR.bin of shape ttnn.Shape([1, 1, 65024, 1024])
2024-06-24 14:55:00.597 | DEBUG    | ttnn.operations.core:as_tensor:698 - Loaded cache for models/demos/t3000/falcon40b/datasets/tt_dnn-models/tt/Falcon/tiiuae/falcon-40b-instruct/transformer.word_embeddings.weight_7_8_dtype_BFLOAT16_layout_ROW_MAJOR.bin of shape ttnn.Shape([1, 1, 65024, 1024])
Loading decoder layers: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [01:00<00:00,  1.00s/it]
2024-06-24 14:56:42.187 | INFO     | models.demos.t3000.falcon40b.tests.test_perf_falcon:run_test_FalconCausalLM_end_to_end:148 - Running Falcon model once to fill caches -> disable profiler
2024-06-24 14:56:42.190 | INFO     | tracy:signpost:43 - COMPILE_RUN

I'll try adding the ENABLE_TRACY=1 and ENABLE_PROFILER=1 flags.

aliuTT commented 4 days ago

Still hangs for me. I'll reach out to book a meeting with you @mtairum. I probably need to borrow your machine, or check FW versions.

mikevin920 commented 4 days ago

We are also seeing a similar degradation; we can book a meeting and use our machine as well.

aliuTT commented 4 days ago

The latest FW, 80.9 (fw_pack-80.9.0.0.fwbundle from tt-firmware), is broken for all the tests I've been given here. Where do I get the "working" FW for these models?

aliuTT commented 4 days ago

@mtairum Can you try your Mixtral test with sync and async on both c1c3f6b571 and 87378f73d1e? I'm debugging with Kevin and Ammar; we're seeing that without the DevicePool changes in 87378f73d1e, async mode is causing a perf degradation for larger sequence lengths for LLAMA.
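
In case it helps, a rough sketch of that sync-vs-async comparison, driving the existing pytest command with and without TT_METAL_ASYNC_DEVICE_QUEUE; the env vars and test path are the ones from this thread, and the measurement is coarse wall-clock time around pytest rather than the test's own e2e number:

# Rough sketch: run the Mixtral perf test with and without the async device
# queue and compare total wall-clock time. This is only a coarse proxy; the
# authoritative e2e number still comes from the test's own output.
import os
import subprocess
import time

TEST = (
    "models/demos/t3000/mixtral8x7b/tests/test_mixtral_perf.py"
    "::test_mixtral_model_perf[wormhole_b0-True-2048-150-0.025]"
)

def run_once(async_mode: bool) -> float:
    env = os.environ.copy()
    env["WH_ARCH_YAML"] = "wormhole_b0_80_arch_eth_dispatch.yaml"
    if async_mode:
        env["TT_METAL_ASYNC_DEVICE_QUEUE"] = "1"
    else:
        env.pop("TT_METAL_ASYNC_DEVICE_QUEUE", None)
    start = time.perf_counter()
    subprocess.run(["pytest", "-svv", TEST], env=env, check=True)
    return time.perf_counter() - start

for mode in (False, True):
    print(f"async={mode}: {run_once(mode):.1f} s total pytest wall time")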

aliuTT commented 4 days ago

@mtairum Can you also try without PROFILER and TRACY enabled? They can perturb e2e measurements.

mtairum commented 3 days ago

@aliuTT I've tried without PROFILER/TRACY enabled, and with/without async.

Only async e2e perf seems affected, like you mentioned. Numbers below.

Good commit: c1c3f6b571

Bad commit: 87378f73d1e

aliuTT commented 3 days ago

I was getting a hang on sjc-snva-t3011 and sjc-lab-t3005 for the Mixtral tests. I'm able to repro a passing case on sjc-lab-t3004; will continue looking.

aliuTT commented 2 days ago

@mikevin920 @mtairum Found the bug, which was causing each device to always set its work executor to worker 0. Try out dc3c72f as a fix, on your repro cases from a few weeks ago and on main if possible. I was seeing Mixtral perf recover to ~250ms e2e with this change. I have a PR to push this to main.
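
To illustrate why that kind of bug hurts e2e perf while leaving device perf untouched, a toy sketch in plain Python (not the actual tt-metal work-executor code): when every device submits its host-side work to the same single worker, dispatch for the 8 devices serializes, while each device's own execution time is unchanged.

# Toy illustration only: eight "devices" each submit a fixed amount of
# host-side work. With one worker per device the submissions overlap; with
# every device pinned to worker 0 they serialize, inflating e2e latency even
# though per-device work time is identical.
import time
from concurrent.futures import ThreadPoolExecutor, wait

NUM_DEVICES = 8
HOST_WORK_S = 0.01  # stand-in for per-device host-side dispatch + readback

def host_side_work(device_id: int) -> None:
    time.sleep(HOST_WORK_S)

def run(executors) -> float:
    start = time.perf_counter()
    futures = [executors[d].submit(host_side_work, d) for d in range(NUM_DEVICES)]
    wait(futures)
    return time.perf_counter() - start

per_device_workers = [ThreadPoolExecutor(max_workers=1) for _ in range(NUM_DEVICES)]
all_on_worker_0 = [ThreadPoolExecutor(max_workers=1)] * NUM_DEVICES

print(f"one worker per device: {run(per_device_workers) * 1e3:.0f} ms")
print(f"all devices on worker 0: {run(all_on_worker_0) * 1e3:.0f} ms (~{NUM_DEVICES}x)")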

mtairum commented 1 day ago

Great work @aliuTT.

I did some testing and can confirm that the perf on main matches what we would expect. Mixtral e2e measured at 117ms.

Also tested the previous bad commit with the fix applied, and the perf is around 150ms, which is close to what it was before as well. 👍

aliuTT commented 1 day ago

Pushed to main; it looks like the issue is resolved.