tenstorrent / tt-metal


LLMs end2end perf regression #9479

Closed mtairum closed 1 day ago

mtairum commented 1 week ago

Updated main post with information from the first few comments

Commit that introduced the issue

-> The performance issues were introduced in commit https://github.com/tenstorrent/tt-metal/commit/87378f73d1e1c55e2f9c3d4d58676bf1521b79fe, which updates the device pool logic.

Issue

5 to 6x end-to-end performance degradation on LLMs. Tested on Mixtral and Falcon40b. Waiting for Llama numbers.

This issue is affecting only end-to-end performance. Device performance (tested on Mixtral) remains the same.

I've generated Tracy trace files for both the bad and the good performance runs (good commit: c1c3f6b). They are available in gdrive here

In the Tracy traces there is a larger gap between a device execution/iteration and WorkerPop (see figures in the comments below). Specifically, the good-perf commit shows gaps of 85us to 2ms across all 8 devices, versus 10ms to 45ms on the bad one.

Reproduction

Requires flag: WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml

Other flags that I have set up in my env:

export TT_METAL_ASYNC_DEVICE_QUEUE=1
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
export TTNN_CONFIG_OVERRIDES='{"enable_fast_runtime_mode": true}'
# Bad perf commit
git checkout 87378f73d1e1c55e2f9c3d4d58676bf1521b79fe
git cherry-pick 14dd9a6a4dafc6973a821bf5bdfe5cc6fe87918c  # Small fix required

# Good perf commit is the one before: c1c3f6b5716c29baa4244485d2ec47a1c6250ce4

# Falcon40b test is the easiest to replicate
pytest models/demos/t3000/falcon40b/tests/test_perf_falcon.py::test_perf_bare_metal[falcon_40b-prefill_seq2048_bfp8-8chips]

For Mixtral reproduction you will need the weights on your machine. Follow the Mixtral README for setup instructions (or copy the weights from a machine that already has them).

For example, once you have the repacked weights you just have to point these flags at the corresponding folders:

export MIXTRAL_CKPT_DIR=<repacked_output_dir>
export MIXTRAL_TOKENIZER_PATH=<path_to_tokenizer_dir>
export MIXTRAL_CACHE_PATH=<weights_cache_dir>

To run the Mixtral perf test:

pytest -svv models/demos/t3000/mixtral8x7b/tests/test_mixtral_perf.py::test_mixtral_model_perf[wormhole_b0-True-2048-150-0.025]

Mixtral-only TODO after perf goes back to normal

The current Mixtral perf test measures a 1-layer (1L) model. Per-layer time does not scale linearly with layer count, so we get a more accurate perf estimate when running 32L.

Once the perf is fixed, I'll update the perf test with tighter expected values to avoid this issue in the future. The current target should be 158.4ms e2e perf for 32L and a 2048 KV cache length.
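
As a rough illustration only (not the repo's actual test code), a tightened check could look like the sketch below; the 158.4ms target comes from this issue, while the 15% margin and the function name are assumptions:

# Hypothetical sketch of a tightened e2e assertion for the 32L Mixtral perf test.
# The 158.4 ms target is from this issue; the margin and names are placeholders.
EXPECTED_E2E_MS = 158.4   # 32L, 2048 KV cache length
MARGIN = 1.15             # assumed headroom before the test fails

def check_e2e_perf(measured_e2e_ms: float) -> None:
    limit = EXPECTED_E2E_MS * MARGIN
    assert measured_e2e_ms <= limit, (
        f"Mixtral 32L e2e perf regressed: {measured_e2e_ms:.1f} ms > {limit:.1f} ms allowed"
    )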

mtairum commented 1 week ago

The commit that introduced the perf regression is: https://github.com/tenstorrent/tt-metal/commit/87378f73d1e1c55e2f9c3d4d58676bf1521b79fe

To successfully run a Mixtral test with the above commit, one also needs to cherry-pick: https://github.com/tenstorrent/tt-metal/commit/14dd9a6a4dafc6973a821bf5bdfe5cc6fe87918c

With the above, Mixtral 32L perf decreases from ~160ms to ~850ms

Looking into this...

mtairum commented 1 week ago

FYI @aliuTT

aliuTT commented 1 week ago

What machine are you running on, and which devices are you using? I did add more cleanup and bug-fix logic in that commit: basically resetting device runtime objects and deallocating more buffers, which helps general runtime stability. Not sure if those are included in the e2e runtime.

mtairum commented 1 week ago

This was tested on T3002 (sjc-lab-t3002) using all 8 devices with program cache enabled.

Our e2e numbers above capture the model run plus getting the output back to host. We do 1 warmup iteration and then measure a single iteration:

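  # Measured region: device execution plus conversion/readback of the output to host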
  profiler.start(f"model_run_for_inference_{i}")
  tt_out = tt_model(decode_input, start_pos, current_pos, attn_mask, rot_mat)

  # Convert ttnn tensor to torch tensor
  tt_output_torch = (
      ttnn.to_torch(tt_out, mesh_composer=ConcatMeshToTensor(tt_model.device_mesh, dim=0))[0]
      .squeeze(1)
      .view(batch, seqlen, -1)
      .detach()
      .float()
  )
  profiler.end(f"model_run_for_inference_{i}")

Also tested this on our demo code (with the profiler and Tracy disabled) and the perf reported there is similar (around 850ms e2e) when printed to the terminal using Python time. So that rules out any slowdown due to the profiler.
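
For reference, a minimal sketch of that plain-Python-time cross-check, assuming a run_iteration callable that wraps one decode step including the to_torch() readback shown above (the name is a placeholder, not the demo's actual code):

import time

def time_one_iteration(run_iteration) -> float:
    run_iteration()                                # warmup: compile + fill program cache
    start = time.perf_counter()
    run_iteration()                                # measured: device run + output back to host
    return (time.perf_counter() - start) * 1e3     # e2e time in milliseconds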

I've run Tracy for 1L on both the previous commit (good_perf) and this commit (bad_perf). The Tracy files are shared in gdrive: https://drive.google.com/drive/folders/1s9Crrs3St1qw96om5VEbJsvjc5qZNpgm?usp=sharing

I noticed that in the good_perf trace the gaps between running the iteration on device and PopWork are much shorter than in the bad_perf trace.

On good_perf: [good_perf Tracy screenshot]

On bad_perf: [bad_perf Tracy screenshot]

The to op on the main thread takes a similar amount of time in both: 13ms on bad and 14.26ms on good.

mtairum commented 1 week ago

I'm bumping this issue to P0 since I've confirmed it's affecting other models. @aliuTT, if you can prioritise looking into this issue we would appreciate it, to avoid newer commits compounding the perf impact.

The Falcon40b prefill perf test shows a similar perf degradation from the same commit (87378f73d1e1c55e2f9c3d4d58676bf1521b79fe) onwards.

Bad perf commit 87378f73d1e1c55e2f9c3d4d58676bf1521b79fe

Good perf commit c1c3f6b571

FYI @johanna-rock-tt

--

Additionally, I checked device perf with the Mixtral perf test and it isn't affected by this. All op timings matched almost perfectly one to one.

Only end2end perf is affected by this commit.

aliuTT commented 1 week ago

I wasn't able to repro the passing case because my machine had old FW. Finally installed the latest, so hopefully I can repro the issue now :(

aliuTT commented 1 week ago

Unfortunately I still can't repro on the passing commit with the new FW. I'm doing:

git checkout c1c3f6b571
CONFIG=Release ./build_metal.sh
pytest "models/demos/t3000/falcon40b/tests/test_perf_falcon.py::test_perf_bare_metal[falcon_40b-prefill_seq2048_bfp8-8chips]"

This sequence is hanging for me. Are there other env variables to set?

mtairum commented 4 days ago

@aliuTT Where is it hanging?

The warmup run should take approximately 5 min (timings below).

I'm running with the following flags:

export TT_METAL_ASYNC_DEVICE_QUEUE=1
export TTNN_CONFIG_OVERRIDES='{"enable_fast_runtime_mode": true}'
# This is necessary
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml

# These shouldn't be necessary
export ENABLE_TRACY=1
export ENABLE_PROFILER=1

I've tried to reproduce your steps on sjc-lab-3002. Seeing the error: RuntimeError: TT_FATAL @ ../tt_metal/impl/allocator/allocator.cpp:105: num_shards.value() <= num_compute_banks

-> The eth dispatch flag is required: WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest "models/demos/t3000/falcon40b/tests/test_perf_falcon.py::test_perf_bare_metal[falcon_40b-prefill_seq2048_bfp8-8chips]" works fine.

aliuTT commented 4 days ago

Ah, I wasn't using eth dispatch; it wasn't mentioned in the original repro steps. I'll try this again.

aliuTT commented 4 days ago

I added these env flags:

export TT_METAL_ASYNC_DEVICE_QUEUE=1
export TTNN_CONFIG_OVERRIDES='{"enable_fast_runtime_mode": true}'
# This is necessary
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml

WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest "models/demos/t3000/falcon40b/tests/test_perf_falcon.py::test_perf_bare_metal[falcon_40b-prefill_seq2048_bfp8-8chips]" is hanging here:

-instruct/transformer.word_embeddings.weight_6_8_dtype_BFLOAT16_layout_ROW_MAJOR.bin of shape ttnn.Shape([1, 1, 65024, 1024])
2024-06-24 14:55:00.597 | DEBUG    | ttnn.operations.core:as_tensor:698 - Loaded cache for models/demos/t3000/falcon40b/datasets/tt_dnn-models/tt/Falcon/tiiuae/falcon-40b-instruct/transformer.word_embeddings.weight_7_8_dtype_BFLOAT16_layout_ROW_MAJOR.bin of shape ttnn.Shape([1, 1, 65024, 1024])
Loading decoder layers: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [01:00<00:00,  1.00s/it]
2024-06-24 14:56:42.187 | INFO     | models.demos.t3000.falcon40b.tests.test_perf_falcon:run_test_FalconCausalLM_end_to_end:148 - Running Falcon model once to fill caches -> disable profiler
2024-06-24 14:56:42.190 | INFO     | tracy:signpost:43 - COMPILE_RUN

I'll try adding the ENABLE_TRACY=1 and ENABLE_PROFILER=1 flags.

aliuTT commented 4 days ago

Still hangs for me. I'll reach out to book a meeting with you @mtairum. I probably need to borrow your machine, or check FW versions.

mikevin920 commented 4 days ago

We are also seeing a similar degradation; we can book a meeting and use our machine as well.

aliuTT commented 4 days ago

The latest FW, 80.9 (fw_pack-80.9.0.0.fwbundle from tt-firmware), is broken for all the tests I've been given here. Where do I get the "working" FW for these models?

aliuTT commented 4 days ago

@mtairum Can you try your Mixtral test with sync and async on both c1c3f6b571 and 87378f73d1e? I'm debugging with Kevin and Ammar; we're seeing that without the DevicePool changes in 87378f73d1e, async mode is causing a perf degradation for larger sequence lengths for LLAMA.
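
In case it helps, a rough sketch of that sync-vs-async comparison, driving the existing pytest command with and without TT_METAL_ASYNC_DEVICE_QUEUE; the env vars and test path are the ones from this thread, and the measurement is coarse wall-clock time around pytest rather than the test's own e2e number:

# Rough sketch: run the Mixtral perf test with and without the async device
# queue and compare total wall-clock time. This is only a coarse proxy; the
# authoritative e2e number still comes from the test's own output.
import os
import subprocess
import time

TEST = (
    "models/demos/t3000/mixtral8x7b/tests/test_mixtral_perf.py"
    "::test_mixtral_model_perf[wormhole_b0-True-2048-150-0.025]"
)

def run_once(async_mode: bool) -> float:
    env = os.environ.copy()
    env["WH_ARCH_YAML"] = "wormhole_b0_80_arch_eth_dispatch.yaml"
    if async_mode:
        env["TT_METAL_ASYNC_DEVICE_QUEUE"] = "1"
    else:
        env.pop("TT_METAL_ASYNC_DEVICE_QUEUE", None)
    start = time.perf_counter()
    subprocess.run(["pytest", "-svv", TEST], env=env, check=True)
    return time.perf_counter() - start

for mode in (False, True):
    print(f"async={mode}: {run_once(mode):.1f} s total pytest wall time")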

aliuTT commented 4 days ago

@mtairum Can you also try without PROFILER and TRACY enabled? They can perturb e2e measurements.

mtairum commented 3 days ago

@aliuTT I've tried without PROFILER/TRACY enabled, and with/without async.

Only async e2e perf seems affected, like you mentioned. Numbers below.

Good commit: c1c3f6b571

Bad commit: 87378f73d1e

aliuTT commented 3 days ago

I was getting a hang on sjc-snva-t3011 and sjc-lab-t3005 for the Mixtral tests. I'm able to repro a passing case on sjc-lab-t3004; will continue looking.

aliuTT commented 2 days ago

@mikevin920 @mtairum Found the bug, which was causing each device to always set its work executor to worker 0. Try out dc3c72f as a fix, on your repro cases from a few weeks ago and on main if possible. I was seeing Mixtral perf recover to ~250ms e2e with this change. I have a PR to push this to main.
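
To illustrate why that kind of bug hurts e2e perf while leaving device perf untouched, a toy sketch in plain Python (not the actual tt-metal work-executor code): when every device submits its host-side work to the same single worker, dispatch for the 8 devices serializes, while each device's own execution time is unchanged.

# Toy illustration only: eight "devices" each submit a fixed amount of
# host-side work. With one worker per device the submissions overlap; with
# every device pinned to worker 0 they serialize, inflating e2e latency even
# though per-device work time is identical.
import time
from concurrent.futures import ThreadPoolExecutor, wait

NUM_DEVICES = 8
HOST_WORK_S = 0.01  # stand-in for per-device host-side dispatch + readback

def host_side_work(device_id: int) -> None:
    time.sleep(HOST_WORK_S)

def run(executors) -> float:
    start = time.perf_counter()
    futures = [executors[d].submit(host_side_work, d) for d in range(NUM_DEVICES)]
    wait(futures)
    return time.perf_counter() - start

per_device_workers = [ThreadPoolExecutor(max_workers=1) for _ in range(NUM_DEVICES)]
all_on_worker_0 = [ThreadPoolExecutor(max_workers=1)] * NUM_DEVICES

print(f"one worker per device: {run(per_device_workers) * 1e3:.0f} ms")
print(f"all devices on worker 0: {run(all_on_worker_0) * 1e3:.0f} ms (~{NUM_DEVICES}x)")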

mtairum commented 1 day ago

Great work @aliuTT.

I did some testing and can confirm that the perf on main matches what we would expect. Mixtral e2e measured at 117ms.

Also tested the previous bad commit with the fix applied, and the perf is around 150ms, which is close to what it was before as well. 👍

aliuTT commented 1 day ago

Pushed to main; it looks like the issue is resolved.