Closed: mtairum closed this issue 1 day ago.
The commit that introduced the perf regression is: https://github.com/tenstorrent/tt-metal/commit/87378f73d1e1c55e2f9c3d4d58676bf1521b79fe
To be able to successfully run a Mixtral test with the above commit, one also needs to cherry-pick: https://github.com/tenstorrent/tt-metal/commit/14dd9a6a4dafc6973a821bf5bdfe5cc6fe87918c
With the above, Mixtral 32L e2e latency regresses from ~160ms to ~850ms.
Looking into this...
FYI @aliuTT
What machine are you running on, what devices are you using? I did add more cleanup and bug fix type of logic in that commit. Basically resetting device runtime objects, deallocating more buffers, which helps general stability in runtime. Not sure if those are included in the e2e runtime.
This was tested on a T3002 (`sjc-lab-t3002`) using all 8 devices with program cache enabled.
Our e2e numbers above are capturing the model run + getting the output back to host. We do 1 warmup iteration and then measure again for a single iteration:
```python
profiler.start(f"model_run_for_inference_{i}")
tt_out = tt_model(decode_input, start_pos, current_pos, attn_mask, rot_mat)
# Convert ttnn tensor to torch tensor
tt_output_torch = (
    ttnn.to_torch(tt_out, mesh_composer=ConcatMeshToTensor(tt_model.device_mesh, dim=0))[0]
    .squeeze(1)
    .view(batch, seqlen, -1)
    .detach()
    .float()
)
profiler.end(f"model_run_for_inference_{i}")
```
Also tested this on our demo code (with profiler and tracy disabled), and the perf reported there is similar (around 850ms e2e) when printed to terminal using python `time`. So that rules out any slowdown due to the profiler.
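For reference, a minimal sketch of this style of wall-clock e2e measurement (names like `run_iteration` are hypothetical stand-ins, not the actual test code):

```python
import time

def measure_e2e(run_iteration, warmup_iters=1, measure_iters=1):
    """Warm up first (compile/program-cache fill), then time the
    measured iterations with a monotonic wall clock.
    `run_iteration` stands in for one decode step including
    readback of the output to host."""
    for _ in range(warmup_iters):
        run_iteration()
    start = time.perf_counter()
    for _ in range(measure_iters):
        run_iteration()
    return (time.perf_counter() - start) / measure_iters

# Example with a dummy ~10ms iteration
latency_s = measure_e2e(lambda: time.sleep(0.01))
```

Using `time.perf_counter()` rather than a profiler keeps the measurement independent of tracy/profiler overhead, which is exactly the cross-check being described here.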
I've run tracy for 1L on both the previous commit (`good_perf`) and this commit (`bad_perf`). The tracy files are shared in gdrive: https://drive.google.com/drive/folders/1s9Crrs3St1qw96om5VEbJsvjc5qZNpgm?usp=sharing
I noticed that on the `good_perf` trace the gaps between running the iteration on device and `PopWork` are way shorter than in the `bad_perf` trace.

On `good_perf`: *(tracy screenshot)*

On `bad_perf`: *(tracy screenshot)*
The `to` op on the main threads takes a similar amount of time in both: 13ms on `bad` and 14.26ms on `good`.
I'm bumping this issue to P0 since I've confirmed it's affecting other models. @aliuTT if you can prioritise looking into this issue we would appreciate it, to avoid newer commits compounding the perf impact.
Falcon40b prefill perf test shows a similar perf degradation from the same commit onwards: 87378f73d1e1c55e2f9c3d4d58676bf1521b79fe

- Bad perf commit: 87378f73d1e1c55e2f9c3d4d58676bf1521b79fe
- Good perf commit: c1c3f6b571

FYI @johanna-rock-tt
--
Additionally I checked device perf on the Mixtral perf test and it isn't affected by this. All op timings matched almost perfectly one-to-one.
Only end2end perf is affected by this commit.
I wasn't able to repro the passing case because my machine had old FW. Finally installed the latest, hopefully can repro issue now :(
Unfortunately I still can't repro the passing commit with the new FW. I'm doing:

```
git checkout c1c3f6b571
CONFIG=Release ./build_metal.sh
pytest "models/demos/t3000/falcon40b/tests/test_perf_falcon.py::test_perf_bare_metal[falcon_40b-prefill_seq2048_bfp8-8chips]"
```
This sequence is hanging for me. Are there other env variables to set?
@aliuTT Where is it hanging?
The warmup run should take approximately 5 min (timings below).
I'm running with the following flags:

```
export TT_METAL_ASYNC_DEVICE_QUEUE=1
export TTNN_CONFIG_OVERRIDES='{"enable_fast_runtime_mode": true}'
# This is necessary
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
# These shouldn't be necessary
export ENABLE_TRACY=1
export ENABLE_PROFILER=1
```
I've tried to reproduce your steps on `sjc-lab-3002`. Seeing the error:

```
RuntimeError: TT_FATAL @ ../tt_metal/impl/allocator/allocator.cpp:105: num_shards.value() <= num_compute_banks
```

-> The eth dispatch flag is required:

```
WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest "models/demos/t3000/falcon40b/tests/test_perf_falcon.py::test_perf_bare_metal[falcon_40b-prefill_seq2048_bfp8-8chips]"
```

works fine.
Ah I wasn't using eth dispatch, it wasn't mentioned in the original repro steps. I'll try this again
I added these env flags:

```
export TT_METAL_ASYNC_DEVICE_QUEUE=1
export TTNN_CONFIG_OVERRIDES='{"enable_fast_runtime_mode": true}'
# This is necessary
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
```

and

```
WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest "models/demos/t3000/falcon40b/tests/test_perf_falcon.py::test_perf_bare_metal[falcon_40b-prefill_seq2048_bfp8-8chips]"
```

is hanging here:
```
-instruct/transformer.word_embeddings.weight_6_8_dtype_BFLOAT16_layout_ROW_MAJOR.bin of shape ttnn.Shape([1, 1, 65024, 1024])
2024-06-24 14:55:00.597 | DEBUG | ttnn.operations.core:as_tensor:698 - Loaded cache for models/demos/t3000/falcon40b/datasets/tt_dnn-models/tt/Falcon/tiiuae/falcon-40b-instruct/transformer.word_embeddings.weight_7_8_dtype_BFLOAT16_layout_ROW_MAJOR.bin of shape ttnn.Shape([1, 1, 65024, 1024])
Loading decoder layers: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [01:00<00:00, 1.00s/it]
2024-06-24 14:56:42.187 | INFO | models.demos.t3000.falcon40b.tests.test_perf_falcon:run_test_FalconCausalLM_end_to_end:148 - Running Falcon model once to fill caches -> disable profiler
2024-06-24 14:56:42.190 | INFO | tracy:signpost:43 - COMPILE_RUN
```
I'll try adding the `ENABLE_TRACY=1` and `ENABLE_PROFILER=1` flags.
Still hangs for me. I'll reach out to book a meeting with you @mtairum . I probably need to borrow your machine, or check FW versions.
We are also seeing a similar degradation; we can book a meeting and use our machine as well.
Latest FW 80.9 (`fw_pack-80.9.0.0.fwbundle`) is broken for all the tests I've been given here. Where do I get the "working" FW for these models?
@mtairum Can you try your mixtral test with sync and async on both `c1c3f6b571` and `87378f73d1e`? I'm debugging with Kevin and Ammar; we're seeing that even without the DevicePool changes in `87378f73d1e`, async mode is causing a perf degradation for larger sequence lengths for LLAMA.
@mtairum Can you also try without PROFILER and TRACY enabled. They can perturb e2e measurements.
@aliuTT I've tried without PROFILER/TRACY enabled, and with/without async. Only async e2e perf seems affected, like you mentioned. Numbers below.

- Good commit: `c1c3f6b571`
- Bad commit: `87378f73d1e`
I was getting a hang on `sjc-snva-t3011` and `sjc-lab-t3005` for mixtral tests. I'm able to repro a passing case on `sjc-lab-t3004`; will continue looking.
@mikevin920 @mtairum Found the bug, which was causing a device to always set its work executor to worker 0. Try out `dc3c72f` as a fix, on your repro cases from a few weeks ago and on main if possible. I was seeing Mixtral perf recover to ~250ms e2e again with this change. I have a PR to push this to main.
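To illustrate the failure mode (this is a hypothetical sketch, not the actual tt-metal code): if every device resolves its work executor to worker 0, all async work funnels through a single worker queue and multi-device execution serializes, while the intended behavior gives each device its own worker.

```python
# Bug pattern: the device id is ignored and worker 0 is always chosen.
def pick_worker_buggy(device_id, num_workers):
    return 0

# Fixed pattern: each device maps to its own worker.
def pick_worker_fixed(device_id, num_workers):
    return device_id % num_workers

devices = range(8)
buggy = [pick_worker_buggy(d, 8) for d in devices]  # every device -> worker 0
fixed = [pick_worker_fixed(d, 8) for d in devices]  # one worker per device
```

With the buggy mapping, 8 devices' worth of host-side dispatch runs sequentially on one thread, which matches the large per-device gaps seen in the tracy traces.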
Great work @aliuTT .
I did some testing and can confirm that the perf on main matches what we would expect. Mixtral e2e measured at 117ms.
Also tested the previous bad commit with the fix applied, and the perf is around ~150ms, which is close to what it was as well. 👍
Pushed to main; the issue seems resolved.
Updated the main post with information from the first few comments.
Commit that introduced the issue
-> The performance issues were introduced in commit https://github.com/tenstorrent/tt-metal/commit/87378f73d1e1c55e2f9c3d4d58676bf1521b79fe , which updates the device pool logic.
Issue
5 to 6x end-to-end performance degradation on LLMs. Tested on Mixtral and Falcon40b; waiting for Llama numbers.
This issue affects only end-to-end performance. Device performance (tested on Mixtral) remains the same.
I've generated tracy files for both the bad and good performance (good commit: `c1c3f6b`). They are available in the gdrive folder linked above.

On the tracy files there is a larger gap between a device execution/iteration and WorkerPop (see figures in comment below). Specifically, the good perf commit shows gaps of 85us ~ 2ms across all 8 devices, versus 10ms ~ 45ms on the bad one.
Reproduction
Requires flag: `WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml`

Other flags that I have set up in my env:
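The flags block itself didn't survive extraction; presumably these are the same ones listed earlier in the thread:

```shell
export TT_METAL_ASYNC_DEVICE_QUEUE=1
export TTNN_CONFIG_OVERRIDES='{"enable_fast_runtime_mode": true}'
```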
For Mixtral reproduction you will need the weights on your machine. Follow the Mixtral README on how to set them up (or copy the weights from an available machine).
For example:

```
/home/mtairum/mistral/Mixtral-8x7B-v0.1/repack_weights.pt
/home/mtairum/mistral/Mixtral-8x7B-v0.1/tokenizer.model
```
Once you have the repacked weights, you just have to point to the folder in the flags:
To run Mixtral perf test:
Mixtral-only TODO after perf goes back to normal
The current Mixtral perf test measures 1L. The time per layer is not linear, so we get a more accurate perf estimate when running 32L.
Once the perf is fixed, I'll update the perf test with tighter expected values to avoid this issue in the future. The current target should be 158.4ms e2e perf for 32L and 2048 KV cache length.