tstescoTT closed this issue 2 months ago.
@pavlejosipovic could this be a di/dt issue? If so, anything you can recommend?
@tstescoTT I just ran my demo with your patch and was able to run it for the complete 2048 loops. Commenting here for visibility. Let's sync on what might be different.
Commit: 1b62f170c35e29a66c37c1a824e7b7682779b81f
Thanks for posting @ksriv001, any info helps. I'm testing on https://github.com/tenstorrent/tt-metal/commit/1b62f170c35e29a66c37c1a824e7b7682779b81f now. Was this just main at the time you pulled, or was there a reason for using this specific commit SHA?
@tstescoTT Sure, no problem! Yes, it was just main when I pulled this morning.
I am very happy to report that after rebuilding on https://github.com/tenstorrent/tt-metal/commit/1b62f170c35e29a66c37c1a824e7b7682779b81f I no longer get the hang: 10 batches completed and counting.
2024-08-08 19:41:32.178 | INFO | __main__:run_decode:248 - Loop 2045
2024-08-08 19:41:32.287 | INFO | __main__:run_decode:248 - Loop 2046
2024-08-08 19:41:32.397 | INFO | __main__:run_decode:248 - Loop 2047
2024-08-08 19:41:32.508 | INFO | __main__:latency_printout:306 - Overall throughput: 3.4 ms @ 298.2 tokens/s
2024-08-08 19:41:32.509 | INFO | __main__:latency_printout:307 - Tokens per second per user: 9.3 tokens/s/u
2024-08-08 19:41:32.509 | INFO | __main__:latency_printout:308 - User latency: 107.3 ms @ 9.3 tokens/s
2024-08-08 19:41:32.516 | INFO | __main__:main:125 - finished batch: 10.
2024-08-08 19:41:32.516 | INFO | __main__:main:115 - starting batch: 11, n_users:= 32
2024-08-08 19:41:32.516 | WARNING | __main__:run_decode:235 - Requested more output tokens than allowed by model. Truncating to 1938 output tokens.
2024-08-08 19:41:32.517 | INFO | __main__:run_decode:248 - Loop 1
2024-08-08 19:41:32.618 | INFO | __main__:run_decode:248 - Loop 2
2024-08-08 19:41:32.718 | INFO | __main__:run_decode:248 - Loop 3
2024-08-08 19:41:32.818 | INFO | __main__:run_decode:248 - Loop 4
2024-08-08 19:41:32.919 | INFO | __main__:run_decode:248 - Loop 5
2024-08-08 19:41:33.020 | INFO | __main__:run_decode:248 - Loop 6
2024-08-08 19:41:33.119 | INFO | __main__:run_decode:248 - Loop 7
2024-08-08 19:41:33.218 | INFO | __main__:run_decode:248 - Loop 8
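As a side note, the three throughput lines in the log are internally consistent. A minimal sketch (assuming 32 concurrent users, per the `n_users:= 32` batch header) of how per-user latency, per-user rate, and aggregate throughput relate:

```python
# Sanity-check the reported latency/throughput numbers, assuming the
# 32 concurrent users shown in the batch header above.
users = 32
user_latency_ms = 107.3                               # reported per-user latency

tokens_per_s_per_user = 1000.0 / user_latency_ms      # ~9.3 t/s/u
overall_tokens_per_s = tokens_per_s_per_user * users  # ~298.2 t/s aggregate
ms_per_token_overall = user_latency_ms / users        # ~3.4 ms per token overall

print(round(tokens_per_s_per_user, 1),
      round(overall_tokens_per_s, 1),
      round(ms_per_token_overall, 1))  # → 9.3 298.2 3.4
```

So the "Overall throughput" line is just the per-user rate scaled by the batch size.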
@tstescoTT do you know which machine you ran on when reporting the issue, and which machine you used to verify that the hang doesn't exist on main? @ksriv001, which machine did you use as well? di/dt seems to be machine- (chip-) specific, so it's quite possible that things work on one machine and not on another.
It's the same machine that had the hang previously, f10cs02, with the same firmware and tt-kmd.
BTW, I last tested on v0.51.0-rc19, before https://github.com/tenstorrent/tt-metal/commit/1b62f170c35e29a66c37c1a824e7b7682779b81f.
That's great :) Thanks
Running Llama 3.1 70B, I just hit:
2024-08-09 08:29:52.710 | INFO | __main__:run_decode:248 - Loop 2044
2024-08-09 08:29:52.819 | INFO | __main__:run_decode:248 - Loop 2045
2024-08-09 08:29:52.928 | INFO | __main__:run_decode:248 - Loop 2046
2024-08-09 08:29:53.037 | INFO | __main__:run_decode:248 - Loop 2047
2024-08-09 08:29:53.146 | INFO | __main__:latency_printout:302 - Overall throughput: 3.3 ms @ 305.9 tokens/s
2024-08-09 08:29:53.146 | INFO | __main__:latency_printout:303 - Tokens per second per user: 9.6 tokens/s/u
2024-08-09 08:29:53.146 | INFO | __main__:latency_printout:304 - User latency: 104.6 ms @ 9.6 tokens/s
2024-08-09 08:29:53.152 | INFO | __main__:main:125 - finished batch: 9.
2024-08-09 08:29:53.153 | INFO | __main__:main:115 - starting batch: 10, n_users:= 32
2024-08-09 08:29:53.153 | WARNING | __main__:run_decode:235 - Requested more output tokens than allowed by model. Truncating to 1564 output tokens.
2024-08-09 08:29:53.154 | INFO | __main__:run_decode:248 - Loop 1
2024-08-09 08:29:53.257 | INFO | __main__:run_decode:248 - Loop 2
2024-08-09 08:29:53.357 | INFO | __main__:run_decode:248 - Loop 3
2024-08-09 08:29:53.459 | INFO | __main__:run_decode:248 - Loop 4
2024-08-09 08:29:53.561 | INFO | __main__:run_decode:248 - Loop 5
2024-08-09 08:29:53.662 | INFO | __main__:run_decode:248 - Loop 6
...
2024-08-09 08:30:08.430 | INFO | __main__:run_decode:248 - Loop 155
2024-08-09 08:30:08.530 | INFO | __main__:run_decode:248 - Loop 156
2024-08-09 08:30:08.630 | INFO | __main__:run_decode:248 - Loop 157
2024-08-09 08:30:08.729 | INFO | __main__:run_decode:248 - Loop 158
2024-08-09 08:30:08.829 | INFO | __main__:run_decode:248 - Loop 159
2024-08-09 08:30:08.928 | INFO | __main__:run_decode:248 - Loop 160
2024-08-09 08:30:09.028 | INFO | __main__:run_decode:248 - Loop 161
2024-08-09 08:30:09.128 | INFO | __main__:run_decode:248 - Loop 162
2024-08-09 08:30:09.229 | INFO | __main__:run_decode:248 - Loop 163
2024-08-09 08:30:09.330 | INFO | __main__:run_decode:248 - Loop 164
2024-08-09 08:30:09.430 | INFO | __main__:run_decode:248 - Loop 165
2024-08-09 08:30:09.531 | INFO | __main__:run_decode:248 - Loop 166
2024-08-09 08:30:09.634 | INFO | __main__:run_decode:248 - Loop 167
2024-08-09 08:30:09.734 | INFO | __main__:run_decode:248 - Loop 168
2024-08-09 08:30:09.834 | INFO | __main__:run_decode:248 - Loop 169
2024-08-09 08:30:09.934 | INFO | __main__:run_decode:248 - Loop 170
2024-08-09 08:30:10.034 | INFO | __main__:run_decode:248 - Loop 171
libc++abi: terminating due to uncaught exception of type std::runtime_error: Read 0xffffffff from PCIE: you should reset the board.
Aborted (core dumped)
and when attempting a reset:
(.venv) :~/tt-smi$ tt-smi -r 0,1,2,3
Starting pci link reset on WH devices at pci indices: 0, 1, 2, 3
Failed to recover WH chip, please reboot the system to reset the chip. Exiting...
(.venv) :~/tt-smi$ ll /dev/tenstorrent/
total 0
drwxr-xr-x 2 root root 120 Aug 6 20:25 ./
drwxr-xr-x 21 root root 4180 Aug 6 20:25 ../
crw-rw-rw- 1 root root 236, 0 Aug 6 20:25 0
crw-rw-rw- 1 root root 236, 1 Aug 6 20:25 1
crw-rw-rw- 1 root root 236, 2 Aug 6 20:25 2
crw-rw-rw- 1 root root 236, 3 Aug 6 20:25 3
OK, this definitely smells like di/dt. @davorchap @ttmtrajkovic
After rebooting the host, I'm only seeing 3 of the 4 n300s come back up:
$ ll /dev/tenstorrent/
total 0
drwxr-xr-x 2 root root 100 Aug 9 16:53 ./
drwxr-xr-x 21 root root 4180 Aug 9 16:53 ../
crw-rw-rw- 1 root root 236, 0 Aug 9 16:53 0
crw-rw-rw- 1 root root 236, 1 Aug 9 16:53 1
crw-rw-rw- 1 root root 236, 2 Aug 9 16:53 2
If you haven't already, you may need a cold boot to bring all 4 devices back.
Cold boot brought back the 4th card.
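A quick way to confirm all four cards enumerated after a reset or cold boot is to count the character-device nodes tt-kmd creates under /dev/tenstorrent/, as in the `ll` listings above. A minimal sketch (the helper name is my own):

```python
from pathlib import Path

def tenstorrent_devices(dev_dir: str = "/dev/tenstorrent") -> list[int]:
    """Return the sorted device indices exposed by tt-kmd, or [] if none."""
    p = Path(dev_dir)
    if not p.is_dir():
        return []
    # tt-kmd names each node after its device index: 0, 1, 2, 3, ...
    return sorted(int(e.name) for e in p.iterdir() if e.name.isdigit())

if __name__ == "__main__":
    devs = tenstorrent_devices()
    print(f"found {len(devs)} device(s): {devs}")
```

With all 4 n300s up this should report indices [0, 1, 2, 3]; after the failed reset above it would have reported only [0, 1, 2].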
I reproduced the same issue on v0.51.0-rc26:
2024-08-09 12:53:19.112 | INFO | __main__:run_decode:248 - Loop 2044
2024-08-09 12:53:19.222 | INFO | __main__:run_decode:248 - Loop 2045
2024-08-09 12:53:19.332 | INFO | __main__:run_decode:248 - Loop 2046
2024-08-09 12:53:19.441 | INFO | __main__:run_decode:248 - Loop 2047
2024-08-09 12:53:19.551 | INFO | __main__:latency_printout:302 - Overall throughput: 3.3 ms @ 303.7 tokens/s
2024-08-09 12:53:19.552 | INFO | __main__:latency_printout:303 - Tokens per second per user: 9.5 tokens/s/u
2024-08-09 12:53:19.552 | INFO | __main__:latency_printout:304 - User latency: 105.4 ms @ 9.5 tokens/s
2024-08-09 12:53:19.559 | INFO | __main__:main:125 - finished batch: 5.
2024-08-09 12:53:19.560 | INFO | __main__:main:115 - starting batch: 6, n_users:= 32
2024-08-09 12:53:19.560 | WARNING | __main__:run_decode:235 - Requested more output tokens than allowed by model. Truncating to 1929 output tokens.
2024-08-09 12:53:19.561 | INFO | __main__:run_decode:248 - Loop 1
2024-08-09 12:53:19.679 | INFO | __main__:run_decode:248 - Loop 2
2024-08-09 12:53:19.779 | INFO | __main__:run_decode:248 - Loop 3
2024-08-09 12:53:19.879 | INFO | __main__:run_decode:248 - Loop 4
2024-08-09 12:53:19.979 | INFO | __main__:run_decode:248 - Loop 5
2024-08-09 12:53:20.079 | INFO | __main__:run_decode:248 - Loop 6
2024-08-09 12:53:20.180 | INFO | __main__:run_decode:248 - Loop 7
2024-08-09 12:53:20.279 | INFO | __main__:run_decode:248 - Loop 8
2024-08-09 12:54:16.215 | INFO | __main__:run_decode:248 - Loop 562
2024-08-09 12:54:16.321 | INFO | __main__:run_decode:248 - Loop 563
libc++abi: terminating due to uncaught exception of type std::runtime_error: Read 0xffffffff from PCIE: you should reset the board.
Aborted (core dumped)
This time, tt-smi -r 0,1,2,3 worked and brought up all 4 n300 cards:
(.venv) ~/tt-smi$ tt-smi -r 0,1,2,3
Starting pci link reset on WH devices at pci indices: 0, 1, 2, 3
Finishing pci link reset on WH devices at pci indices: 0, 1, 2, 3
Re-initializing boards after reset....
Detected Chips: 8
Hit the same crash after 29 batches of 32x prompts, this time on Llama 3 70B. A cold boot was required to bring all 4 cards up again.
Testing with the experimental firmware bundle 80.10.1.0 (https://github.com/tenstorrent/tt-firmware/blob/6154c3f999b22072d6f6ac193559e73f1e490116/experiments/fw_pack-80.10.1.0.fwbundle), which decreases the max AICLK to 900 MHz, appears to resolve the issue.
I tested running 15x25 + 23 = 398 batches (12,736 sample prompt completions, ~26M decode loops):
2024-08-15 10:40:11.483 | INFO | __main__:run_decode:248 - Loop 2045
2024-08-15 10:40:11.597 | INFO | __main__:run_decode:248 - Loop 2046
2024-08-15 10:40:11.711 | INFO | __main__:run_decode:248 - Loop 2047
2024-08-15 10:40:11.825 | INFO | __main__:latency_printout:302 - Overall throughput: 3.5 ms @ 287.7 tokens/s
2024-08-15 10:40:11.825 | INFO | __main__:latency_printout:303 - Tokens per second per user: 9.0 tokens/s/u
2024-08-15 10:40:11.825 | INFO | __main__:latency_printout:304 - User latency: 111.2 ms @ 9.0 tokens/s
2024-08-15 10:40:11.831 | INFO | __main__:main:125 - finished batch: 22.
2024-08-15 10:40:11.832 | INFO | __main__:main:115 - starting loop: 15, batch: 23, n_users:= 32
2024-08-15 10:40:11.832 | WARNING | __main__:run_decode:235 - Requested more output tokens than allowed by model. Truncating to 2014 output tokens.
2024-08-15 10:40:11.833 | INFO | __main__:run_decode:248 - Loop 1
2024-08-15 10:40:11.936 | INFO | __main__:run_decode:248 - Loop 2
2024-08-15 10:40:12.039 | INFO | __main__:run_decode:248 - Loop 3
2024-08-15 10:40:12.142 | INFO | __main__:run_decode:248 - Loop 4
2024-08-15 10:40:12.245 | INFO | __main__:run_decode:248 - Loop 5
2024-08-15 10:40:12.348 | INFO | __main__:run_decode:248 - Loop 6
An apparent performance degradation of ~5% is observed (9.5 t/s/u down to 9.0 t/s/u). This is expected given the reduced clock frequency.
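The run totals and the slowdown quoted above check out; a quick sketch of the arithmetic:

```python
# Run accounting for the 80.10.1.0 firmware test above.
batches = 15 * 25 + 23             # 398 batches total
users = 32
prompts = batches * users          # 12,736 sample prompt completions
decode_loops = prompts * 2048      # ~26M decode loops (one per generated token)

# Throughput change after capping AICLK at 900 MHz.
degradation = (9.5 - 9.0) / 9.5    # ~5%

print(prompts, decode_loops, f"{degradation:.1%}")
```

So the ~5% figure is simply the relative drop from 9.5 to 9.0 t/s/u.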
Describe the bug
Consistent hang running the llama3 70B demo after several batches of 32x prompts complete, i.e. 2048 tokens decoded for 32 users, repeated multiple times. The hang sometimes occurs on the 2nd batch of 32 prompts, sometimes after 10 or more. The model forward call hangs and does not return.
To Reproduce
Steps to reproduce the behavior:
1. Build tt-metal on commit f0534b4 (https://github.com/tenstorrent/tt-metal/tree/f0534b4467eb7ae994dfa8dbecbbabb2f6893806)
Expected behavior
The demo should complete all loops of decode token generation.
Logs
Additional context
Possible issues with Ethernet on the latest firmware.