tenstorrent / tt-metal

:metal: TT-NN operator library and TT-Metalium low-level kernel programming model.
Apache License 2.0

[llama] llama 3 70B hangs after multiple batches of full context decode generation #10795

Closed tstescoTT closed 2 months ago

tstescoTT commented 2 months ago

Describe the bug

Consistent hang when running the llama 3 70B demo after several batches of 32x prompts complete, i.e. 2048 tokens decoded for 32x users, multiple times over. The hang sometimes occurs on the 2nd batch of 32 prompts, sometimes only after 10 or more. The model forward call hangs and never returns.

To Reproduce

Steps to reproduce the behavior:

  1. Build tt-metal on commit f0534b4 (https://github.com/tenstorrent/tt-metal/tree/f0534b4467eb7ae994dfa8dbecbbabb2f6893806).
  2. Run the llama 3 demo with looping (attached patch file: llama3_70b_demo_looping.patch); a conceptual sketch of the change follows below.
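
The patch itself is attached above and not reproduced here; conceptually it just wraps the demo's batch run in an outer loop. A minimal sketch, assuming a hypothetical `run_decode` signature (the real driver lives in models/demos/t3000/llama3_70b/demo/demo.py and may differ):

```python
from loguru import logger

def run_demo_batches(run_decode, model, tokenizer, prompts, num_batches=32):
    """Re-run the 32-user demo batch until the hang reproduces (sketch)."""
    for batch_idx in range(num_batches):
        logger.info(f"batch_idx: {batch_idx}")
        # Each call generates up to 2048 tokens for each of the 32 users; the
        # hang shows up as the model forward call inside never returning.
        run_decode(model, tokenizer, prompts, max_gen_len=2048)
```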

Expected behavior

The demo should complete all loops of decode token generation.

Logs

pytest -svv models/demos/t3000/llama3_70b/demo/demo.py::test_LlamaModel_demo[wormhole_b0-True-short_context-check_disabled-sampling-tt-70b-T3000-80L-decode_only-text_completion-llama3]
...
2024-07-26 09:28:10.337 | INFO     | models.demos.t3000.llama2_70b.demo.demo:run_decode:261 - Loop 2045
2024-07-26 09:28:10.451 | INFO     | models.demos.t3000.llama2_70b.demo.demo:run_decode:261 - Loop 2046
2024-07-26 09:28:10.564 | INFO     | models.demos.t3000.llama2_70b.demo.demo:run_decode:261 - Loop 2047
2024-07-26 09:28:10.566 | INFO     | models.demos.t3000.llama2_70b.demo.demo:latency_printout:303 - Overall throughput: 3.4 ms @ 292.8 tokens/s
2024-07-26 09:28:10.566 | INFO     | models.demos.t3000.llama2_70b.demo.demo:latency_printout:304 - Tokens per second per user: 9.2 tokens/s/u
2024-07-26 09:28:10.566 | INFO     | models.demos.t3000.llama2_70b.demo.demo:latency_printout:305 - User latency: 109.3 ms @ 9.2 tokens/s
2024-07-26 09:28:10.574 | INFO     | models.demos.t3000.llama2_70b.demo.demo:main:108 - batch_idx: 1
2024-07-26 09:28:10.574 | WARNING  | models.demos.t3000.llama2_70b.demo.demo:run_decode:237 - Requested more output tokens than allowed by model. Truncating to 2000 output tokens.
2024-07-26 09:28:10.674 | INFO     | models.demos.t3000.llama2_70b.demo.demo:run_decode:261 - Loop 1
2024-07-26 09:28:10.773 | INFO     | models.demos.t3000.llama2_70b.demo.demo:run_decode:261 - Loop 2
2024-07-26 09:28:10.872 | INFO     | models.demos.t3000.llama2_70b.demo.demo:run_decode:261 - Loop 3
...
2024-07-26 09:29:05.200 | INFO     | models.demos.t3000.llama2_70b.demo.demo:run_decode:261 - Loop 523
2024-07-26 09:29:05.309 | INFO     | models.demos.t3000.llama2_70b.demo.demo:run_decode:261 - Loop 524
2024-07-26 09:29:05.417 | INFO     | models.demos.t3000.llama2_70b.demo.demo:run_decode:261 - Loop 525
2024-07-26 09:29:05.526 | INFO     | models.demos.t3000.llama2_70b.demo.demo:run_decode:261 - Loop 526
2024-07-26 09:29:05.634 | INFO     | models.demos.t3000.llama2_70b.demo.demo:run_decode:261 - Loop 527
-> hung for over 10 minutes before I killed the process
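
Since the forward call never returns, a watchdog around it lets a soak run fail fast with a clear error instead of sitting silent for 10+ minutes. A minimal sketch; `decode_forward` is a stand-in callable, not the demo's actual API, and the wedged worker thread itself cannot be killed:

```python
import concurrent.futures

def forward_with_timeout(decode_forward, *args, timeout_s=120, **kwargs):
    """Run one decode step, raising if it takes longer than timeout_s."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(decode_forward, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The device is likely wedged; report the hang immediately rather
        # than letting pytest sit silently. The worker thread stays blocked.
        raise RuntimeError(f"decode forward hung for more than {timeout_s}s")
    finally:
        pool.shutdown(wait=False)  # don't block on the possibly-hung worker
```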

Please complete the following environment information:

(screenshot: 2024-07-26 at 12:46:07 PM)

Additional context

Possible issues with Ethernet on the latest firmware.

uaydonat commented 2 months ago

@pavlejosipovic could this be a di/dt issue? If so, anything you can recommend?

ksriv001 commented 2 months ago

@tstescoTT I just ran my demo with your patch and was able to run it for the complete 2048 loops. Commenting here for visibility. Let's sync on what might be different.

Commit: 1b62f170c35e29a66c37c1a824e7b7682779b81f

(screenshot attached)
tstescoTT commented 2 months ago

Thanks for posting @ksriv001, any info helps. I'm testing on https://github.com/tenstorrent/tt-metal/commit/1b62f170c35e29a66c37c1a824e7b7682779b81f now. Was this just main at the time you pulled, or was there a reason for using this specific commit SHA?

ksriv001 commented 2 months ago

@tstescoTT Sure, no problem! Yes, it was just main when I pulled this morning.

tstescoTT commented 2 months ago

I am very happy to report that after rebuilding on https://github.com/tenstorrent/tt-metal/commit/1b62f170c35e29a66c37c1a824e7b7682779b81f I no longer see the hang: 10 batches run and counting.

2024-08-08 19:41:32.178 | INFO     | __main__:run_decode:248 - Loop 2045
2024-08-08 19:41:32.287 | INFO     | __main__:run_decode:248 - Loop 2046
2024-08-08 19:41:32.397 | INFO     | __main__:run_decode:248 - Loop 2047
2024-08-08 19:41:32.508 | INFO     | __main__:latency_printout:306 - Overall throughput: 3.4 ms @ 298.2 tokens/s                                                                  
2024-08-08 19:41:32.509 | INFO     | __main__:latency_printout:307 - Tokens per second per user: 9.3 tokens/s/u                                                                   
2024-08-08 19:41:32.509 | INFO     | __main__:latency_printout:308 - User latency: 107.3 ms @ 9.3 tokens/s                                                                        
2024-08-08 19:41:32.516 | INFO     | __main__:main:125 - finished batch: 10.
2024-08-08 19:41:32.516 | INFO     | __main__:main:115 - starting batch: 11, n_users:= 32
2024-08-08 19:41:32.516 | WARNING  | __main__:run_decode:235 - Requested more output tokens than allowed by model. Truncating to 1938 output tokens.                              
2024-08-08 19:41:32.517 | INFO     | __main__:run_decode:248 - Loop 1   
2024-08-08 19:41:32.618 | INFO     | __main__:run_decode:248 - Loop 2   
2024-08-08 19:41:32.718 | INFO     | __main__:run_decode:248 - Loop 3   
2024-08-08 19:41:32.818 | INFO     | __main__:run_decode:248 - Loop 4   
2024-08-08 19:41:32.919 | INFO     | __main__:run_decode:248 - Loop 5   
2024-08-08 19:41:33.020 | INFO     | __main__:run_decode:248 - Loop 6   
2024-08-08 19:41:33.119 | INFO     | __main__:run_decode:248 - Loop 7   
2024-08-08 19:41:33.218 | INFO     | __main__:run_decode:248 - Loop 8
pavlepopovic commented 2 months ago

@tstescoTT do you know which machine you ran this on when reporting the issue, and which machine you ran it on to verify that it doesn't exist on main? Also, @ksriv001, which machine did you use? di/dt seems to be machine(chip)-specific, so it's quite possible that things work on one machine and not on another.

tstescoTT commented 2 months ago

It's the same machine that had the hang previously (f10cs02), with the same firmware and tt-kmd as well.

tstescoTT commented 2 months ago

BTW, I last tested on v0.51.0-rc19 before https://github.com/tenstorrent/tt-metal/commit/1b62f170c35e29a66c37c1a824e7b7682779b81f.

pavlepopovic commented 2 months ago

That's great :) Thanks

tstescoTT commented 2 months ago

Running Llama 3.1 70B, I just hit:

2024-08-09 08:29:52.710 | INFO     | __main__:run_decode:248 - Loop 2044
2024-08-09 08:29:52.819 | INFO     | __main__:run_decode:248 - Loop 2045
2024-08-09 08:29:52.928 | INFO     | __main__:run_decode:248 - Loop 2046
2024-08-09 08:29:53.037 | INFO     | __main__:run_decode:248 - Loop 2047
2024-08-09 08:29:53.146 | INFO     | __main__:latency_printout:302 - Overall throughput: 3.3 ms @ 305.9 tokens/s
2024-08-09 08:29:53.146 | INFO     | __main__:latency_printout:303 - Tokens per second per user: 9.6 tokens/s/u
2024-08-09 08:29:53.146 | INFO     | __main__:latency_printout:304 - User latency: 104.6 ms @ 9.6 tokens/s
2024-08-09 08:29:53.152 | INFO     | __main__:main:125 - finished batch: 9.
2024-08-09 08:29:53.153 | INFO     | __main__:main:115 - starting batch: 10, n_users:= 32
2024-08-09 08:29:53.153 | WARNING  | __main__:run_decode:235 - Requested more output tokens than allowed by model. Truncating to 1564 output tokens.
2024-08-09 08:29:53.154 | INFO     | __main__:run_decode:248 - Loop 1                                                                                                                   
2024-08-09 08:29:53.257 | INFO     | __main__:run_decode:248 - Loop 2                                                                                                                   
2024-08-09 08:29:53.357 | INFO     | __main__:run_decode:248 - Loop 3                       
2024-08-09 08:29:53.459 | INFO     | __main__:run_decode:248 - Loop 4  
2024-08-09 08:29:53.561 | INFO     | __main__:run_decode:248 - Loop 5 
2024-08-09 08:29:53.662 | INFO     | __main__:run_decode:248 - Loop 6 
...
2024-08-09 08:30:08.430 | INFO     | __main__:run_decode:248 - Loop 155
2024-08-09 08:30:08.530 | INFO     | __main__:run_decode:248 - Loop 156
2024-08-09 08:30:08.630 | INFO     | __main__:run_decode:248 - Loop 157
2024-08-09 08:30:08.729 | INFO     | __main__:run_decode:248 - Loop 158
2024-08-09 08:30:08.829 | INFO     | __main__:run_decode:248 - Loop 159
2024-08-09 08:30:08.928 | INFO     | __main__:run_decode:248 - Loop 160
2024-08-09 08:30:09.028 | INFO     | __main__:run_decode:248 - Loop 161
2024-08-09 08:30:09.128 | INFO     | __main__:run_decode:248 - Loop 162
2024-08-09 08:30:09.229 | INFO     | __main__:run_decode:248 - Loop 163
2024-08-09 08:30:09.330 | INFO     | __main__:run_decode:248 - Loop 164
2024-08-09 08:30:09.430 | INFO     | __main__:run_decode:248 - Loop 165
2024-08-09 08:30:09.531 | INFO     | __main__:run_decode:248 - Loop 166
2024-08-09 08:30:09.634 | INFO     | __main__:run_decode:248 - Loop 167
2024-08-09 08:30:09.734 | INFO     | __main__:run_decode:248 - Loop 168
2024-08-09 08:30:09.834 | INFO     | __main__:run_decode:248 - Loop 169
2024-08-09 08:30:09.934 | INFO     | __main__:run_decode:248 - Loop 170
2024-08-09 08:30:10.034 | INFO     | __main__:run_decode:248 - Loop 171
libc++abi: terminating due to uncaught exception of type std::runtime_error: Read 0xffffffff from PCIE: you should reset the board.
Aborted (core dumped)

And when attempting a reset:

(.venv) :~/tt-smi$ tt-smi -r 0,1,2,3
 Starting pci link reset on WH devices at pci indices: 0, 1, 2, 3 
 Failed to recover WH chip, please reboot the system to reset the chip. Exiting... 
(.venv) :~/tt-smi$ ll /dev/tenstorrent/
total 0
drwxr-xr-x  2 root root    120 Aug  6 20:25 ./
drwxr-xr-x 21 root root   4180 Aug  6 20:25 ../
crw-rw-rw-  1 root root 236, 0 Aug  6 20:25 0
crw-rw-rw-  1 root root 236, 1 Aug  6 20:25 1
crw-rw-rw-  1 root root 236, 2 Aug  6 20:25 2
crw-rw-rw-  1 root root 236, 3 Aug  6 20:25 3
pavlepopovic commented 2 months ago

OK, this definitely smells like di/dt. @davorchap @ttmtrajkovic

tstescoTT commented 2 months ago

After rebooting the host, I'm only seeing 3 of the 4 n300s come back up:

$ ll /dev/tenstorrent/
total 0
drwxr-xr-x  2 root root    100 Aug  9 16:53 ./
drwxr-xr-x 21 root root   4180 Aug  9 16:53 ../
crw-rw-rw-  1 root root 236, 0 Aug  9 16:53 0
crw-rw-rw-  1 root root 236, 1 Aug  9 16:53 1
crw-rw-rw-  1 root root 236, 2 Aug  9 16:53 2
(screenshot: 2024-08-09 at 1:15:08 PM)
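
A tiny preflight check along these lines can catch a missing card before kicking off another multi-hour soak run; a sketch that just counts the character-device nodes shown in the listings above:

```python
import os

EXPECTED_DEVICES = 4  # one node per PCIe-visible n300 card on this T3000 host
present = sorted(os.listdir("/dev/tenstorrent"))
if len(present) != EXPECTED_DEVICES:
    raise SystemExit(
        f"only {len(present)}/{EXPECTED_DEVICES} devices present ({present}); "
        "try tt-smi -r or a cold boot before starting the soak run"
    )
```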
cglagovichTT commented 2 months ago

If you haven't, you may need a cold boot to bring all 4 devices back

tstescoTT commented 2 months ago

Cold boot brought back the 4th card.

I reproduced the same issue on v0.51.0-rc26:

2024-08-09 12:53:19.112 | INFO     | __main__:run_decode:248 - Loop 2044
2024-08-09 12:53:19.222 | INFO     | __main__:run_decode:248 - Loop 2045
2024-08-09 12:53:19.332 | INFO     | __main__:run_decode:248 - Loop 2046
2024-08-09 12:53:19.441 | INFO     | __main__:run_decode:248 - Loop 2047
2024-08-09 12:53:19.551 | INFO     | __main__:latency_printout:302 - Overall throughput: 3.3 ms @ 303.7 tokens/s                                             
2024-08-09 12:53:19.552 | INFO     | __main__:latency_printout:303 - Tokens per second per user: 9.5 tokens/s/u                                              
2024-08-09 12:53:19.552 | INFO     | __main__:latency_printout:304 - User latency: 105.4 ms @ 9.5 tokens/s                                                   
2024-08-09 12:53:19.559 | INFO     | __main__:main:125 - finished batch: 5.
2024-08-09 12:53:19.560 | INFO     | __main__:main:115 - starting batch: 6, n_users:= 32                                                                     
2024-08-09 12:53:19.560 | WARNING  | __main__:run_decode:235 - Requested more output tokens than allowed by model. Truncating to 1929 output tokens.         
2024-08-09 12:53:19.561 | INFO     | __main__:run_decode:248 - Loop 1   
2024-08-09 12:53:19.679 | INFO     | __main__:run_decode:248 - Loop 2   
2024-08-09 12:53:19.779 | INFO     | __main__:run_decode:248 - Loop 3   
2024-08-09 12:53:19.879 | INFO     | __main__:run_decode:248 - Loop 4                                                                                        
2024-08-09 12:53:19.979 | INFO     | __main__:run_decode:248 - Loop 5   
2024-08-09 12:53:20.079 | INFO     | __main__:run_decode:248 - Loop 6   
2024-08-09 12:53:20.180 | INFO     | __main__:run_decode:248 - Loop 7   
2024-08-09 12:53:20.279 | INFO     | __main__:run_decode:248 - Loop 8
2024-08-09 12:54:16.215 | INFO     | __main__:run_decode:248 - Loop 562
2024-08-09 12:54:16.321 | INFO     | __main__:run_decode:248 - Loop 563
libc++abi: terminating due to uncaught exception of type std::runtime_error: Read 0xffffffff from PCIE: you should reset the board.
Aborted (core dumped)
tstescoTT commented 2 months ago

This time, tt-smi -r 0,1,2,3 worked and brought up all 4 n300 cards:

(.venv) ~/tt-smi$ tt-smi -r 0,1,2,3
 Starting pci link reset on WH devices at pci indices: 0, 1, 2, 3 
 Finishing pci link reset on WH devices at pci indices: 0, 1, 2, 3 
 Re-initializing boards after reset.... 
 Detected Chips: 8
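
Since the crash is intermittent and tt-smi -r sometimes recovers the boards, a small harness can keep a long soak test going across crashes. A sketch, assuming the demo exits nonzero on the PCIE error (it aborts with a core dump, so it does) and that tt-smi exits nonzero when its reset fails:

```python
import subprocess

# Abbreviated; use the full pytest selector from the repro command above.
DEMO_CMD = ["pytest", "-svv", "models/demos/t3000/llama3_70b/demo/demo.py"]

for attempt in range(30):
    if subprocess.run(DEMO_CMD).returncode == 0:
        break  # soak run finished without a hang or PCIE crash
    print(f"demo died (attempt {attempt}), attempting board reset")
    if subprocess.run(["tt-smi", "-r", "0,1,2,3"]).returncode != 0:
        # As seen above, a failed reset means a host reboot / cold boot.
        raise SystemExit("tt-smi reset failed; cold boot required")
```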
tstescoTT commented 2 months ago

Hit the same crash after 29 batches of 32x prompts, this time on Llama 3 70B. A cold boot was required to bring all 4 cards up again.

tstescoTT commented 2 months ago

Testing with the experimental firmware bundle 80.10.1.0 (https://github.com/tenstorrent/tt-firmware/blob/6154c3f999b22072d6f6ac193559e73f1e490116/experiments/fw_pack-80.10.1.0.fwbundle), which decreases the max AICLK to 900 MHz, appears to resolve the issue.

I tested running 15x25 + 23 = 398 batches (398 x 32 = 12,736 sample prompt completions, i.e. ~26M tokens decoded):

2024-08-15 10:40:11.483 | INFO     | __main__:run_decode:248 - Loop 2045
2024-08-15 10:40:11.597 | INFO     | __main__:run_decode:248 - Loop 2046
2024-08-15 10:40:11.711 | INFO     | __main__:run_decode:248 - Loop 2047
2024-08-15 10:40:11.825 | INFO     | __main__:latency_printout:302 - Overall throughput: 3.5 ms @ 287.7 tokens/s
2024-08-15 10:40:11.825 | INFO     | __main__:latency_printout:303 - Tokens per second per user: 9.0 tokens/s/u
2024-08-15 10:40:11.825 | INFO     | __main__:latency_printout:304 - User latency: 111.2 ms @ 9.0 tokens/s
2024-08-15 10:40:11.831 | INFO     | __main__:main:125 - finished batch: 22.
2024-08-15 10:40:11.832 | INFO     | __main__:main:115 - starting loop: 15, batch: 23, n_users:= 32
2024-08-15 10:40:11.832 | WARNING  | __main__:run_decode:235 - Requested more output tokens than allowed by model. Truncating to 2014 output tokens.
2024-08-15 10:40:11.833 | INFO     | __main__:run_decode:248 - Loop 1
2024-08-15 10:40:11.936 | INFO     | __main__:run_decode:248 - Loop 2
2024-08-15 10:40:12.039 | INFO     | __main__:run_decode:248 - Loop 3
2024-08-15 10:40:12.142 | INFO     | __main__:run_decode:248 - Loop 4
2024-08-15 10:40:12.245 | INFO     | __main__:run_decode:248 - Loop 5
2024-08-15 10:40:12.348 | INFO     | __main__:run_decode:248 - Loop 6

An apparent performance degradation of ~5% is observed (9.5 t/s/u down to 9.0 t/s/u). This is anticipated given the reduced clock frequency; a quick check of the numbers follows.
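
The ~5% figure follows directly from the reported per-user rates. The 1000 MHz stock max AICLK used for comparison below is an assumption; the only number confirmed above is the 900 MHz cap:

```python
baseline, capped = 9.5, 9.0  # tokens/s/u before and after the AICLK cap
throughput_drop = (baseline - capped) / baseline  # ~5.3%
clock_drop = (1000 - 900) / 1000  # assumes a 1000 MHz stock max AICLK
print(f"throughput drop {throughput_drop:.1%} vs clock drop {clock_drop:.1%}")
```

If the stock clock really is 1000 MHz, the throughput dropping by roughly half the clock reduction would suggest decode does not scale 1:1 with core clock at this batch size.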