tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Update Mixtral demo_with_prefill with 32k seqlen inputs #10934

Open mtairum opened 1 month ago

mtairum commented 1 month ago

Update Mixtral demo_with_prefill.py demo script with prompts up to 16k tokens.

We support KV cache sizes up to 32k tokens. A 32k-token prompt would fill the cache during prefill and leave no room to generate new tokens, hence the 16k limit above.
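
To make the budget explicit, here is a minimal sketch of the arithmetic. It assumes every prompt token and every generated token occupies one KV cache slot; `max_new_tokens` is a hypothetical helper for illustration, not a function from the demo.

```python
# Assumption: the KV cache holds 32k tokens (taken from the issue text, read as 32 * 1024).
KV_CACHE_SIZE = 32 * 1024


def max_new_tokens(prompt_len: int, kv_cache_size: int = KV_CACHE_SIZE) -> int:
    """Return how many tokens can still be generated after prefilling prompt_len tokens.

    Hypothetical helper: assumes one cache slot per token, prompt and generated alike.
    """
    if prompt_len < 0:
        raise ValueError("prompt_len must be non-negative")
    # Never report a negative budget when the prompt exceeds the cache.
    return max(kv_cache_size - prompt_len, 0)


if __name__ == "__main__":
    for prompt_len in (16 * 1024, 32 * 1024):
        print(prompt_len, "->", max_new_tokens(prompt_len))
```

Under these assumptions a 32k prompt leaves a budget of zero, while a 16k prompt leaves half the cache free for generation.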

If the generated tokens look bad, we should raise the MLP weight precision back to bfloat8.

Reproduce

Current reproduction steps:

  1. git clone mixtral-32k-demo
  2. pytest models/demos/t3000/mixtral8x7b/demo/demo_with_prefill.py::test_mixtral8x7b_demo[wormhole_b0-True-tale_of_two_cities_instruct]

The Tale of Two Cities input is here: models/demos/t3000/mixtral8x7b/demo/input_tale_of_two_cities_32k.txt

mtairum commented 1 month ago

Current status:

Prefill lengths of 4k or above generate bad outputs. I've also increased the MLP datatype to bfloat8.

Below are the outputs when I pass the following prompt: "Which book is the following excerpt from?" followed by an excerpt from A Tale of Two Cities. This huge prompt is sliced to 1k, 2k, 4k, 8k and 16k tokens to test those sizes.

[1k]
[/INST]
The excerpt is from "A Tale of Two Cities" by Charles Dickens. The novel is set in the late 18th century, during the French Revolution's Reign of Terror. The book begins in England, and then moves to France, with a back and forth movement between the two countries throughout the story. The opening paragraph describes the duality of life in the two cities - Paris and London - during this time period, and sets the stage for the events that are to come.

[2k]
[/INST]
The excerpt is from the novel "A Tale of Two Cities" by Charles Dickens. The book is set in the period of the French Revolution. The particular chapter is named "The Period".

The passage you provided is the celebrated starting of the book, in which the narrator describes

[4k]
[/INST]
The Doverturned to have everything before [It suddenly stops here because it reaches an EOS token]

[8k] <- Already pretty bad
[/INST]
"—" [Followed by 110 newlines]

[16k]
[/INST]
 they were he had beenathree,—not a little the man and herald the man, to the while he washer. It was the clock,—she was the passenger, and her, to the while he had beenathlet her, as he was hissingularly, as he washer eyes, as he was the passenger, as he was he was he was he was he was he had he was he was he was he had he was he had beenathlet her. He was he was he was he wasp it was he was hissing
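
The slicing used for the runs above could be sketched roughly as follows. This is only an illustration: it assumes the prompt is already tokenized, that "1k" means 1024 tokens, and that each test case is a plain prefix of the full prompt. `slice_prompt` is a hypothetical name, and the demo script may slice differently.

```python
from typing import Dict, List, Sequence

# Assumption: "1k" ... "16k" in the comment mean powers of two (1024 ... 16384 tokens).
PREFILL_LENGTHS = (1024, 2048, 4096, 8192, 16384)


def slice_prompt(
    token_ids: Sequence[int], lengths: Sequence[int] = PREFILL_LENGTHS
) -> Dict[int, List[int]]:
    """Return prefixes of token_ids keyed by prefill length.

    Hypothetical helper: lengths longer than the prompt are skipped
    rather than padded.
    """
    return {n: list(token_ids[:n]) for n in lengths if n <= len(token_ids)}


if __name__ == "__main__":
    fake_tokens = list(range(20000))  # stand-in for real tokenizer output
    for length, prefix in slice_prompt(fake_tokens).items():
        print(length, len(prefix))
```

Keeping every case a prefix of the same prompt means the only variable between runs is the prefill length, which makes the 2k-to-4k quality drop easier to isolate.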

mtairum commented 1 month ago

Added reproduction steps to the description.
