Open mtairum opened 1 month ago
Current status:
Prefill lengths of 4k tokens or more produce bad outputs. I've also increased the MLP datatype to bfloat8.
Below are the outputs when I pass the following prompt:
Prompt: Which book is the following excerpt from?
followed by an excerpt from the book A Tale of Two Cities.
This huge prompt is sliced to 1k, 2k, 4k, 8k and 16k to test out those sizes.
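For anyone who wants to repeat the slicing outside the demo, here is a minimal sketch of the idea. It assumes the prompt is already tokenized into a list of token IDs and that "1k" means 1024 tokens; `slice_prompt` and `PREFILL_LENGTHS` are hypothetical names for illustration, not the demo script's actual code.

```python
from typing import Dict, List, Sequence

# Assumption: "1k" ... "16k" are read as powers of two (1024 ... 16384 tokens).
PREFILL_LENGTHS = (1024, 2048, 4096, 8192, 16384)


def slice_prompt(
    token_ids: Sequence[int],
    lengths: Sequence[int] = PREFILL_LENGTHS,
) -> Dict[int, List[int]]:
    """Return the first n tokens of the prompt for each requested length n.

    Lengths longer than the prompt are skipped rather than padded.
    """
    slices: Dict[int, List[int]] = {}
    for n in lengths:
        if n <= len(token_ids):
            slices[n] = list(token_ids[:n])
    return slices


if __name__ == "__main__":
    # Stand-in for a real tokenized prompt (hypothetical data).
    fake_tokens = list(range(5000))
    for n, chunk in slice_prompt(fake_tokens).items():
        print(n, len(chunk))
```

Every slice shares the same prefix, so the first length at which the output degrades points at a length-dependent problem rather than at the prompt content.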
[1k]
[/INST]
The excerpt is from "A Tale of Two Cities" by Charles Dickens. The novel is set in the late 18th century, during the French Revolution's Reign of Terror. The book begins in England, and then moves to France, with a back and forth movement between the two countries throughout the story. The opening paragraph describes the duality of life in the two cities - Paris and London - during this time period, and sets the stage for the events that are to come.
[2k]
[/INST]
The excerpt is from the novel "A Tale of Two Cities" by Charles Dickens. The book is set in the period of the French Revolution. The particular chapter is named "The Period".
The passage you provided is the celebrated starting of the book, in which the narrator describes
[4k]
[/INST]
The Doverturned to have everything before [It suddenly stops here since it reaches an EOS token]
[8k] <- Already pretty bad
[/INST]
"—" [Followed by 110 newlines]
[16k]
[/INST]
they were he had beenathree,—not a little the man and herald the man, to the while he washer. It was the clock,—she was the passenger, and her, to the while he had beenathlet her, as he was hissingularly, as he washer eyes, as he was the passenger, as he was he was he was he was he was he had he was he was he was he had he was he had beenathlet her. He was he was he was he wasp it was he was hissing
Added reproduction steps to the description.
Current reproduction steps:
git clone mixtral-32k-demo
pytest models/demos/t3000/mixtral8x7b/demo/demo_with_prefill.py::test_mixtral8x7b_demo[wormhole_b0-True-tale_of_two_cities_instruct]
The Tale of Two Cities input is here: models/demos/t3000/mixtral8x7b/demo/input_tale_of_two_cities_32k.txt
Update Mixtral demo_with_prefill.py demo script with prompts up to 16k tokens.
We support KV cache sizes up to 32K. If we make the prompt 32k tokens and prefill that, we cannot generate any more new tokens, hence the limit above.
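The budget behind that limit can be sketched as simple arithmetic. This assumes that "32K" means 32768 cache slots and that every prompt token and every generated token takes one slot; `max_new_tokens` is a hypothetical helper, not part of the demo.

```python
# Assumption: a 32K KV cache holds 32 * 1024 token positions.
MAX_KV_CACHE_TOKENS = 32 * 1024


def max_new_tokens(prompt_len: int, cache_size: int = MAX_KV_CACHE_TOKENS) -> int:
    """Number of tokens that can still be generated after prefilling the prompt."""
    if prompt_len < 0:
        raise ValueError("prompt_len must be non-negative")
    # Prefill fills prompt_len slots; whatever remains is available for decode.
    return max(cache_size - prompt_len, 0)


if __name__ == "__main__":
    for prompt_len in (16 * 1024, 32 * 1024):
        print(prompt_len, max_new_tokens(prompt_len))
```

Under these assumptions a 32k-token prompt leaves nothing for decode, while a 16k-token prompt leaves half the cache for generated tokens.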
mixtral-32k-demo
If the generated tokens look bad, we should increase the MLP weights back to bfloat8.