turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Batched generations are very similar #186

Closed: anujnayyar1 closed this issue 9 months ago

anujnayyar1 commented 9 months ago

Hey @turboderp!

I'm facing an unusual problem while performing batched generation to obtain 2 sequences from the same prompt.

Previously, in exllamaV1, to obtain two distinct responses from a single prompt, I would pass the same prompt twice in a list and would then receive two very different sequences. ["x", "x"] -> ["y", "z"]

However, in exllamaV2, the same approach results in responses that are extremely similar to each other within a batch, even though each separate invocation of the function still gives a different response. ["x", "x"] -> ["y", "≈y"]

Is there a way to ensure that each prompt in the list generates a unique response, perhaps by creating a new seed for each prompt?
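For context, a minimal sketch of the batched call described above, assuming the exllamav2 base generator API (the model path, prompt text, and sampler settings here are placeholders, not from the original report):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Load the model (directory is a placeholder)
config = ExLlamaV2Config()
config.model_dir = "/path/to/model"
config.prepare()

model = ExLlamaV2(config)
model.load()

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model, batch_size = 2)  # one cache row per prompt in the batch
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

# The same prompt twice in one batch; the expectation is two distinct completions
prompts = ["Tell me a story about a dragon.", "Tell me a story about a dragon."]
outputs = generator.generate_simple(prompts, settings, num_tokens = 100)

print(outputs[0])
print(outputs[1])
```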

turboderp commented 9 months ago

This happens because sampling is done in the extension, but the randomness for the sampler is provided by Python (to make seeds easier to work with), and I only thought to carry over one random number per batch because your use case hadn't occurred to me at the time.

But I've fixed it with the latest commit, so you'll get different random samplings from multiple instances of the same prompt in a batch.
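Conceptually, the change looks like the following (a schematic sketch, not the library's actual code): before, one random number was drawn in Python and shared by the whole batch, so identical logit rows made identical sampling choices; after, each batch row gets its own random number.

```python
import random

batch_size = 2

# Before: a single random number shared by every row in the batch, so two
# identical prompts (identical logit rows) sample the exact same tokens
shared = random.random()
randoms_before = [shared] * batch_size

# After the fix: an independent random number per batch row, so identical
# logit rows can still diverge during sampling
randoms_after = [random.random() for _ in range(batch_size)]
```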

anujnayyar1 commented 9 months ago

@turboderp Wow! Dude, you are amazing. I continue to be absolutely astounded by how well maintained this project is.