turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Beam Search Implementation #84

Open ChrisCates opened 11 months ago

ChrisCates commented 11 months ago

Hello Exllama friends,

I was curious what the thoughts are on implementing beam search in v2. In v1, beam search was implemented in the core generator.

I was curious what the requirements would be to migrate the same source over to v2, and whether there is anything I should be mindful of when creating a PR migrating v1 beam search to v2.

turboderp commented 11 months ago

It definitely needs to be adapted for the new version, so expect it to need some minor changes at least. But I'm not sure I'd do it the same way. In V1 I avoided using batches so the beam search wouldn't have VRAM overhead, but then of course there was extra latency instead. I think you should be able to get the best of both worlds with a slightly different approach, though. Just haven't quite figured it out yet.
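For reference, the batched variant boils down to something like the sketch below. This is plain PyTorch, not exllamav2 API; `beam_step` and its arguments are illustrative. Keeping all beams in one batch is where the VRAM overhead comes from, while looping over beams one at a time is where the latency came from in v1.

```python
import torch

def beam_step(logits: torch.Tensor, beam_scores: torch.Tensor, num_beams: int):
    """One expansion step of batched beam search.

    logits:      (num_beams, vocab_size) raw logits for each live beam
    beam_scores: (num_beams,) cumulative log-probability of each beam
    Returns the parent beam indices, the chosen tokens, and the updated
    cumulative scores for the top `num_beams` continuations.
    (On the very first step all beams are identical, so you would expand
    only one of them.)
    """
    log_probs = torch.log_softmax(logits, dim=-1)           # (beams, vocab)
    vocab_size = log_probs.shape[-1]

    # Add each beam's running score to all of its continuations, then
    # flatten so the best continuations can be picked globally.
    total = (beam_scores.unsqueeze(-1) + log_probs).view(-1)
    top_scores, top_idx = total.topk(num_beams)

    parent_beams = top_idx // vocab_size   # which beam each winner extends
    next_tokens = top_idx % vocab_size     # which token extends it
    return parent_beams, next_tokens, top_scores
```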

cohan8999 commented 4 months ago

@turboderp have you put more thought into this? I barely understand any of it, but the way I see it, there are different strategies and you have to pick one or the other in terms of advantages, correct? Meaning that if one strategy gives you a benefit, you lose the benefits of the others by not using them.

With that being said, would it not be possible to combine different strategies, thus gaining the benefits of all of them while mitigating the disadvantages of some, like those that give generic and monotone outputs?

Oh, and by the way: when autosplitting across GPUs, would it not make more sense to always (or at least have a parameter to) load in last-to-first GPU order? That way we reserve the first GPUs for the system and the last ones for model loading, meaning we'd only see an overload when all GPUs are at full capacity.
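Something like this, purely as an illustration of the suggested ordering; none of these names are exllamav2 API, and the per-layer costs and budgets are made up. Filling the highest-numbered GPU first leaves GPU 0 (usually the one driving the display) free until the model actually needs it:

```python
layers_gb = [1.5] * 40                    # hypothetical per-layer VRAM cost
budgets_gb = {0: 24.0, 1: 24.0, 2: 24.0}  # hypothetical per-GPU budgets

assignment = {}
devices = sorted(budgets_gb, reverse=True)  # [2, 1, 0]: last-to-first
for layer, cost in enumerate(layers_gb):
    for dev in devices:
        if budgets_gb[dev] >= cost:
            budgets_gb[dev] -= cost
            assignment[layer] = dev
            break
    else:
        raise RuntimeError("model does not fit in the combined budget")
```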

I'm currently working on a chatbot application where I want to simplify some of the more complicated processes, so this would be a great addition if such an implementation is possible 😇

ChrisCates commented 4 months ago

Hey @cohan8999, @turboderp has done a ton of work, and there is still a ton left to do. It's my bad for suggesting I'd commit to creating this.

I'll be honest with you: I haven't been doing a lot of Llama-based SFT lately and am mostly working with Claude and GPT-4 SFT these days.

In terms of strategies, @cohan8999: no, this does not impact top-K or top-P sampling. It actually enhances the token sampling process.

In regards to multiple algorithms: I'm not sure what you mean. I'm not fully up to date on the latest token sampling techniques, and I highly recommend you do a deep dive into the current ecosystem for token sampling. It's not black and white; you don't pick one or the other. They can often work in conjunction... and sometimes cannot.
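To make the "work in conjunction" point concrete: top-K and top-P are just successive masks applied to the same logits before the final draw, so they compose rather than exclude each other. A minimal sketch (names and defaults are illustrative, not exllamav2 API):

```python
import torch

def sample(logits: torch.Tensor, top_k: int = 50, top_p: float = 0.9,
           temperature: float = 0.8) -> int:
    """Draw one token after applying temperature, top-K, and top-P in turn."""
    logits = logits / temperature

    # Top-K: keep only the k highest-scoring tokens.
    kth = torch.topk(logits, top_k).values[-1]
    logits[logits < kth] = float("-inf")

    # Top-P (nucleus): keep the smallest set of tokens whose cumulative
    # probability reaches p, applied on top of the top-K mask.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cum = sorted_probs.cumsum(-1)
    mask = cum - sorted_probs > top_p  # cumulative mass *before* each token
    logits[sorted_idx[mask]] = float("-inf")

    # Final draw from whatever survived both filters.
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, 1).item()
```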

Cheers, Chris

turboderp commented 2 months ago

Part of the motivation for the dynamic generator is to have a better framework for sampling strategies like beam search, so it's probably coming at some point. It's not in particularly high demand, though, as it's a super-greedy algorithm, and everyone's looking away from that towards more creative random sampling approaches.