guidance-ai / guidance

A guidance language for controlling large language models.

Support for ExLlama library #281

Open paolorechia opened 1 year ago

paolorechia commented 1 year ago

Hi, thanks for the guidance project, I really like it!

I've been wondering for some time about the possibility of extending guidance to support more open-source projects, such as ExLlama (https://github.com/turboderp/exllama) and RWKV models (https://github.com/BlinkDL/RWKV-LM).

Is there an interest from the maintainers in adding this support?

I've looked a bit into guidance's source code, and it seems the key requirement for this kind of integration is a way to add a bias to the model's next-token logits. I'm fairly sure this is possible with these libraries (though it might require pull requests in their repositories too).

For instance, ExLlama's sample function already accepts logits as a parameter and even supports adding a negative bias to constrained tokens: https://github.com/turboderp/exllama/blob/a01b25c884881871a0f75c96bbc582b6581665cb/generator.py#L88
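To make the idea concrete, here is a minimal, untested sketch of the mechanism in plain PyTorch (the function and names are placeholders, not ExLlama's or guidance's actual API): a bias vector masks out every token outside the allowed set before the next token is chosen.

```python
import torch

def sample_with_bias(logits: torch.Tensor, allowed_token_ids: list) -> int:
    """Pick the next token after masking out everything outside the allowed set.

    `logits` is a 1-D tensor of next-token scores over the vocabulary.
    Adding a large negative bias to disallowed tokens is the effect guidance
    needs from a backend like ExLlama.
    """
    bias = torch.full_like(logits, float("-inf"))
    bias[allowed_token_ids] = 0.0
    return int(torch.argmax(logits + bias).item())

# e.g. restrict generation to two specific token ids:
# next_id = sample_with_bias(next_token_logits, [42, 7])
```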

What do you think?

I'd also be happy to give this a shot, though it may be a bit hard for me since I'm not a core maintainer of any of these repositories.

Thanks!

paolorechia commented 1 year ago

To start, I opened a PR on exllama to support applying a bias to the logits when generating a token: https://github.com/turboderp/exllama/pull/104

andysalerno commented 1 year ago

I similarly would love exllama support, as it's currently the fastest and most memory-efficient executor of models that I'm aware of. On my 3080 12GB, a 13B model (4bit gptq) fits comfortably when loaded with exllama even for long context sizes, while the normal gptq/autogptq loaders will run out of memory if my context gets too long.

I played around a bit with the exllama wrapper in oobabooga/text-generation-webui and hacked on it until guidance accepted it. It does work - but only if I disable caching, which is unfortunately a pretty big compromise :( The ExLlamaCache is just too different from the normal HF cache, and guidance seems to expect the HF cache in certain places (like checking the current context length). But someone with more experience and patience than I have could probably get it working.
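To make the mismatch concrete: guidance reads the current sequence length off the HF-style past_key_values tensors, while ExLlama keeps that state inside its own cache object. A rough, untested illustration (the ExLlama attribute name is from memory and may not be exact):

```python
# HF-style cache: the length can be read off the cached key tensors,
# which guidance appears to rely on in places.
def hf_cached_length(past_key_values) -> int:
    # past_key_values: tuple of per-layer (key, value) tensors with
    # shape (batch, num_heads, seq_len, head_dim)
    return past_key_values[0][0].shape[-2]

# ExLlama-style cache: the object tracks its position internally,
# so a shim would have to translate between the two views.
def exllama_cached_length(cache) -> int:
    return cache.current_seq_len  # assumption about the attribute name
```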

Glavin001 commented 1 year ago

> I played around a bit with the exllama wrapper in oobabooga/text-generation-webui, and hacked on it until guidance accepted it, and got it to work - but only if I disable caching

@andysalerno: Is this your WIP code that someone could continue working on?

andysalerno commented 1 year ago

@Glavin001 it is, but I gave up on it since the lower-level stuff went over my head. I was expecting to just write a little shim class that wraps exllama and presents it as a normal HF transformer class, but their internals are so different that it ended up being too difficult for me.
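For anyone picking this up, the shim I had in mind was roughly this shape (untested sketch; the ExLlama call signature below is a placeholder, not the real API):

```python
from types import SimpleNamespace
import torch

class ExLlamaHFShim:
    """Sketch: present an ExLlama model through the minimal surface of an
    HF-style causal LM - call it with input_ids, get back .logits."""

    def __init__(self, exllama_model, exllama_cache, tokenizer):
        self.model = exllama_model    # ExLlama instance (placeholder)
        self.cache = exllama_cache    # ExLlamaCache instance (placeholder)
        self.tokenizer = tokenizer

    def __call__(self, input_ids: torch.Tensor, **kwargs):
        # The real forward call differs from this; ExLlama manages its own
        # cache rather than accepting/returning HF-style past_key_values.
        logits = self.model.forward(input_ids, self.cache)
        # Mimic the .logits attribute of HF's CausalLMOutput.
        return SimpleNamespace(logits=logits)
```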

Using the code you linked, I was able to get token generation working with both Guidance and exllama - but without caching, which decimates performance. And now that I know a bit more, there are probably other issues I didn't account for, so I wouldn't use it.

So far this exllama wrapper seems to be the closest path: https://github.com/oobabooga/text-generation-webui/blob/c95009d2bd55b5b332148a53a445f3c619a004a5/modules/exllama_hf.py#L27C1-L27C34

But it doesn't provide everything Guidance needs to work.