oobabooga opened this issue 7 months ago
Hi @oobabooga! Apologies for my late reply. In general we are very interested in adding new quantization schemes to HF transformers. Currently, we're waiting to merge https://github.com/huggingface/transformers/pull/26610 in order to make support for new quantization methods easier for anyone in the future. We had some internal discussion about adding llama.cpp inference support in transformers, and currently we feel that the llama.cpp library moves quite fast, which would make it challenging to maintain in HF transformers overall. This is debatable, so feel free to let us know what you think about it, and we can consider adding llama.cpp after #26610 gets merged.
@oobabooga, it should be possible to externally create a subclass to HF `transformers` with llama.cpp support, independent from the `GptqHfQuantizer` class. It could be hosted outside `transformers`.
Just discussed offline with @ArthurZucker - indeed, you can import the auto mappings that live here: https://github.com/huggingface/transformers/blob/main/src/transformers/quantizers/auto.py and add new quantizers that would first live inside text-generation-webui. If we see that everything is quite stable and not subject to a lot of breaking changes, we can port the quantizers back into transformers core. How does that sound, @oobabooga? I can also work on a PoC PR in your repo as well.
@younesbelkada a PR kickstarting that addition in text-generation-webui would be extremely appreciated. I am not familiar enough with the transformers internals to do it myself -- in particular, porting the llama.cpp cache to transformers has been a blocker in my attempts.
llama_cpp_python has a very complete and comprehensive API. The necessary functions should all be in this file:
After your PoC, I should be able to maintain the code afterwards and accept PRs so it becomes more stable over time.
Where are we on llama.cpp support in the Transformers package?
I would like to see this as well.
Hey! Thanks @oobabooga for the feature request. Before diving into technicalities, I'd like to understand what is the feature you really want as there are many areas where llama.cpp/transformers can be linked and many toolkits out there doing so.
When you say you want to have llama.cpp as a backend for transformers, do you mean that you would like:
1. `transformers` to have llama.cpp/ggml as a backend, like we have for torch/tf/jax (which means an implementation of the models in that format)
2. `transformers` to link to llama.cpp underneath, for example with Python bindings, offering the transformers Python API while leveraging llama.cpp under the hood
3. `transformers` to load `gguf` files and use them with the current backends (so torch, for example)

For 1., I think there is a significant amount of work to enable that, and I'm not sure it would benefit either llama.cpp or transformers, so I don't think that's necessarily what you want to do; if it is, please let me know.
For 2., what is the difference with existing toolkits such as llama-cpp-python or transformers?
For 3., that's actually something we've discussed with @ggerganov and which we're exploring right now. This would mean converting gguf files back to fp32 to use with `transformers`, in order to use them within the Python ecosystem (for example for training, fine-tuning, LoRA, etc.), before converting them back to the gguf file format afterwards.
Thanks for taking the time!
I want to be able to use every sampling function available in the transformers library while having llama.cpp as the backend for the forward passes and the cache handling. That means I do not want to simply convert GGUF models to PyTorch on the fly.
In practical terms, I want to be able to use `model.generate()` with parameters like the following, all while being able to split the workload between the GPU and the CPU for the forward passes with the `n_gpu_layers` parameter:

- Contrastive search (the `penalty_alpha` parameter)
- Prompt lookup decoding (the `prompt_lookup_num_tokens` parameter)
- Classifier-free guidance (the `guidance_scale` parameter)

So, option 2 seems like what I am looking for.
Option 3 also has an appeal in that Transformers doesn't have the ability to load models with fractional bitrates at the moment (like the EXL2 format or the llama.cpp k-quants do), so loading a q4_K_M in PyTorch format would have its own merit. But that's tangential to my motivation while creating this issue.
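For reference, the parameters mentioned above are all ordinary generation settings in transformers, so any backend that satisfies the forward contract inherits them for free. A quick illustration (the values are arbitrary, and a real call would also need a model):

```python
# The knobs oobabooga lists map onto standard transformers generation settings:
# penalty_alpha (contrastive search), prompt_lookup_num_tokens (prompt lookup
# decoding), and guidance_scale (classifier-free guidance).
from transformers import GenerationConfig

gen_config = GenerationConfig(
    penalty_alpha=0.6,           # contrastive search
    top_k=4,                     # used together with penalty_alpha
    prompt_lookup_num_tokens=3,  # prompt lookup decoding
    guidance_scale=1.5,          # classifier-free guidance
)
```

A llama.cpp-backed `model.generate(**inputs, generation_config=gen_config)` is precisely what the feature request asks for.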
@oobabooga If I'm getting it right: you'd like to replace the `model` object with something that runs llama.cpp under the hood, but would have the exact same `.generate` API and user experience. By making the core object compatible with `.generate`, you would get the benefit of a single API for both cases and all the `generate` features of `transformers` available on the llama.cpp side. Is this correct? 🤗
I think it is doable, but it is not entirely trivial -- `.generate` depends on several model and model config attributes, so there has to be a wrapper class around the llama.cpp model (inheriting from `GenerationMixin`) to define those attributes as well as to overwrite a few internal methods as needed. Coincidentally, our next `.generate` project is to refactor it so that it stops being a monolith, which pairs well with the (likely) need to overwrite a few details.

I'm out of bandwidth until the aforementioned project is completed, but it seems like a nice project for a new external library. WDYT @LysandreJik?
Yes, you understood my goal correctly. Note that most sampling parameters in Transformers already work in my project with GGUF files through the LlamacppHF class, which inherits from `PreTrainedModel` and returns logits in the format expected by the library in the `__call__` method. But it doesn't work with sampling parameters that involve passing and returning cache values to `__call__`.
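The pattern described above can be sketched without any real dependencies. This toy adapter only illustrates the shape of the contract (a `__call__` returning an object exposing `.logits` and a cache handle); `FakeLlama` is a stand-in for `llama_cpp.Llama`, and real code would return transformers' `CausalLMOutputWithPast` with torch tensors:

```python
# Illustrative, dependency-free sketch of the LlamacppHF pattern: wrap a
# llama.cpp model in a __call__ that returns an object with .logits and a
# cache handle, which is the shape .generate() consumes. Everything here
# is a toy; FakeLlama stands in for llama_cpp.Llama.
from types import SimpleNamespace


class FakeLlama:
    """Toy stand-in: the 'logits' are just the negated input ids."""

    def eval_tokens(self, tokens):
        return [[-t] for t in tokens]  # one pseudo-logit row per token


class LlamacppHFSketch:
    def __init__(self, model):
        self.model = model
        self.past_tokens = []  # mimics the llama.cpp context state

    def __call__(self, input_ids, use_cache=True):
        logits = self.model.eval_tokens(input_ids)
        if use_cache:
            self.past_tokens.extend(input_ids)
        # SimpleNamespace mimics the attribute access of a transformers
        # model output (out.logits, out.past_key_values).
        return SimpleNamespace(logits=logits, past_key_values=self.past_tokens)


out = LlamacppHFSketch(FakeLlama())([5, 7])
```

The hard part oobabooga mentions is precisely the `past_key_values` half of this contract: llama.cpp keeps its KV cache internally, so the wrapper can only hand back an opaque handle rather than real tensors.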
Understood! I'm not closed to the idea; let's see, as we split the monolithic `generate`, if this isn't something we can easily solve at the same time.
Option 3 as discussed above is being drafted in https://github.com/huggingface/transformers/pull/30391
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
bump
FYI, option 3 (converting GGUF quants to transformers) has already landed! #30391
Feature request
I would like to request llama.cpp as a new model backend in the transformers library.
Motivation
llama.cpp offers:
1. Excellent performance in scenarios where memory bandwidth is an issue, namely CPU inference and GPU + CPU inference.
2. Support for a wide range of GPU vendors and models.
3. Adequate quantization accuracy -- I have compared the perplexities of 4-bit GGUF models to GPTQ, AWQ, EXL2, and bitsandbytes and found them to be competitive (link).
By making the transformers library compatible with GGUF models, the llama.cpp performance on consumer hardware could hopefully be integrated with the features available in transformers and its surrounding ecosystem. In particular, it would be interesting to see the following working seamlessly with llama.cpp:
Your contribution
I have implemented a "llamacpp_HF" wrapper in the file below:
https://github.com/oobabooga/text-generation-webui/blob/main/modules/llamacpp_hf.py
It makes it possible to use the `transformers` `model.generate` with llama.cpp models, and it exemplifies how to make forward calls in llama.cpp and get the logits. It works for perplexity evaluation when `logits_all=True` is passed while loading the model. I additionally implemented some prefix-matching logic and a hacky way to recognize forward calls for negative prompts to make CFG functional.

For the llama.cpp transformers integration, I recommend the following:
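The prefix-matching logic mentioned above can be sketched as follows: when a new prompt shares a prefix with the tokens already evaluated, only the non-matching suffix needs a fresh forward pass, and the llama.cpp KV cache is kept for the shared prefix. `tokens_to_evaluate` is an illustrative helper, not a llama-cpp-python API:

```python
# Self-contained sketch of prefix matching for cache reuse: figure out how
# many cached positions survive and which tokens still need a forward pass.


def longest_common_prefix(a, b):
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tokens_to_evaluate(cached_tokens, new_tokens):
    """Return (keep, suffix): how many cached positions to keep and which
    tokens still need evaluation."""
    keep = longest_common_prefix(cached_tokens, new_tokens)
    # Re-evaluate at least the last token so fresh logits are produced.
    keep = min(keep, len(new_tokens) - 1)
    return keep, new_tokens[keep:]


keep, suffix = tokens_to_evaluate([1, 2, 3, 4, 9], [1, 2, 3, 4, 5, 6])
```

In a real wrapper, `keep` would be used to truncate the llama.cpp context (its internal KV cache) before evaluating `suffix`.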
In the `from_pretrained` call, have a `LlamaCppConfig` object that takes as input arbitrary kwargs that later on get passed to the `llama_cpp.Llama` model-loading call. That would be similar to the `BitsAndBytesConfig` object that is passed to `from_pretrained` when `load_in_4bit=True` is used. Some important parameters are `n_gpu_layers` and `n_ctx`; it would be interesting to make this future-proof and allow arbitrary kwargs to be passed to `LlamaCppConfig`.

I'll tag @younesbelkada, who worked on the RWKV and AWQ integrations in transformers and may find this interesting.
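A minimal sketch of the proposed `LlamaCppConfig`, mirroring the `BitsAndBytesConfig` pattern described above; none of these names exist in transformers, and the passthrough design is just one way to get the requested future-proofing:

```python
# Hypothetical sketch of the recommended LlamaCppConfig: known llama.cpp
# loader options get named fields, and arbitrary extra kwargs are passed
# through untouched so new llama.cpp options require no code changes.
class LlamaCppConfig:
    def __init__(self, n_gpu_layers=0, n_ctx=2048, **kwargs):
        self.n_gpu_layers = n_gpu_layers  # layers offloaded to the GPU
        self.n_ctx = n_ctx                # context window size
        self.extra_kwargs = kwargs        # future-proof passthrough

    def to_llama_kwargs(self):
        """Flatten into the kwargs a llama_cpp.Llama(...) call would receive."""
        return {
            "n_gpu_layers": self.n_gpu_layers,
            "n_ctx": self.n_ctx,
            **self.extra_kwargs,
        }


config = LlamaCppConfig(n_gpu_layers=35, n_ctx=4096, rope_freq_base=10000.0)
```

A hypothetical `from_pretrained(..., quantization_config=config)` would then forward `config.to_llama_kwargs()` to the llama.cpp loader, exactly as `BitsAndBytesConfig` parameters are forwarded to bitsandbytes.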