marella / ctransformers

Python bindings for the Transformer models implemented in C/C++ using the GGML library.
MIT License

Falcon support? #24

Closed · matthoffner closed 1 year ago

matthoffner commented 1 year ago

I've been tracking the Falcon PR (https://github.com/ggerganov/ggml/pull/231), and as I understand it, it currently won't work with a released version of ggml.

Any suggestions on how to test it config-wise are appreciated; I'm assuming the llama model type might not work, based on other PRs.

marella commented 1 year ago

I'm also waiting for that PR to be merged. Hopefully it will be merged this weekend https://github.com/ggerganov/ggml/pull/231#issuecomment-1594411046

matthoffner commented 1 year ago

Thanks, do you know if it is possible to point ctransformers to a branch of ggml for testing?

TheBloke commented 1 year ago

+1 to this

However, I don't think the ggml PR is the one to implement. Instead, I would use the new implementation in ggllm.cpp: https://github.com/cmp-nct/ggllm.cpp

This is now the best Falcon GGML implementation, including CUDA GPU acceleration with support for both 7B and 40B models.

I don't know if this will also end up in the GGML repo, or maybe even eventually in the llama.cpp repo (as ggllm.cpp is a fork of llama.cpp).

But either way, this is the Falcon implementation of interest right now.

And I wonder whether there's even a need to wait for it to be fully stable; it's already useful and being used by people. I have four Falcon GGML repos now.

If ctransformers supported this I think it would help accelerate the use of Falcon GGML.

marella commented 1 year ago

@matthoffner It is not possible to point ctransformers to a branch of ggml, as the model code has to be modified to integrate it into the common interface I provide for all models.


Thanks @TheBloke. I was waiting for the PR to be merged, but since you are already providing the files, I added experimental support for Falcon models using the ggllm fork in the latest version, 0.2.10.

It has CUDA support, similar to the LLaMA models. I tested with the 7B model, but my machine doesn't have enough memory for the 40B model.
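For anyone who wants to try it, here is a minimal sketch of loading one of the Falcon GGML models with GPU offloading, assuming the same gpu_layers parameter as the LLaMA models (the value 50 is just an illustration):

from ctransformers import AutoModelForCausalLM

# Load a Falcon GGML model; gpu_layers controls how many layers are offloaded to the GPU.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/falcon-7b-instruct-GGML",
    model_type="falcon",
    gpu_layers=50,
)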

TheBloke commented 1 year ago

Fantastic! That's great news, thank you marella. That was super quick.

I will update my READMEs to mention this.

@ParisNeo could you check if this works automatically in LoLLMS, and if so maybe add some Falcon GGML entries? Then I will also mention it in the README, and you will be the first UI to support Falcon GGML! :)

ParisNeo commented 1 year ago

(screenshot: https://discord.com/channels/@me/1097295255801442306/1121584559138545704)

I am using 0.2.10.

Am I missing something?

marella commented 1 year ago

You should use model_type="falcon":

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("TheBloke/falcon-7b-instruct-GGML", model_type="falcon")
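Continuing from the snippet above, the loaded model object is callable, so a quick sanity check could look something like this (the prompt is just an example):

# Generate a short completion to verify the model loads and runs.
print(llm("AI is going to"))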
matthoffner commented 1 year ago

@marella @TheBloke Thank you!! I think I've got the 40B running with GPU on an HF space:

https://huggingface.co/spaces/matthoffner/falcon-fastapi

TheBloke commented 1 year ago

I've added config.json to my four repos, so manual model_type selection by the client shouldn't be needed from now on.

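With config.json in place, the model type should be picked up automatically, so (if I understand correctly) loading reduces to:

from ctransformers import AutoModelForCausalLM

# No model_type argument needed now that config.json in the repo provides it.
llm = AutoModelForCausalLM.from_pretrained("TheBloke/falcon-7b-instruct-GGML")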

ParisNeo commented 1 year ago

Thank you very much for this nice work. I tested the 7B model on my PC and it is really solid, even compared to 13B models from other families. @marella, do you have a Twitter account so I can follow you and add a link when I credit your work?

TheBloke commented 1 year ago

Yeah, I was wondering about that too.

marella commented 1 year ago

Hey, I don't have a Twitter account. I'm on LinkedIn (https://www.linkedin.com/in/ravindramarella/), but I don't post anything there. If you want to link to something, you can just link to this repo.

ParisNeo commented 1 year ago

Ok. Very nice profile by the way. Nice to meet you.