PrithivirajDamodaran opened this issue 8 months ago
Hi @PrithivirajDamodaran! Sorry I missed your last issue. Forgot to turn on notifications for this repo.
For everyday stuff I just use the `llama-cpp-python` bindings. Here's an example of how to do embeddings: https://github.com/abetlen/llama-cpp-python#embeddings. You can also just use `Llama.embed` if you want to get the raw embedding as a list. (Note: there's a typo in the example, it should be `embedding=True`, not `embeddings=True`.)
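Roughly, that flow looks like this (a quick sketch; the model path is a placeholder):

```python
from llama_cpp import Llama

# Load the model with embedding mode enabled (note: `embedding`, not `embeddings`).
llm = Llama(model_path="path/to/model.gguf", embedding=True)

# High-level call: returns an OpenAI-style dict with the embedding inside.
res = llm.create_embedding("Hello, world!")
print(res["data"][0]["embedding"][:8])

# Or get the raw embedding as a plain list of floats.
vec = llm.embed("Hello, world!")
print(len(vec))
```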
As for the MLM issue: right now, `llama.cpp` doesn't support the MLM head part of the model. It'll only get you up to the D=768 embeddings. It's possible to turn off pooling for the embeddings and then just fetch the token-level embeddings manually. You can do that in raw `llama.cpp`, but that option (`do_pooling=False`) hasn't found its way into `llama-cpp-python` yet. I'm thinking about making a PR for that today, which should hopefully be merged soon.
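Once that lands, I'd expect the Python side to look roughly like this (the `do_pooling` kwarg here is hypothetical, it only exists in raw llama.cpp right now):

```python
from llama_cpp import Llama

# Hypothetical: pooling control passed through to the context params once the
# llama-cpp-python PR is merged. NOT part of the current API.
llm = Llama(model_path="path/to/bert-model.gguf", embedding=True, do_pooling=False)

# With pooling off, the idea is that you get one D=768 vector per token
# instead of a single pooled sentence vector.
token_vectors = llm.embed("an example sentence")
print(len(token_vectors), len(token_vectors[0]))  # n_tokens x 768
```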
Hey @iamlemec, thanks for taking the time and for all the awesome work you are doing.
I was interested to know about the llama.cpp BERT merge because you mentioned in the notice marking this repo defunct that "it's way faster". I will look at the code, but it would be easier to know what the optimisations were :)
Thanks for adding the pooling flag, it will be useful if we need access to the raw embeddings. But the MLM head is just a combination of GeLU and linear layers. I am not fully acquainted with the ggml APIs, but I will see how best I can add that from my side as well.
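For reference, this is what I mean by the MLM head, sketched in PyTorch rather than ggml (shapes are for a D=768 BERT-base model, vocab size is bert-base-uncased's):

```python
import torch.nn as nn

class BertMLMHead(nn.Module):
    """Standard BERT MLM head: dense + GELU + LayerNorm, then project to the vocab."""
    def __init__(self, hidden_size=768, vocab_size=30522):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(hidden_size)
        # Usually weight-tied to the input word embeddings, plus a bias.
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, hidden_size) from the encoder
        x = self.norm(self.act(self.dense(token_embeddings)))
        return self.decoder(x)  # (batch, seq_len, vocab_size) logits
```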
Also, I am looking for bare-metal performance, so the Python bindings aren't currently on my radar.
As we speak, I am working on a fork so the community can take full advantage of all the awesome work that has been done in this space 🙏. Will share more soon.
Cheers, Prithivi
@PrithivirajDamodaran Looks cool! Yeah, so I haven't done benchmarks in a bit, but the main reason it should be faster is that `llama.cpp` packs different sequences together in a single batch, while here we pad the sequences to the same length and make one big square batch. This is pretty inefficient when sequences come in with widely varying lengths. There are probably some other reasons, but I think that's the main one.
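As a toy illustration of what the padding costs (made-up lengths):

```python
# Four sequences with very different lengths.
lengths = [12, 300, 45, 7]

padded_tokens = len(lengths) * max(lengths)  # pad everything to the longest: 4 * 300 = 1200
packed_tokens = sum(lengths)                 # pack sequences back-to-back:   12+300+45+7 = 364

print(padded_tokens, packed_tokens)  # 1200 slots computed vs 364 actually needed
```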
Yup, pooling options are great, especially with some of the new approaches coming out like GritLM.
I am trying to use llama.cpp as you suggested, since it's merged there, for the same BAAI bge-1.5 embedding models. Could you please help me with how to get started? I can't figure out the equivalent of the bert_tokenize part there.
Thanks