iamlemec / bert.cpp

GGML implementation of BERT model with Python bindings and quantization.
MIT License

Using llama.cpp #14

Open PrithivirajDamodaran opened 8 months ago

PrithivirajDamodaran commented 8 months ago

I am trying to use llama.cpp since, as you suggested, support for the same BAAI 1.5 embedding models has been merged there. Could you please help me get started? I can't figure out the equivalent of the bert_tokenize part there.

Thanks

iamlemec commented 8 months ago

Hi @PrithivirajDamodaran! Sorry I missed your last issue. Forgot to turn on notifications for this repo.

For everyday stuff I just use the llama-cpp-python bindings. Here's an example of how to do embeddings: https://github.com/abetlen/llama-cpp-python#embeddings. You can also just use Llama.embed if you want to get the raw embedding as a list. (Note: there's a typo in the example: it should be embedding=True, not embeddings=True.)
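In case it helps, here's a minimal sketch of that flow with llama-cpp-python (the model path is just a placeholder for whichever BGE GGUF file you converted):

```python
from llama_cpp import Llama

# load an embedding model (the path is a placeholder for your converted GGUF file)
llm = Llama(model_path="bge-base-en-v1.5-f16.gguf", embedding=True)

# OpenAI-style response; the vector lives under data[0]["embedding"]
resp = llm.create_embedding("Hello, world!")
print(len(resp["data"][0]["embedding"]))

# or get the raw embedding directly as a list of floats
vec = llm.embed("Hello, world!")
print(len(vec))
```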

As for the MLM issue: right now, llama.cpp doesn't support the MLM head part of the model; it'll only get you up to the D=768 embeddings. It's possible to turn off pooling for the embeddings and then just fetch the token-level embeddings manually. You can do that in raw llama.cpp, but that option (do_pooling=False) hasn't found its way into llama-cpp-python yet. I'm thinking about making a PR for that today, which should hopefully be merged soon.
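Just to make the idea concrete, here's a purely hypothetical sketch of what that could look like from Python once pooling control lands in the bindings (do_pooling is not an actual llama-cpp-python parameter yet):

```python
from llama_cpp import Llama

# hypothetical: do_pooling is not a real llama-cpp-python parameter yet;
# it mirrors the do_pooling=False option in raw llama.cpp
llm = Llama(model_path="bge-base-en-v1.5-f16.gguf", embedding=True, do_pooling=False)

# with pooling off, you'd get one D=768 vector per token instead of a
# single pooled sentence embedding
token_embeddings = llm.embed("token level embeddings for the MLM head")
```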

PrithivirajDamodaran commented 8 months ago

Hey @iamlemec, thanks for taking the time and for all the awesome work you are doing.

As we speak I am working on a fork for the community to take full advantage of all the awesome work that has been done in this space 🙏. Will share more soon.

Cheers, Prithivi

iamlemec commented 8 months ago

@PrithivirajDamodaran Looks cool! Yeah, I haven't done benchmarks in a bit, but the main reason it should be faster is that llama.cpp packs different sequences together in a single batch, while here we pad the sequences to the same length and make one big square batch. That's pretty inefficient when sequences come in with widely varying lengths. There are probably some other reasons, but I think that's the main one.
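A rough back-of-the-envelope illustration of that padding overhead (the lengths here are made up):

```python
# token counts of a few incoming sequences with very different lengths
lengths = [7, 12, 483, 9]

padded = len(lengths) * max(lengths)  # square batch: 4 * 483 = 1932 token slots
packed = sum(lengths)                 # packed batch:            511 token slots

print(f"padded: {padded}, packed: {packed}, wasted: {1 - packed / padded:.0%}")
```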

Yup, pooling options are great, especially with some of the new approaches coming out like GritLM.