PrithivirajDamodaran opened this issue 8 months ago
Hi @PrithivirajDamodaran! Sorry I missed your last issue. Forgot to turn on notifications for this repo.
For everyday stuff I just use the `llama-cpp-python` bindings. Here's an example of how to do embeddings: https://github.com/abetlen/llama-cpp-python#embeddings. You can also just use `Llama.embed` if you want to get the raw embedding as a list. (Note: there's a typo in the example, it should be `embedding=True`, not `embeddings=True`.)
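Roughly, that flow looks like this (a quick sketch; the model path is a placeholder):

```python
from llama_cpp import Llama

# Load the model with embedding mode enabled (note: `embedding`, not `embeddings`).
llm = Llama(model_path="path/to/model.gguf", embedding=True)

# High-level call: returns an OpenAI-style dict with the embedding inside.
res = llm.create_embedding("Hello, world!")
print(res["data"][0]["embedding"][:8])

# Or get the raw embedding as a plain list of floats.
vec = llm.embed("Hello, world!")
print(len(vec))
```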
As for the MLM issue: right now, `llama.cpp` doesn't support the MLM head part of the model. It'll only get you up to the D=768 embeddings. It's possible to turn off pooling for the embeddings and then just fetch the token-level embeddings manually. You can do that in raw `llama.cpp`, but that option (`do_pooling=False`) hasn't found its way into `llama-cpp-python` yet. I'm thinking about making a PR for that today, which should hopefully be merged soon.
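Once that lands, I'd expect the Python side to look roughly like this (the `do_pooling` kwarg here is hypothetical, it only exists in raw llama.cpp right now):

```python
from llama_cpp import Llama

# Hypothetical: pooling control passed through to the context params once the
# llama-cpp-python PR is merged. NOT part of the current API.
llm = Llama(model_path="path/to/bert-model.gguf", embedding=True, do_pooling=False)

# With pooling off, the idea is that you get one D=768 vector per token
# instead of a single pooled sentence vector.
token_vectors = llm.embed("an example sentence")
print(len(token_vectors), len(token_vectors[0]))  # n_tokens x 768
```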
Hey @iamlemec, thanks for taking the time and for all the awesome work you are doing.
I was interested to know about the llama.cpp BERT merge because you mentioned in the notice marking this repo defunct that "it's way faster". I will look at the code, but it would be easier to know what the optimisations were :)
Thanks for adding the pooling flag, it will be useful if we need access to the raw embeddings. But the MLM head is just a combination of GeLU and linear layers. I am not fully acquainted with the ggml APIs, but I will see how best I can add that from my side as well.
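For reference, this is what I mean by the MLM head, sketched in PyTorch rather than ggml (shapes are for a D=768 BERT-base model, vocab size is bert-base-uncased's):

```python
import torch.nn as nn

class BertMLMHead(nn.Module):
    """Standard BERT MLM head: dense + GELU + LayerNorm, then project to the vocab."""
    def __init__(self, hidden_size=768, vocab_size=30522):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(hidden_size)
        # Usually weight-tied to the input word embeddings, plus a bias.
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, hidden_size) from the encoder
        x = self.norm(self.act(self.dense(token_embeddings)))
        return self.decoder(x)  # (batch, seq_len, vocab_size) logits
```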
Also, I am looking for bare-metal performance, so the Python bindings aren't currently on my radar.
As we speak, I am working on a fork so the community can take full advantage of all the awesome work that has been done in this space 🙏. Will share more soon.
Cheers, Prithivi
@PrithivirajDamodaran Looks cool! Yeah, so I haven't done benchmarks in a bit, but the main reason it should be faster is that `llama.cpp` packs different sequences together in a single batch, while here we pad the sequences to the same length and make one big square batch. This is pretty inefficient when sequences come in with widely varying lengths. There are probably some other reasons, but I think that's the main one.
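As a toy illustration of what the padding costs (made-up lengths):

```python
# Four sequences with very different lengths.
lengths = [12, 300, 45, 7]

padded_tokens = len(lengths) * max(lengths)  # pad everything to the longest: 4 * 300 = 1200
packed_tokens = sum(lengths)                 # pack sequences back-to-back:   12+300+45+7 = 364

print(padded_tokens, packed_tokens)  # 1200 slots computed vs 364 actually needed
```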
Yup, pooling options are great, especially with some of the new approaches coming out like GritLM.
I am trying to use llama.cpp as you suggested, since it's merged there, for the same BAAI bge-1.5 embedding models. Could you please help me with how to get started? I can't figure out the equivalent of the bert_tokenize part there.
Thanks