iamlemec / bert.cpp

GGML implementation of BERT model with Python bindings and quantization.
MIT License

Can BAAI/bge-m3 be supported? #4

Open sweetcard opened 7 months ago

sweetcard commented 7 months ago

Thank you for your excellent work.

bge-m3 is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.

When running the following command: python convert-to-ggml.py './bge-m3' f16

Traceback (most recent call last):
FileNotFoundError: [Errno 2] No such file or directory: './bge-m3/vocab.txt'
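The error comes from the converter assuming a BERT-style vocab.txt, while bge-m3 (an XLM-RoBERTa-style model) ships a sentencepiece model file instead. A minimal sketch of how the converter could detect which tokenizer family a checkpoint uses (the function name `detect_vocab_type` is hypothetical, not part of convert-to-ggml.py):

```python
import os

def detect_vocab_type(model_dir):
    """Guess the tokenizer family from the files a checkpoint ships.

    BERT-style checkpoints include vocab.txt; XLM-RoBERTa-style
    checkpoints (like bge-m3) ship sentencepiece.bpe.model instead,
    which is why a hard-coded vocab.txt path raises FileNotFoundError.
    """
    if os.path.exists(os.path.join(model_dir, "vocab.txt")):
        return "wordpiece"
    if os.path.exists(os.path.join(model_dir, "sentencepiece.bpe.model")):
        return "sentencepiece"
    raise FileNotFoundError(f"no known vocab file in {model_dir}")
```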

Will you make some changes to convert-to-ggml.py to support the new model?

iamlemec commented 7 months ago

Yup, definitely want to support the new magic from BAAI. It looks like they use a different tokenizer (XLMRobertaTokenizer) and a slightly different model architecture (xlm-roberta). I think we can copy over some more general vocab conversion strategies from llama.cpp/convert.py and then tweak the model code a bit.
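The vocab-extraction step being borrowed from llama.cpp/convert.py amounts to walking a loaded sentencepiece processor and collecting (piece, score) pairs for export. A hedged sketch of that idea; `extract_vocab` is a hypothetical helper, and `sp` is assumed to expose the standard SentencePieceProcessor methods (vocab_size, IdToPiece, GetScore):

```python
def extract_vocab(sp):
    """Sketch of sentencepiece vocab extraction: collect every
    (piece, score) pair from a loaded processor so the converter
    can serialize them, in the spirit of llama.cpp/convert.py."""
    return [(sp.IdToPiece(i), sp.GetScore(i)) for i in range(sp.vocab_size())]
```

In practice the processor would be loaded with `SentencePieceProcessor` from the sentencepiece package, pointed at the model's sentencepiece.bpe.model file.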

If you have any tips or ideas on this, I'm all ears. Either way, will be looking into this.

iamlemec commented 7 months ago

Ok, I think it's basically working. The embeddings still differ slightly from what Hugging Face gives, but they're pretty close. There may be one or two things I'm not getting quite right.

Will keep refining in the coming days.
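One generic way to quantify "pretty close" when comparing the ggml embeddings against the Hugging Face reference is cosine similarity between the two vectors (a standalone sketch, not code from bert.cpp):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors; values near
    1.0 mean the two implementations agree up to scale."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A similarity like 0.999 would suggest only small numerical drift (e.g. f16 rounding), while a notably lower value would point at a real modeling difference.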