guillaume-be / rust-bert

Rust native ready-to-use NLP pipelines and transformer-based models (BERT, DistilBERT, GPT2,...)
https://docs.rs/crate/rust-bert
Apache License 2.0
2.5k stars, 209 forks

Support all-mpnet-base-v2 #359

Open diptanu opened 1 year ago

diptanu commented 1 year ago

I am looking into adding support for sentence-transformers/all-mpnet-base-v2. I have successfully extracted the Rust weights, and the model files are here: https://huggingface.co/diptanuc/all-mpnet-base-v2

However, the SentenceEmbeddingsBuilder doesn't understand the MPNet architecture. Any thoughts on how new architectures can be added to the library?

guillaume-be commented 1 year ago

Hello, the MPNet architecture would have to be added as a supported model before it can be used for sentence embeddings. The steps are as follows:

  1. Create an MPNet tokenizer on https://github.com/guillaume-be/rust-tokenizers. It seems MPNet is mostly based on a BERT tokenizer, so it may be possible to re-use most of the tokenization code and just define a MPNetVocab, or possibly even load the MPNet tokenizer/vocab files directly in a BertTokenizer. Either approach would have to be tested for equivalence against the reference tokenizer.
  2. Create an MPNet architecture, similar to the other model files. The model architecture looks fairly simple and should be straightforward to port to Rust.
  3. Register the new MPNet architecture for the supported classes (sequence classification, MLM, token classification, and sentence embeddings).
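
The equivalence test mentioned in step 1 could be sketched as below. This is a minimal, standard-library-only sketch, not code from rust-tokenizers: `first_mismatch`, `Mismatch` and `toy_encode` are hypothetical names, the toy vocabulary is made up, and token ids are assumed to be `i64`. In a real check, the two closures would wrap the reference MPNet encoding (for example ids exported from the Python tokenizer) and the candidate Rust tokenizer.

```rust
/// First input on which the two tokenizers disagree (hypothetical helper type).
#[derive(Debug, PartialEq)]
struct Mismatch {
    text: String,
    expected: Vec<i64>,
    actual: Vec<i64>,
}

/// Runs both encoders over every sample and returns the first disagreement, if any.
fn first_mismatch<A, B>(reference: A, candidate: B, samples: &[&str]) -> Option<Mismatch>
where
    A: Fn(&str) -> Vec<i64>,
    B: Fn(&str) -> Vec<i64>,
{
    for &text in samples {
        let expected = reference(text);
        let actual = candidate(text);
        if expected != actual {
            return Some(Mismatch { text: text.to_string(), expected, actual });
        }
    }
    None
}

/// Toy stand-in for a tokenizer (assumption, for illustration only):
/// lowercases, splits on whitespace, looks each word up in `vocab`,
/// and wraps the sequence in the given start/end ids.
fn toy_encode(text: &str, vocab: &[&str], start_id: i64, end_id: i64, unk_id: i64) -> Vec<i64> {
    let mut ids = vec![start_id];
    for word in text.split_whitespace() {
        let lower = word.to_lowercase();
        let id = vocab
            .iter()
            .position(|entry| *entry == lower.as_str())
            .map(|index| index as i64)
            .unwrap_or(unk_id);
        ids.push(id);
    }
    ids.push(end_id);
    ids
}

fn main() {
    let vocab = ["<s>", "</s>", "<unk>", "hello", "world"];
    let samples = ["hello world", "Hello unknown"];

    let reference = |t: &str| toy_encode(t, &vocab, 0, 1, 2);
    // Same settings as the reference: no mismatch is expected.
    let same = |t: &str| toy_encode(t, &vocab, 0, 1, 2);
    // Different end-of-sequence id: the first sample should already differ.
    let different = |t: &str| toy_encode(t, &vocab, 0, 0, 2);

    println!("same settings: {:?}", first_mismatch(&reference, &same, &samples));
    println!("different end id: {:?}", first_mismatch(&reference, &different, &samples));
}
```

Running the comparison over a large and varied set of sentences (punctuation, accents, casing, very long inputs) gives more confidence than a handful of samples, since tokenizers that agree on plain text often differ only on special tokens or normalization.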

diptanu commented 1 year ago

@guillaume-be Thanks for your feedback! I will fork the repo, make the changes and send you a PR :)

AJV009 commented 1 year ago

Ah, WONDERFUL @diptanu, I was looking for this too. WAITING for your results :zap: Thanks!

AJV009 commented 1 year ago

:grimacing: Is anyone working on this? :see_no_evil:

guillaume-be commented 12 months ago

@AJV009 I am not sure - would you like to start working on it?

AJV009 commented 12 months ago

I do have the time, BUT I would need more guidance; I am just a Rust beginner. :grin:

If you could just explain the points you mentioned here in a little more detail, @guillaume-be:

> Hello, the MPNet architecture would have to be added as a supported model before it can be used for sentence embeddings. The steps are as follows:
>
>   1. Create an MPNet tokenizer on https://github.com/guillaume-be/rust-tokenizers. It seems MPNet is mostly based on a BERT tokenizer, so it may be possible to re-use most of the tokenization code and just define a MPNetVocab, or possibly even load the MPNet tokenizer/vocab files directly in a BertTokenizer. Either approach would have to be tested for equivalence against the reference tokenizer.
>   2. Create an MPNet architecture, similar to the other model files. The model architecture looks fairly simple and should be straightforward to port to Rust.
>   3. Register the new MPNet architecture for the supported classes (sequence classification, MLM, token classification, and sentence embeddings).

So I can at least give it a try :) Also, I just wanted to mention that this whole rust-bert thing is a great initiative. I tried using sentence embeddings for a real-time search application, AND the embeddings were generated in less than 60ms :exploding_head: :zap: (Sure, I know the picture would change under traffic with multiple concurrent requests, but it is still impressive so far.)
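
A latency figure like the one above could be checked over repeated calls with a small timing harness. This is a minimal sketch using only the standard library: `measure_latencies` and `percentile` are hypothetical helper names, and the workload in `main` is a stand-in loop rather than a real embedding call, so it does not reproduce the 60ms number. It also times sequential calls only, not concurrent traffic.

```rust
use std::time::{Duration, Instant};

/// Calls `work` `runs` times and returns the per-call durations, sorted ascending.
fn measure_latencies<F: FnMut()>(mut work: F, runs: usize) -> Vec<Duration> {
    let mut samples = Vec::with_capacity(runs);
    for _ in 0..runs {
        let start = Instant::now();
        work();
        samples.push(start.elapsed());
    }
    samples.sort();
    samples
}

/// Nearest-rank percentile of an ascending-sorted slice.
/// Returns `None` for an empty slice or a percentile above 100.
fn percentile(sorted: &[Duration], p: u32) -> Option<Duration> {
    if sorted.is_empty() || p > 100 {
        return None;
    }
    // Integer ceiling of p/100 * len avoids floating-point rounding surprises.
    let rank = (p as usize * sorted.len() + 99) / 100;
    Some(sorted[rank.max(1) - 1])
}

fn main() {
    // Stand-in workload (assumption): replace with the real embedding call.
    let mut checksum: u64 = 0;
    let samples = measure_latencies(
        || {
            for i in 0..100_000u64 {
                checksum = checksum.wrapping_add(i * i);
            }
        },
        50,
    );

    println!("runs: {}", samples.len());
    println!("p50: {:?}", percentile(&samples, 50));
    println!("p95: {:?}", percentile(&samples, 95));
    println!("max: {:?}", percentile(&samples, 100));
    println!("checksum (keeps the workload observable): {}", checksum);
}
```

Reporting a high percentile alongside the median is a deliberate choice here, since a real-time search application is usually limited by its slowest requests rather than its typical ones.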