Hello @garyhai,
Can you please share the code that was used to generate these errors so I can reproduce them? Do you have issues with the SentenceEmbeddingsBuilder class, or when loading these resources manually?
I followed the instructions in examples/sentence_embeddings_local.rs and replaced the model name all-MiniLM-L12-v2 with the multilingual embedding models whose names I mentioned. Then I ran the example with cargo run --example sentence_embeddings_local.
````rust
/// Download model:
/// ```sh
/// git lfs install
/// git -C resources clone https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2
/// ```
/// Prepare model:
/// ```sh
/// python ./utils/convert_model.py resources/all-MiniLM-L12-v2/pytorch_model.bin
/// ```
````
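For reference, this is roughly what the modified example looks like. The path below is a hypothetical local clone of one of the multilingual models (substitute the directory produced by the download and conversion steps above); the builder API follows the sentence_embeddings_local example:

```rust
use rust_bert::pipelines::sentence_embeddings::SentenceEmbeddingsBuilder;
use tch::Device;

fn main() -> anyhow::Result<()> {
    // Hypothetical local model directory, replacing all-MiniLM-L12-v2
    let model = SentenceEmbeddingsBuilder::local("resources/paraphrase-multilingual-MiniLM-L12-v2")
        .with_device(Device::cuda_if_available())
        .create_model()?;

    let sentences = ["This is an example sentence", "Each sentence is converted"];
    let embeddings = model.encode(&sentences)?;
    println!("{:?}", embeddings);
    Ok(())
}
```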
Thank you so much for this amazing project.
Thank you for the additional information @garyhai. I could reproduce the error with all 3 models:

1. paraphrase-multilingual-MiniLM-L12-v2
For this model, the issue comes from the definition of the tokenizer and its associated files. While the standard BERT model relies on a vocabulary with word pieces, this implementation seems to rely on a Unigram model. A SentencePiece-compatible file is provided (sentencepiece.bpe.model), but unfortunately loading it in the sentencepiece library leads to different results than Huggingface's AutoTokenizer:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path='sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
sentences = 'This is an example sentence'
encoded_input = tokenizer.encode(sentences, padding=True, truncation=True, return_tensors='pt')
print(encoded_input)
# tensor([[ 0, 3293, 83, 142, 27781, 149357, 2]])
```
while
```python
import sentencepiece

tokenizer = sentencepiece.SentencePieceProcessor("path/to/sentencepiece.bpe.model")
sentences = 'This is an example sentence'
# sentencepiece's encode does not accept the transformers-specific keyword arguments
encoded_input = tokenizer.encode(sentences)
print(encoded_input)
# [3292, 82, 141, 27780, 149356]
```
where all indices are systematically shifted by one unit (and the Huggingface tokenizer additionally adds the special tokens 0 and 2 at the sentence boundaries). The tokenizer implemented as part of this crate matches the original implementation of the algorithm, and I am unsure which operations in AutoTokenizerFast lead to this tokenization output. Note that you could probably use Huggingface's implementation at https://github.com/huggingface/tokenizers, tokenize your input, and then use the resources in this library to embed the encoded input. You will probably need to import the required components individually and won't be able to leverage the ready-to-use SentenceEmbeddingsModel pipeline.
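As a rough sketch of that route, the model's tokenizer could be loaded with Huggingface's tokenizers crate, assuming the model repository ships a tokenizer.json for the fast tokenizer (the path below is hypothetical):

```rust
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical path to the tokenizer.json distributed with the model repository
    let tokenizer =
        Tokenizer::from_file("resources/paraphrase-multilingual-mpnet-base-v2/tokenizer.json")?;

    // `true` adds the special tokens, matching AutoTokenizer's behaviour
    let encoding = tokenizer.encode("This is an example sentence", true)?;

    // Expected to match the Python output above: [0, 3293, 83, 142, 27781, 149357, 2]
    println!("{:?}", encoding.get_ids());
    Ok(())
}
```

The resulting ids would then have to be fed manually into the transformer and pooling components imported individually from this crate.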
2. distiluse-base-multilingual-cased-v1

The distil-based models have a special conversion process, requiring a prefix to be added to the variable names and the suffix to be suppressed for the dense projection layer. I have tested this and could embed documents with the updated conversion process:

```sh
python ./utils/convert_model.py resources/path/to/pytorch_model.bin --prefix distilbert.
python ./utils/convert_model.py resources/path/to/2_Dense/pytorch_model.bin --suffix
```
I am pushing some changes in https://github.com/guillaume-be/rust-bert/pull/263 to document this conversion stage.
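After the updated conversion, the model should load through the regular pipeline; a minimal sketch, assuming the standard sentence-transformers directory layout and a hypothetical local path:

```rust
use rust_bert::pipelines::sentence_embeddings::SentenceEmbeddingsBuilder;

fn main() -> anyhow::Result<()> {
    // Hypothetical path to the converted model directory
    let model = SentenceEmbeddingsBuilder::local("resources/distiluse-base-multilingual-cased-v1")
        .create_model()?;

    let embeddings = model.encode(&["This is an example sentence"])?;
    // distiluse applies the 2_Dense projection, so the embeddings should be 512-dimensional
    println!("embedding dimension: {}", embeddings[0].len());
    Ok(())
}
```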
3. paraphrase-multilingual-mpnet-base-v2

This model unfortunately faces the same situation as (1): it uses a tokenizer that is not supported by the current crate implementation.
Very nice response. Thanks again.
I am very interested in multilingual embedding models, but there are no converted multilingual models available. Following the comments in the sentence_embeddings_local example, I converted many models successfully, but when I ran the example I encountered different errors. Any suggestions? Thanks.