guillaume-be / rust-bert

Rust native ready-to-use NLP pipelines and transformer-based models (BERT, DistilBERT, GPT2,...)
https://docs.rs/crate/rust-bert
Apache License 2.0

Failed to load converted models. #260

Closed garyhai closed 2 years ago

garyhai commented 2 years ago

I am very interested in multilingual embedding models, but there are no converted multilingual models available. Following the comments in the sentence_embeddings_local example, I converted several models successfully. But when I run the example, I get a different error for each model. Any suggestions? Thanks.

paraphrase-multilingual-MiniLM-L12-v2

Error: Tokenizer error: File not found error: resources/paraphrase-multilingual-MiniLM-L12-v2/vocab.txt vocabulary file not found :No such file or directory (os error 2)

distiluse-base-multilingual-cased-v1

Error: Tch tensor error: cannot find the tensor named distilbert.transformer.layer.5.ffn.lin2.weight in resources/distiluse-base-multilingual-cased-v1/rust_model.ot

paraphrase-multilingual-mpnet-base-v2

thread 'main' panicked at 'could not parse configuration: Error("unknown variant `xlm-roberta`, expected one of `Bart`, `Bert`, `DistilBert`, `Deberta`, `DebertaV2`, `Roberta`, `XLMRoberta`, `Electra`, `Marian`, `MobileBert`, `T5`, `Albert`, `XLNet`, `GPT2`, `OpenAiGpt`, `Reformer`, `ProphetNet`, `Longformer`, `Pegasus`, `GPTNeo`, `MBart`, `M2M100`, `FNet`", line: 17, column: 29)', src/common/config.rs:42:56
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
guillaume-be commented 2 years ago

Hello @garyhai ,

Can you please share the code that was used to generate these errors so I can reproduce? Do you have issues with the SentenceEmbeddingsBuilder class or by loading these resources manually?

garyhai commented 2 years ago

I followed the instructions in examples/sentence_embeddings_local.rs, replacing the model name all-MiniLM-L12-v2 with the multilingual embedding models named above, and then ran the example with cargo run --example sentence_embeddings_local.

/// Download model:
///   ```sh
///   git lfs install
///   git -C resources clone https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2
///   ```
/// Prepare model:
///   ```sh
///   python ./utils/convert_model.py resources/all-MiniLM-L12-v2/pytorch_model.bin
///   ```
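For the multilingual models, the adapted commands look like this (the clone URLs follow the same sentence-transformers naming pattern as the documented example; paths are what the example expects, not re-verified here):

```sh
# Download one of the multilingual models instead of all-MiniLM-L12-v2
git lfs install
git -C resources clone https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

# Convert the PyTorch weights to the rust-bert .ot format
python ./utils/convert_model.py resources/paraphrase-multilingual-MiniLM-L12-v2/pytorch_model.bin
```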

Thank you so much for this amazing project.

guillaume-be commented 2 years ago

Thank you for the additional information @garyhai . I could reproduce the error with all 3 models:

  1. paraphrase-multilingual-MiniLM-L12-v2 For this model the issue comes from the definition of the tokenizer and its associated files. While the standard BERT model relies on a word-piece vocabulary, this implementation relies on a Unigram model. A SentencePiece-compatible file is provided (sentencepiece.bpe.model); unfortunately, loading it with the sentencepiece library leads to different results than Huggingface's AutoTokenizer:

    from transformers import AutoTokenizer, AutoModel
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path='sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
    sentences = 'This is an example sentence'
    encoded_input = tokenizer.encode(sentences, padding=True, truncation=True, return_tensors='pt')
    >>> tensor([[     0,   3293,     83,    142,  27781, 149357,      2]])

    while

    import sentencepiece
    tokenizer = sentencepiece.SentencePieceProcessor("path/to/sentencepiece.bpe.model")
    sentences = 'This is an example sentence'
    encoded_input = tokenizer.encode(sentences)
    >>> [3292, 82, 141, 27780, 149356]

    All indices are systematically shifted by one unit. The tokenizer implemented as part of this crate matches the original implementation of the algorithm, and I am unsure which operations AutoTokenizerFast applies to arrive at this tokenization output. Note that you could probably use Huggingface's implementation at https://github.com/huggingface/tokenizers, tokenize your input, and then use the resources in this library to embed the encoded input. You would probably need to import the required components individually and would not be able to leverage the ready-to-use SentenceEmbeddingsModel pipeline.

  2. distiluse-base-multilingual-cased-v1 The DistilBERT-based models have a special conversion process: a prefix must be added to the variable names, and the suffix must be stripped for the dense projection layer. I have tested this and could embed documents with the updated conversion process:

    python ./utils/convert_model.py resources/path/to/pytorch_model.bin --prefix distilbert.
    python ./utils/convert_model.py resources/path/to/2_Dense/pytorch_model.bin --suffix

    I am pushing some changes in https://github.com/guillaume-be/rust-bert/pull/263 to document this conversion stage.

  3. paraphrase-multilingual-mpnet-base-v2 This model unfortunately faces the same situation as (1) - and uses a tokenizer that is not supported by the current crate implementation.
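On (1) and (3): the one-unit shift is consistent with the fairseq-style vocabulary alignment used by Huggingface's XLM-RoBERTa-type tokenizers, where the raw SentencePiece ids are offset and the sequence is wrapped with the `<s>` (0) and `</s>` (2) special tokens. A minimal sketch of the mapping, using the ids from the two snippets above (the +1 offset is inferred from the observed outputs, not taken from any conversion code):

```python
# Raw SentencePiece ids for 'This is an example sentence'
sp_ids = [3292, 82, 141, 27780, 149356]

# Assumed mapping to the Huggingface ids: shift every piece id by +1,
# then wrap the sequence with the <s> (0) and </s> (2) special tokens.
BOS, EOS, OFFSET = 0, 2, 1
hf_ids = [BOS] + [i + OFFSET for i in sp_ids] + [EOS]

print(hf_ids)  # [0, 3293, 83, 142, 27781, 149357, 2]
```

This reproduces the AutoTokenizer tensor shown above, which suggests the discrepancy is a fixed id remapping rather than a different segmentation.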
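On (2): the --prefix flag effectively renames the checkpoint variables so they match the names the Rust DistilBERT implementation looks up (e.g. the distilbert.transformer.layer.5.ffn.lin2.weight missing in the error above). A rough illustration of the renaming, not the actual convert_model.py internals:

```python
# Keys as stored in the sentence-transformers checkpoint (example)
original = [
    "transformer.layer.5.ffn.lin2.weight",
    "transformer.layer.5.ffn.lin2.bias",
]

# `--prefix distilbert.` prepends the given string to every variable name,
# producing the tensor names rust-bert expects to find in rust_model.ot
prefixed = ["distilbert." + name for name in original]

print(prefixed[0])  # distilbert.transformer.layer.5.ffn.lin2.weight
```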

garyhai commented 2 years ago

Very nice response. Thanks again.