hooshvare / parsbert

🤗 ParsBERT: Transformer-based Model for Persian Language Understanding
https://doi.org/10.1007/s11063-021-10528-4
Apache License 2.0

parsbert with flair #15

Closed: rezatakhshid closed this issue 3 years ago

rezatakhshid commented 3 years ago

Hi, I'm getting this error when trying to load embeddings using flair. Any idea what's going on? Am I using the right model? I just need the embedding vectors.

The code:

from flair.data import Sentence
from flair.models import SequenceTagger
from flair.embeddings import TransformerWordEmbeddings

bert_embedding = TransformerWordEmbeddings("HooshvareLab/bert-fa-base-uncased")
sentence = Sentence('علی اکبر به شهر تهران رفت')
bert_embedding.embed(sentence)

The error:

Traceback (most recent call last):
  File "/Users/reza/code/parsbert/playground.py", line 8, in <module>
    bert_embedding.embed(sentence)
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/flair/embeddings/base.py", line 60, in embed
    self._add_embeddings_internal(sentences)
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/flair/embeddings/token.py", line 923, in _add_embeddings_internal
    self._add_embeddings_to_sentence(sentence)
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/flair/embeddings/token.py", line 995, in _add_embeddings_to_sentence
    encoded_inputs = self.tokenizer.encode_plus(tokenized_string,
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2378, in encode_plus
    return self._encode_plus(
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 458, in _encode_plus
    batched_output = self._batch_encode_plus(
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 377, in _batch_encode_plus
    self.set_truncation_and_padding(
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 335, in set_truncation_and_padding
    self._tokenizer.enable_truncation(max_length, stride=stride, strategy=truncation_strategy.value)
OverflowError: int too big to convert

m3hrdadfi commented 3 years ago

Hi @rezatakhshid ,

model_max_length hasn't been set in the tokenizer configuration for that version (v2). The easiest and best solution is to use the newer release (v3):

bert_embedding = TransformerWordEmbeddings('HooshvareLab/bert-fa-zwnj-base')
sentence = Sentence('علی اکبر به شهر تهران رفت')
bert_embedding.embed(sentence)

The output:

Some weights of the model checkpoint at HooshvareLab/bert-fa-zwnj-base were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at HooshvareLab/bert-fa-zwnj-base and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[Sentence: "علی اکبر به شهر تهران رفت"   [− Tokens: 6]]
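
For anyone who has to stay on the v2 checkpoint, one possible workaround is to give the tokenizer a finite model_max_length. The sketch below is not from the ParsBERT or flair maintainers. It assumes that an unset model_max_length shows up as a very large integer (which is what the OverflowError above suggests) and that 512, the usual BERT-base position limit, is appropriate for this model. clamp_max_length, patch_tokenizer, and both constants are hypothetical names, and a SimpleNamespace stands in for the real tokenizer so the snippet runs without downloading anything.

```python
from types import SimpleNamespace

# Assumption: 512 is the usual BERT-base position limit; check the model config.
BERT_MAX_POSITIONS = 512
# Assumption: anything above this bound is treated as "not really set".
SANE_UPPER_BOUND = 1_000_000


def clamp_max_length(model_max_length, fallback=BERT_MAX_POSITIONS):
    """Return a usable max length, replacing missing or huge values with fallback."""
    if model_max_length is None or model_max_length > SANE_UPPER_BOUND:
        return fallback
    return int(model_max_length)


def patch_tokenizer(tokenizer, fallback=BERT_MAX_POSITIONS):
    """Overwrite tokenizer.model_max_length in place with a clamped value."""
    current = getattr(tokenizer, "model_max_length", None)
    tokenizer.model_max_length = clamp_max_length(current, fallback)
    return tokenizer


# Stand-in for a tokenizer whose configuration lacks model_max_length.
fake_tokenizer = SimpleNamespace(model_max_length=int(1e30))
patch_tokenizer(fake_tokenizer)
print(fake_tokenizer.model_max_length)
```

With a real tokenizer, the same call would be patch_tokenizer(tokenizer). Whether flair picks up the change after TransformerWordEmbeddings has been constructed depends on the flair version and is untested here, so switching to the v3 model as suggested above remains the simpler route.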

rezatakhshid commented 3 years ago

Thanks @m3hrdadfi Jan.