huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0
9.04k stars 799 forks

OverflowError: int too big to convert #639

Closed edurenye closed 6 months ago

edurenye commented 3 years ago

I'm trying to do summarization in Spanish and the only model I found is: https://huggingface.co/mrm8488/bert2bert_shared-spanish-finetuned-muchocine-review-summarization

But when I do:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("mrm8488/bert2bert_shared-spanish-finetuned-muchocine-review-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/bert2bert_shared-spanish-finetuned-muchocine-review-summarization")

I get a warning:

The following encoder weights were not tied to the decoder ['bert/pooler']
The following encoder weights were not tied to the decoder ['bert/pooler']

And then when I do:

batch = tokenizer.prepare_seq2seq_batch(src_texts=[text])

I get the following error:

OverflowError: int too big to convert

Am I doing something wrong, or is this a bug?
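Editor's note: the thread never pins down the cause, so the following is only a sketch under an assumption. When a tokenizer config does not define `model_max_length`, transformers falls back to a very large sentinel (`int(1e30)`). If that value is used as a truncation length, it cannot fit in the fixed-width integer that the Rust-backed tokenizer expects. The pure-Python snippet below reproduces that class of failure without downloading any model. The helper names (`to_u64_bytes`, `clamp_max_length`, `try_convert`), the 64-bit width, and the 512 fallback are illustrative assumptions, not part of any library.

```python
# Sentinel that transformers uses when model_max_length is not configured.
# That it is the cause of the error in this thread is an ASSUMPTION.
VERY_LARGE_INTEGER = int(1e30)

# Largest value an unsigned 64-bit integer can hold (assumed native width).
U64_MAX = 2**64 - 1


def to_u64_bytes(value: int) -> bytes:
    """Mimic handing a Python int to code that stores it in 64 bits."""
    return value.to_bytes(8, "little", signed=False)


def clamp_max_length(model_max_length: int, fallback: int = 512) -> int:
    """Return a usable truncation length, replacing an oversized sentinel.

    The fallback of 512 is an assumption (a common BERT-style limit).
    """
    if model_max_length > U64_MAX:
        return fallback
    return model_max_length


def try_convert(value: int) -> str:
    """Report whether the value fits in 64 bits instead of raising."""
    try:
        to_u64_bytes(value)
        return "ok"
    except OverflowError as exc:
        return f"OverflowError: {exc}"


# The raw sentinel overflows; the clamped value converts cleanly.
print(try_convert(VERY_LARGE_INTEGER))
print(try_convert(clamp_max_length(VERY_LARGE_INTEGER)))
```

If this assumption holds for the model above, checking `tokenizer.model_max_length` and passing an explicit `max_length` (for example 512) together with `truncation=True` to the tokenizer call may be worth trying.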

mrm8488 commented 3 years ago

Hi, @edurenye. First, this is not a general-purpose summarization model; it is only for movie reviews. Second, I think this issue belongs in the HF/Transformers repo, because it is not a tokenizer problem. Third, I will send you PyTorch code with an example of the model working.

edurenye commented 3 years ago

Hi @mrm8488, thank you very much for your help. You are right, I filed this issue in the wrong repository, sorry about that. Could you move it to the right repository, or should I create a new one there? If the model is not for summarization, why does it appear when I filter by summarization? See: https://huggingface.co/models?filter=es&pipeline_tag=summarization I guess there is an issue with the filters, then. How can I find a model for summarization in Spanish? And yes, please, if you could send me that code I would appreciate it.

ugm2 commented 3 years ago

@edurenye He is saying that the model is not for GENERAL-purpose summarisation, meaning it is only for summarising reviews, not general text. You could always fine-tune a Spanish version of BERT: https://huggingface.co/Geotrend/bert-base-es-cased

edurenye commented 3 years ago

Thank you! I'll try that BERT base model for Spanish.

Other than that, I think the auto-models should be better tagged, and there should be better error messages when you try to use an incompatible model.

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.