Closed · candalfigomoro closed this issue 2 years ago
The reason seems to be the sentencepiece library: https://github.com/google/sentencepiece -> should we maybe post the issue there?
Otherwise @candalfigomoro, could you maybe try to use the tokenizers library instead of sentencepiece? https://github.com/huggingface/tokenizers
@patrickvonplaten Is there a specific tokenizer class in the tokenizers library that I can use as a drop-in replacement for MarianTokenizer in a translation pipeline? Something like pipeline("translation", tokenizer=<TOKENIZER FROM TOKENIZERS LIBRARY>)? Sorry, but I'm new to huggingface's libraries.
Yes you should be able to use:

```python
from transformers import MarianTokenizerFast

tokenizer = MarianTokenizerFast.from_pretrained(...)
```
Sorry, thanks to @SaulLu I just noticed that we don't have a fast implementation for MarianTokenizer :-/ We should work on adding this one though (I'll put it on my TODO list)
Let me know if you need help, I vaguely remember there was a reason it wasn't added, but I can't put my finger on it atm.
@Narsil @patrickvonplaten
IIRC the fast tokenizer was not added because Marian uses two sentencepiece models and two vocabs (one for source, one for target), and it's not clear how to add a fast version for such tokenizers
@patil-suraj Can't we just have two different tokenizer_{source, target}.json files, like you have 2 different spm files?
Also, ideally I wouldn't mutate the state of the tokenizer the way as_target_tokenizer() does. If those 2 objects are different, they should stay 2 different objects.

This could be a different PR, but I feel it would also solve issues like the fact that len(tokenizer) doesn't match (there was a recent issue but I can't find it anymore). It's quite important for fast tokenizers too, because they handle all the special and added tokens, which would be super confusing if it's the same "object" but 2 different sets of tokens have been added. (How do you deal with that in the slow tokenizer?)
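For readers unfamiliar with the pattern being criticized here: as_target_tokenizer() is a context manager that temporarily swaps the tokenizer's internal state so the same object encodes with the target vocab. A minimal toy sketch of that mutation (this is an illustration, not the real MarianTokenizer implementation, and the vocabs are invented):

```python
# Toy sketch of the state-mutating context-manager pattern that
# as_target_tokenizer() implements: one object silently switching
# between a source and a target vocab.
from contextlib import contextmanager

class TwoVocabTokenizer:
    """Toy tokenizer holding a source and a target vocab in one object."""

    def __init__(self, source_vocab, target_vocab):
        self.source_vocab = source_vocab
        self.target_vocab = target_vocab
        self.current_vocab = source_vocab  # mutable state, as discussed above

    @contextmanager
    def as_target_tokenizer(self):
        # Temporarily swap the active vocab -- the mutation being criticized.
        self.current_vocab = self.target_vocab
        try:
            yield self
        finally:
            self.current_vocab = self.source_vocab

    def __call__(self, text):
        # 0 stands in for the <unk> id.
        return [self.current_vocab.get(tok, 0) for tok in text.split()]
```

Keeping source and target as two separate tokenizer objects would make it unambiguous which vocab (and which set of added tokens) any given encode call uses.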
> Can't we just have two different tokenizer_{source, target}.json files like you have 2 different spm files?
I actually don't know about this. I have the same question.
> It's quite important for fast tokenizers too because they handle all the special and added tokens, which would be super confusing if it's the same "object" but they would have 2 different things added.
Aah, good catch! Yes, you are right. The current design is not ideal.
> How do you deal with that in the slow tokenizer?
In Marian the spm files are only used to tokenize the text, and then the vocab is used to convert tokens to ids and vice versa.
And before https://github.com/huggingface/transformers/pull/15831, Marian used a joint vocab file, so we didn't need to handle this.
But now marian can also have two vocab files (for source and target) and adding tokens is actually not handled correctly.
I see! OK, let me know if you start working on this, I can help with the tokenizers part.
I have the same issue with this model https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-mnli-xnli used as a Zero Shot Classifier (this other model https://huggingface.co/joeddav/xlm-roberta-large-xnli works fine as a Zero Shot Classifier, though). So it seems like the deberta_v2/tokenization_deberta_v2.py tokenizer has the same problem.
Related to https://github.com/huggingface/transformers/pull/15529 ?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@patrickvonplaten @patil-suraj @Narsil So the solution would be to implement a fast tokenizer for Marian?
Environment info

transformers version: 4.17.0

Who can help

@patrickvonplaten @SaulLu @Narsil
Information
Model I am using (Bert, XLNet ...): Marian (Helsinki-NLP/opus-mt-it-en)
The problem arises when using:
I'm trying to parallelize inference through Apache Spark. While this works for other models/pipelines (e.g. https://huggingface.co/joeddav/xlm-roberta-large-xnli), it doesn't work for Marian (e.g. https://huggingface.co/Helsinki-NLP/opus-mt-it-en). The tokenizer/model/pipeline object needs to be serialized and broadcast to the worker nodes, so it needs to carry all the required data. However, when the Marian tokenizer is deserialized and __setstate__ is called, it tries to reload the tokenizer files (source.spm, target.spm, etc.) from the filesystem (see https://github.com/huggingface/transformers/blob/master/src/transformers/models/marian/tokenization_marian.py#L330), but those files aren't available to the worker nodes anymore, so it fails. The __setstate__ method shouldn't access the filesystem.

The tasks I am working on is:
To reproduce
Steps to reproduce the behavior:
I get the following error:
Expected behavior
As I've explained in the "Information" section, I should be able to serialize, broadcast, deserialize, and apply the tokenizer/model/pipeline within the worker nodes. However, it fails because __setstate__ is called and tries to reload the tokenizer files (source.spm, target.spm, etc.) from a filesystem which is not available to the worker nodes. The __setstate__ method shouldn't access the filesystem.
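The failure mode can be reproduced with plain pickle, which is essentially what Spark's broadcast does under the hood. The class below is a hypothetical stand-in for the pickling logic in tokenization_marian.py, not the real code: __getstate__ drops the unpicklable SentencePiece processors, and __setstate__ reloads them from disk, which raises on a worker that doesn't have the .spm files:

```python
# Hypothetical stand-in for MarianTokenizer's pickling behaviour; the
# real class wraps sentencepiece.SentencePieceProcessor objects.

class MarianLikeTokenizer:
    def __init__(self, source_spm, target_spm):
        self.spm_files = [source_spm, target_spm]
        self.spm_source = self._load(source_spm)
        self.spm_target = self._load(target_spm)

    @staticmethod
    def _load(path):
        # Stand-in for loading a SentencePiece model: reads from disk.
        with open(path, "rb") as f:
            return f.read()

    def __getstate__(self):
        state = self.__dict__.copy()
        # Real SentencePieceProcessor objects are not picklable, so the
        # serialized state drops them...
        state["spm_source"] = state["spm_target"] = None
        return state

    def __setstate__(self, d):
        self.__dict__ = d
        # ...and reloads them from the filesystem here. On a Spark worker
        # the .spm files don't exist, so this raises FileNotFoundError.
        self.spm_source, self.spm_target = (self._load(f) for f in self.spm_files)
```

Unpickling works on the machine that still has the .spm files and fails with FileNotFoundError anywhere else, which matches the reported behaviour.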