huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Marian cannot be fully serialized because it accesses the filesystem after the object instantiation #15982

Closed candalfigomoro closed 2 years ago

candalfigomoro commented 2 years ago

Environment info

Who can help

@patrickvonplaten @SaulLu @Narsil

Information

Model I am using (Bert, XLNet ...): Marian (Helsinki-NLP/opus-mt-it-en)

The problem arises when using:

I'm trying to parallelize inference through Apache Spark. While this works for other models/pipelines (e.g. https://huggingface.co/joeddav/xlm-roberta-large-xnli), it doesn't work for Marian (e.g. https://huggingface.co/Helsinki-NLP/opus-mt-it-en). The tokenizer/model/pipeline object needs to be serialized and broadcast to the worker nodes, so it must carry all of its required data with it. However, when the Marian tokenizer is deserialized and __setstate__ is called, it tries to reload the tokenizer files (source.spm, target.spm, etc.) from the filesystem (see https://github.com/huggingface/transformers/blob/master/src/transformers/models/marian/tokenization_marian.py#L330), but those files aren't available on the worker nodes, so it fails. The __setstate__ method shouldn't access the filesystem at all.
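To illustrate the pattern being requested (a hedged sketch, not the actual transformers code; the class and helper names are hypothetical): a tokenizer-like object can carry the serialized model bytes in its pickled state and rebuild its processor from memory in __setstate__, instead of re-reading source.spm/target.spm from disk.

```python
import pickle

class SpmLikeTokenizer:
    """Toy stand-in illustrating the proposed fix: keep the serialized
    model bytes in the pickled state and rebuild the processor from them,
    so __setstate__ never touches the filesystem."""

    def __init__(self, model_bytes):
        self.model_bytes = model_bytes
        self._processor = self._load(model_bytes)

    @staticmethod
    def _load(blob):
        # Real code would do something like building a SentencePiece
        # processor from the in-memory proto; here we just decode bytes
        # to keep the sketch self-contained.
        return blob.decode("utf-8")

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_processor"] = None  # rebuilt from bytes on unpickle
        return state

    def __setstate__(self, state):
        self.__dict__ = state
        # Rebuild from the bytes carried in the state -- no disk access.
        self._processor = self._load(self.model_bytes)

tok = SpmLikeTokenizer(b"model-proto")
clone = pickle.loads(pickle.dumps(tok))  # survives broadcast/unpickle
```
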

The task I am working on is:

To reproduce

Steps to reproduce the behavior:

from transformers import pipeline

# Build the translation pipeline on the driver, then broadcast it to the workers
translator = pipeline("translation", model=model_dir)
broadcasted_translator = spark_session.sparkContext.broadcast(translator)

def compute_values(iterator):
    for df in iterator:
        batch_size = 32
        sequences = df["text"].to_list()
        res = []
        # Translate in fixed-size batches to bound memory usage
        for i in range(0, len(sequences), batch_size):
            res += broadcasted_translator.value(sequences[i:i+batch_size])
        df["translation"] = [item["translation_text"] for item in res]
        yield df

schema = "text STRING, translation STRING"
sdf = spark_dataframe.mapInPandas(compute_values, schema=schema)

I get the following error:

  File "/tmp/conda-78ffd793-e3a4-4b56-a869-cedd86c5eeaa/real/envs/conda-env/lib/python3.8/site-packages/pyspark/broadcast.py", line 129, in load
    return pickle.load(file)
  File "/tmp/conda-78ffd793-e3a4-4b56-a869-cedd86c5eeaa/real/envs/conda-env/lib/python3.8/site-packages/transformers/models/marian/tokenization_marian.py", line 330, in __setstate__
    self.spm_source, self.spm_target = (load_spm(f, self.sp_model_kwargs) for f in self.spm_files)
  File "/tmp/conda-78ffd793-e3a4-4b56-a869-cedd86c5eeaa/real/envs/conda-env/lib/python3.8/site-packages/transformers/models/marian/tokenization_marian.py", line 330, in <genexpr>
    self.spm_source, self.spm_target = (load_spm(f, self.sp_model_kwargs) for f in self.spm_files)
  File "/tmp/conda-78ffd793-e3a4-4b56-a869-cedd86c5eeaa/real/envs/conda-env/lib/python3.8/site-packages/transformers/models/marian/tokenization_marian.py", line 357, in load_spm
    spm.Load(path)
  File "/tmp/conda-78ffd793-e3a4-4b56-a869-cedd86c5eeaa/real/envs/conda-env/lib/python3.8/site-packages/sentencepiece/__init__.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/tmp/conda-78ffd793-e3a4-4b56-a869-cedd86c5eeaa/real/envs/conda-env/lib/python3.8/site-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "/container_e165_1645611551581_313304_01_000001/tmp/model_dir/source.spm": No such file or directory Error #2

Expected behavior

As I've explained in the "Information" section, I should be able to serialize, broadcast, deserialize and apply the tokenizer/model/pipeline within the worker nodes. However, this fails because __setstate__ is called and tries to reload the tokenizer files (source.spm, target.spm, etc.) from a filesystem that is not available to the worker nodes. The __setstate__ method shouldn't access the filesystem.
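As an aside (a possible workaround not discussed in this thread; names are illustrative): on Spark one can broadcast only the model directory path, a plain string, and construct the pipeline lazily once per worker process, so the tokenizer is never pickled at all. The caching helper is generic:

```python
# Hypothetical per-process cache: build an expensive object (e.g. a
# transformers pipeline) at most once per worker instead of pickling it.
_cache = {}

def load_once(key, factory):
    """Return the cached object for `key`, calling `factory` only on first use."""
    if key not in _cache:
        _cache[key] = factory()
    return _cache[key]

# Inside compute_values on the worker, one would then do something like:
#   translator = load_once(model_dir,
#                          lambda: pipeline("translation", model=model_dir))
# where model_dir (a string) is what gets broadcast, not the pipeline itself.
```

This sidesteps the __setstate__ problem entirely, at the cost of each worker loading the model from shared storage on first use.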

patrickvonplaten commented 2 years ago

The reason seems to be the sentencepiece library: https://github.com/google/sentencepiece -> should we maybe post the issue there?

Otherwise @candalfigomoro, could you maybe try to use the tokenizers libraries instead of sentencepiece? https://github.com/huggingface/tokenizers

candalfigomoro commented 2 years ago

Otherwise @candalfigomoro, could you maybe try to use the tokenizers libraries instead of sentencepiece? https://github.com/huggingface/tokenizers

@patrickvonplaten Is there a specific tokenizer class in the tokenizers library that I can use as drop-in replacement for MarianTokenizer in a translation pipeline?

Something like pipeline("translation", tokenizer=<TOKENIZER FROM TOKENIZERS LIBRARY>)? Sorry but I'm new to huggingface's libraries.

patrickvonplaten commented 2 years ago

Yes you should be able to use:

from transformers import MarianTokenizerFast

tokenizer = MarianTokenizerFast.from_pretrained(...)

Sorry thanks to @SaulLu , I just noticed that we don't have a fast implementation for MarianTokenizer :-/ We should work on adding this one though (I'll put it on my TODO list)

Narsil commented 2 years ago

Let me know if you need help, I vaguely remember there was a reason it wasn't added, but I can't put my finger on it atm.

patil-suraj commented 2 years ago

@Narsil @patrickvonplaten

IIRC the fast tokenizer was not added because Marian uses two sentencepiece models and two vocabs (one for the source and one for the target), and it's not clear how to add a fast version for such tokenizers

Narsil commented 2 years ago

@patil-suraj Can't we just have two different tokenizer_{source, target}.json files, like the two different spm files?

Also, ideally I wouldn't mutate the state of the tokenizer the way as_target_tokenizer() does. If those 2 objects are different, they should stay 2 different objects. This could be a different PR, but I feel it would also solve issues like len(tokenizer) not matching (there was a recent issue, but I can't find it anymore). It's quite important for fast tokenizers too, because they handle all the special and added tokens, which would be super confusing if it's the same "object" but with two different sets of added tokens. (How do you deal with that in the slow tokenizer?)
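The state mutation being criticized can be sketched with a toy class (hypothetical names; not the actual MarianTokenizer implementation, which switches sentencepiece models rather than plain dicts):

```python
from contextlib import contextmanager

class TwoVocabTokenizer:
    """Toy sketch of the mutation pattern: one object swaps between a
    source and a target vocabulary via a context manager, the way
    as_target_tokenizer() switches models inside a single tokenizer."""

    def __init__(self):
        self.source_vocab = {"hello": 0}
        self.target_vocab = {"ciao": 0}
        self.current = self.source_vocab  # mutable shared state

    @contextmanager
    def as_target_tokenizer(self):
        self.current = self.target_vocab
        try:
            yield self
        finally:
            self.current = self.source_vocab

tok = TwoVocabTokenizer()
with tok.as_target_tokenizer():
    inside = "ciao" in tok.current   # target vocab active here
after = "ciao" in tok.current        # silently back to the source vocab
```

The same object answers vocabulary queries differently depending on where the caller stands, which is exactly why two separate objects would be less confusing.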

patil-suraj commented 2 years ago

Can't we just have two different tokenizer_{source, target}.json files, like the two different spm files?

I actually don't know about this. I have the same question.

It's quite important for fast tokenizers too, because they handle all the special and added tokens, which would be super confusing if it's the same "object" but with two different sets of added tokens

Aah, good catch! Yes, you are right. The current design is not ideal.

How do you deal with that in the slow tokenizer?

In Marian the spm files are only used to tokenize the text, and then the vocab is used to convert tokens to ids and vice versa. And before https://github.com/huggingface/transformers/pull/15831, Marian used a joint vocab file, so we didn't need to handle this.
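The two-stage design described here can be sketched as follows (a toy illustration with a mocked sentencepiece step; function and variable names are made up). The point is that subword splitting and token-to-id conversion are separate stages backed by separate data, so adding a token only to the vocab leaves them out of sync:

```python
# Stage 2 data: the vocab maps subword pieces to ids.
vocab = {"▁ciao": 0, "▁mondo": 1, "<unk>": 2}

def tokenize(text):
    # Stand-in for the sentencepiece step, e.g. spm.encode(text, out_type=str);
    # real sentencepiece output depends on the trained model.
    return ["▁" + word for word in text.split()]

def convert_tokens_to_ids(tokens):
    # Stage 2: pure vocab lookup, independent of the spm model.
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

ids = convert_tokens_to_ids(tokenize("ciao mondo"))
```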

But now Marian can also have two vocab files (one for the source and one for the target), and adding tokens is actually not handled correctly.

Narsil commented 2 years ago

But now Marian can also have two vocab files (one for the source and one for the target), and adding tokens is actually not handled correctly.

I see! Ok, let me know if you start working on this; I can help with the tokenizers part.

candalfigomoro commented 2 years ago

I have the same issue with this model https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-mnli-xnli used as a zero-shot classifier (whereas this other model https://huggingface.co/joeddav/xlm-roberta-large-xnli works fine as a zero-shot classifier).

So it seems like the deberta_v2/tokenization_deberta_v2.py tokenizer has the same problem. Related to https://github.com/huggingface/transformers/pull/15529 ?

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

candalfigomoro commented 2 years ago

@patrickvonplaten @patil-suraj @Narsil So the solution would be to implement a fast tokenizer for Marian?

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.