huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Some tokenizers are not really picklable #13200

Closed · ben-davidson-6 closed this issue 3 years ago

ben-davidson-6 commented 3 years ago

Environment info

Who can help

@LysandreJik

Information

The xlmr tokenizer is not really picklable, in that it depends on files on disk in order to be unpickled. This causes issues if you want to use tokenizers in a Spark UDF: Spark pickles the tokenizer and sends it to other nodes to execute, and those nodes will not have the same files on disk.

The only tokenizer I know this happens with is XLMRobertaTokenizer, but I imagine there may be more.
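
For illustration, here is a minimal sketch of the Spark failure mode, assuming a local Spark session and a DataFrame with a string column named text; the UDF name tokenize_len is hypothetical:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType
from transformers import XLMRobertaTokenizer

spark = SparkSession.builder.getOrCreate()
tokenizer = XLMRobertaTokenizer.from_pretrained('./xlmrBaseLocal')

# Spark pickles `tokenizer` into the UDF's closure and ships it to executors;
# unpickling fails there if the executor lacks ./xlmrBaseLocal on disk
@F.udf(returnType=IntegerType())
def tokenize_len(text):
    return len(tokenizer.encode(text))

df = spark.createDataFrame([('hello world',)], ['text'])
df.select(tokenize_len('text')).show()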

To reproduce

import pickle
import os

import sentencepiece as spm
from transformers import XLMRobertaTokenizer

# location on disk of tokenizer
tokenizer_directory = './xlmrBaseLocal'

def unpickle_when_file_in_same_place_and_when_it_isnt(pickled_tokenizer):
    # this works because the vocab file hasn't moved
    pickle.loads(pickled_tokenizer)
    print('successfully unpickled when file NOT MOVED')

    # we move the vocab file and try to unpickle
    os.rename(tokenizer_directory, tokenizer_directory + 'Moved')
    try:
        pickle.loads(pickled_tokenizer)
        print('successfully unpickled when file MOVED')
    except OSError:
        print('failed to unpickle when file MOVED')

    # put tokenizer back
    os.rename(tokenizer_directory + 'Moved', tokenizer_directory)

# load tokenizer and pickle it
tokenizer = XLMRobertaTokenizer.from_pretrained(tokenizer_directory)
pickled_tokenizer = pickle.dumps(tokenizer)

# this prints 
# > successfully unpickled when file NOT MOVED
# > failed to unpickle when file MOVED
unpickle_when_file_in_same_place_and_when_it_isnt(pickled_tokenizer)

# the fix: override the pickling defined here
# https://github.com/huggingface/transformers/blob/master/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py#L171
def __getstate__(self):
    state = self.__dict__.copy()
    # drop the live SentencePieceProcessor (not picklable) and store the
    # serialized model bytes instead, so the pickled state is self-contained
    state["sp_model"] = None
    state["sp_model_proto"] = self.sp_model.serialized_model_proto()
    return state

def __setstate__(self, d):
    self.__dict__ = d
    # for backward compatibility with states pickled before sp_model_kwargs existed
    if not hasattr(self, "sp_model_kwargs"):
        self.sp_model_kwargs = {}
    # rebuild the processor from the serialized bytes rather than a file path
    self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
    self.sp_model.LoadFromSerializedProto(self.sp_model_proto)

# monkey-patch the tokenizer class with the fixed pickling
XLMRobertaTokenizer.__getstate__ = __getstate__
XLMRobertaTokenizer.__setstate__ = __setstate__

# repickle
tokenizer = XLMRobertaTokenizer.from_pretrained(tokenizer_directory)
pickled_tokenizer = pickle.dumps(tokenizer)

# this prints 
# > successfully unpickled when file NOT MOVED
# > successfully unpickled when file MOVED
unpickle_when_file_in_same_place_and_when_it_isnt(pickled_tokenizer)
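
For context, a minimal sketch of why the fix works, using sentencepiece directly; it assumes the SentencePiece model lives at ./xlmrBaseLocal/sentencepiece.bpe.model (the vocab file name XLMRobertaTokenizer uses):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('./xlmrBaseLocal/sentencepiece.bpe.model')

# serialized_model_proto() returns the whole model as bytes, so the pickled
# state can carry the model itself instead of a path on disk
proto = sp.serialized_model_proto()

sp2 = spm.SentencePieceProcessor()
sp2.LoadFromSerializedProto(proto)
assert sp.EncodeAsIds('hello world') == sp2.EncodeAsIds('hello world')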

Expected behavior

The expected behaviour is that, once the tokenizer has been pickled and the prerequisite libraries are installed, I should be able to unpickle it regardless of what is on disk and where.

LysandreJik commented 3 years ago

Hello, thank you for opening this issue! Do you want to open a PR with your fix?

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.