huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

❓ Define tokenizer from `tokenizers` as a `PreTrainedTokenizer` #14513

Closed · pietrolesci closed this issue 2 years ago

pietrolesci commented 2 years ago

Hi there,

I defined a simple whitespace tokenizer using the tokenizers library and I would like to integrate it with the transformers ecosystem, for example to use it with the DataCollatorWithPadding. Is there an easy (i.e., non-hacky) way to expose a tokenizer from the tokenizers library as a PreTrainedTokenizer?

For reference, please find below the code for the whitespace tokenizer.

Thanks a lot in advance for your help.

Best, Pietro

import os

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import Digits, Punctuation, Sequence, WhitespaceSplit
from tokenizers.trainers import WordLevelTrainer


class WordTokenizer:  # <- Maybe subclassing here?
    def __init__(self, max_vocab_size=30_000, unk_token="[UNK]", pad_token="[PAD]"):
        self.max_vocab_size = max_vocab_size
        self.unk_token = unk_token
        self.pad_token = pad_token
        self.tokenizer, self.trainer = self._build_tokenizer()
        os.environ["TOKENIZERS_PARALLELISM"] = "true"

    def _build_tokenizer(self):
        # Word-level model with BERT-style normalization and digit/punctuation/whitespace pre-tokenization
        tokenizer = Tokenizer(WordLevel(unk_token=self.unk_token))
        tokenizer.normalizer = BertNormalizer()
        tokenizer.pre_tokenizer = Sequence([Digits(), Punctuation(), WhitespaceSplit()])
        trainer = WordLevelTrainer(vocab_size=self.max_vocab_size, special_tokens=[self.pad_token, self.unk_token])
        return tokenizer, trainer

    def __call__(self, text_column, batch):
        # Encode a batch of texts and keep only the token ids
        return {"input_ids": [enc.ids for enc in self.tokenizer.encode_batch(batch[text_column])]}

    @staticmethod
    def _batch_iterator(hf_dataset, batch_size, text_column):
        # Yield batches of raw text from a 🤗 datasets Dataset
        for i in range(0, len(hf_dataset), batch_size):
            yield hf_dataset[i : i + batch_size][text_column]

    def fit(self, hf_dataset, batch_size=1_000, text_column="text"):
        self.tokenizer.train_from_iterator(
            self._batch_iterator(hf_dataset, batch_size, text_column), trainer=self.trainer, length=len(hf_dataset)
        )
        self.vocab_size = self.tokenizer.get_vocab_size()
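
For completeness, this is roughly how I use it (the imdb dataset below is just a placeholder; any 🤗 datasets dataset with a text column works the same way):

from functools import partial

from datasets import load_dataset

# Placeholder dataset, used here only for illustration
dataset = load_dataset("imdb", split="train")

tokenizer = WordTokenizer(max_vocab_size=30_000)
tokenizer.fit(dataset, batch_size=1_000, text_column="text")

# Tokenize the whole dataset with the trained tokenizer
dataset = dataset.map(partial(tokenizer, "text"), batched=True)
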
LysandreJik commented 2 years ago

Hello, have you taken a look at the following documentation? https://huggingface.co/transformers/fast_tokenizers.html

It showcases how to handle tokenizers from the tokenizers library within transformers. Let me know if it helps!
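
In a nutshell, it boils down to something like the following (a rough sketch, where tokenizer stands for your trained tokenizers.Tokenizer instance):

from transformers import PreTrainedTokenizerFast

# Wrap the trained tokenizers.Tokenizer object as a fast transformers tokenizer
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

# Alternatively, load it back from a saved tokenizer.json file
# fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")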

pietrolesci commented 2 years ago

Hi @LysandreJik,

Thanks a lot for your swift reply.

That's exactly what I was looking for. It's a shame I did not find it before asking (I had even tried to write my own subclass of PreTrainedTokenizer 😄!).

Once again, thanks a lot for your help!

Best, Pietro

pietrolesci commented 2 years ago

Hi @LysandreJik,

Just as feedback: running the example in the docs, I noticed that the special tokens (e.g., unk_token, pad_token) are not automatically transferred from the Tokenizer to the PreTrainedTokenizerFast.
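
As a workaround, I am declaring them again when wrapping, along these lines (just a sketch, reusing the same [UNK]/[PAD] tokens as in my snippet above):

from transformers import PreTrainedTokenizerFast

# The special tokens have to be passed explicitly; they are not read from the Tokenizer object
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
)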

I hope this can be useful.

Best, Pietro

LysandreJik commented 2 years ago

Thanks for the heads-up! Pinging @SaulLu and @sgugger for knowledge.

SaulLu commented 2 years ago

Thanks for the feedback @pietrolesci ! :hugs:

It makes me think that maybe we should explain this point in the documentation shared by LysandreJik, because PreTrainedTokenizer indeed has no way of automatically knowing which tokens of the tokenizer correspond to the unk_token, cls_token, etc.

But if you ever see an automatic way to do it, I'd be really happy to discuss it!

pietrolesci commented 2 years ago

Hi @SaulLu,

I agree with you that it's non-trivial to do. I can share my big hack below. For context, I want to define a simple whitespace tokenizer, and my hack is manually creating a special_tokens_map on the original tokenizer. The challenge is that even the underlying Tokenizer does not store the named special tokens (apart from the unk_token, which is available in tokenizer.model).

I hope this helps.

Best, Pietro

import os

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import Digits, Punctuation, Sequence, WhitespaceSplit
from tokenizers.trainers import WordLevelTrainer
from transformers import PreTrainedTokenizerFast


class WordTokenizer(PreTrainedTokenizerFast):

    def __init__(self, **kwargs):
        # NOTE: super().__init__ is deliberately postponed to fit(), once the tokenizer has been trained
        self._tokenizer, self._trainer = self._build_tokenizer(**kwargs)
        os.environ["TOKENIZERS_PARALLELISM"] = "true"

    def _build_tokenizer(self, **kwargs):
        pad_token = kwargs.get("pad_token", "[PAD]")
        unk_token = kwargs.get("unk_token", "[UNK]")
        max_vocab_size = kwargs.get("max_vocab_size", 50_000)

        tokenizer = Tokenizer(WordLevel(unk_token=unk_token))
        tokenizer.normalizer = BertNormalizer()
        tokenizer.pre_tokenizer = Sequence([Digits(), Punctuation(), WhitespaceSplit()])
        trainer = WordLevelTrainer(
            vocab_size=max_vocab_size,
            special_tokens=[pad_token, unk_token],
        )

        # The hack: remember the named special tokens on the Tokenizer object itself
        tokenizer.special_tokens_map = {"pad_token": pad_token, "unk_token": unk_token}
        return tokenizer, trainer

    @staticmethod
    def _batch_iterator(hf_dataset, batch_size, text_column):
        for i in range(0, len(hf_dataset), batch_size):
            yield hf_dataset[i : i + batch_size][text_column]

    def fit(self, hf_dataset, batch_size=1_000, text_column="text"):
        self._tokenizer.train_from_iterator(
            self._batch_iterator(hf_dataset, batch_size, text_column),
            trainer=self._trainer,
            length=len(hf_dataset),
        )
        # Only now wrap the trained Tokenizer as a PreTrainedTokenizerFast ...
        super().__init__(tokenizer_object=self._tokenizer)
        setattr(self, "model_input_names", ["input_ids"])
        # ... and copy the named special tokens onto the fast tokenizer
        for k, v in self._tokenizer.special_tokens_map.items():
            setattr(self, k, v)
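
With this in place, the original use case seems to work along these lines (just a sketch; hf_dataset stands for any 🤗 datasets Dataset with a "text" column):

from transformers import DataCollatorWithPadding

tokenizer = WordTokenizer(pad_token="[PAD]", unk_token="[UNK]", max_vocab_size=50_000)
tokenizer.fit(hf_dataset)  # hf_dataset: placeholder for a 🤗 datasets Dataset with a "text" column

# After fit, the instance behaves like a regular PreTrainedTokenizerFast,
# so it can be passed to DataCollatorWithPadding
collator = DataCollatorWithPadding(tokenizer=tokenizer)
features = [tokenizer(text) for text in ["a short sentence", "a somewhat longer sentence"]]
batch = collator(features)  # input_ids padded to the same length with the [PAD] id
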
SaulLu commented 2 years ago

Thank you very much for your answer @pietrolesci. I'm glad to read your solution, it's always very interesting to see how you use the libraries and what difficulties you're facing!