Hello, have you taken a look at the following documentation? https://huggingface.co/transformers/fast_tokenizers.html
It showcases how to handle tokenizers from the `tokenizers` library within transformers. Let me know if it helps!
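In short, the pattern shown there is to build and train a tokenizer with the `tokenizers` library and then wrap it in a `PreTrainedTokenizerFast` through the `tokenizer_object` argument. A minimal sketch (the `WordLevel` model and the tiny training corpus below are placeholders, not the exact example from the docs):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer
from transformers import PreTrainedTokenizerFast

# Build and train a tokenizer with the standalone `tokenizers` library
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordLevelTrainer(special_tokens=["[PAD]", "[UNK]"])
tokenizer.train_from_iterator(["some example text", "another short line"], trainer=trainer)

# Wrap it so it can be used wherever transformers expects a tokenizer
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
```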
Hi @LysandreJik,
Thanks a lot for your swift reply.
That's exactly what I was looking for. It's a shame I did not find it before asking (I even tried to write my own way of subclassing `PreTrainedTokenizer` 😄!).
Once again, really thanks a lot for your help!
Best, Pietro
Hi @LysandreJik,
Just as feedback: running the example in the doc, I noticed that the special tokens are not directly transferred from the `Tokenizer` to the `PreTrainedTokenizerFast` (e.g., `unk_token`, `pad_token`).
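A minimal illustration of what I mean (assuming `tokenizer` is a small trained `Tokenizer` whose trainer listed `[PAD]` and `[UNK]` as special tokens):

```python
from transformers import PreTrainedTokenizerFast

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

# Neither attribute is populated from the underlying tokenizer
print(fast_tokenizer.pad_token)  # None
print(fast_tokenizer.unk_token)  # None
```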
I hope this can be useful.
Best, Pietro
Thanks for the heads-up, pinging @SaulLu and @sgugger for knowledge
Thanks for the feedback @pietrolesci! :hugs:
It makes me think that maybe we should explain this point in the documentation shared by LysandreJik, because indeed `PreTrainedTokenizer` has no way to automatically know which tokens of the tokenizer correspond to the `unk_token`, `cls_token`, etc.
But if you ever see an automatic way to do it, I'd be really happy to discuss it!
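For completeness, the manual route is to pass the special tokens explicitly when building the wrapper. A sketch, reusing a trained `tokenizer` object like the one above:

```python
from transformers import PreTrainedTokenizerFast

# The wrapper cannot infer the special tokens, so they are spelled out by hand
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
)
```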
Hi @SaulLu,
I agree with you that it's non-trivial to do that. I can share my big hack below. For context, I want to define a simple whitespace tokenizer. My hack is manually creating a `special_tokens_map` on the original tokenizer. The challenge is that even the underlying tokenizer does not store the named special tokens (apart from the `unk_token`, which is available in `tokenizer.model`).
I hope this helps.
Best, Pietro
```python
import os

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import Digits, Punctuation, Sequence, WhitespaceSplit
from tokenizers.trainers import WordLevelTrainer
from transformers import PreTrainedTokenizerFast


class WordTokenizer(PreTrainedTokenizerFast):
    def __init__(self, **kwargs):
        # NOTE: super().__init__ is deliberately deferred to `fit`, once the
        # underlying tokenizer has been trained
        self._tokenizer, self._trainer = self._build_tokenizer(**kwargs)
        os.environ["TOKENIZERS_PARALLELISM"] = "true"

    def _build_tokenizer(self, **kwargs):
        pad_token = kwargs.get("pad_token", "[PAD]")
        unk_token = kwargs.get("unk_token", "[UNK]")
        max_vocab_size = kwargs.get("max_vocab_size", 50_000)

        tokenizer = Tokenizer(WordLevel(unk_token=unk_token))
        tokenizer.normalizer = BertNormalizer()
        tokenizer.pre_tokenizer = Sequence([Digits(), Punctuation(), WhitespaceSplit()])
        trainer = WordLevelTrainer(
            vocab_size=max_vocab_size,
            special_tokens=[pad_token, unk_token],
        )

        # The hack: stash the named special tokens on the tokenizer object so
        # they can be copied onto the wrapper after training
        tokenizer.special_tokens_map = {"pad_token": pad_token, "unk_token": unk_token}
        return tokenizer, trainer

    @staticmethod
    def _batch_iterator(hf_dataset, batch_size, text_column):
        for i in range(0, len(hf_dataset), batch_size):
            yield hf_dataset[i : i + batch_size][text_column]

    def fit(self, hf_dataset, batch_size=1_000, text_column="text"):
        self._tokenizer.train_from_iterator(
            self._batch_iterator(hf_dataset, batch_size, text_column),
            trainer=self._trainer,
            length=len(hf_dataset),
        )
        # Only now initialize the PreTrainedTokenizerFast machinery and copy
        # the special tokens over manually
        super().__init__(tokenizer_object=self._tokenizer)
        setattr(self, "model_input_names", ["input_ids"])
        for k, v in self._tokenizer.special_tokens_map.items():
            setattr(self, k, v)
```
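And this is roughly how I then use it (a sketch: the toy dataset is made up, and the collator output assumes PyTorch is installed):

```python
from datasets import Dataset
from transformers import DataCollatorWithPadding

# Toy corpus just for illustration
dataset = Dataset.from_dict({"text": ["a first example", "a slightly longer second example"]})

word_tokenizer = WordTokenizer(pad_token="[PAD]", unk_token="[UNK]")
word_tokenizer.fit(dataset)

# After `fit`, the wrapper behaves like any other fast tokenizer
features = [word_tokenizer(text) for text in dataset["text"]]
collator = DataCollatorWithPadding(tokenizer=word_tokenizer, return_tensors="pt")
batch = collator(features)
print(batch["input_ids"])
```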
Thank you very much for your answer @pietrolesci. I'm glad to read your solution, it's always very interesting to see how you use the libraries and what difficulties you're facing!
Hi there,
I defined a simple whitespace tokenizer using the `tokenizers` library and I would like to integrate it with the transformers ecosystem. As an example, I would like to be able to use it with the `DataCollatorWithPadding`. Is there a way to easily (i.e., non-hacky) integrate tokenizers from the `tokenizers` library with the `PreTrainedTokenizer` class?
For reference, please find below the code for the whitespace tokenizer.
Thanks a lot in advance for your help.
Best, Pietro