huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Issue merging across whitespaces #1475

Closed henrycharlesworth closed 5 months ago

henrycharlesworth commented 6 months ago

My use case is that I have a corpus of instructions, e.g. taking a small snapshot:

push rbp <eoi>
push rbx <eoi>
mov rbx , rdi <eoi>
sub rsp , 0x10 <eoi>
mov rdi , qword PTR <STACK_ADDR> <eoi>

Each line is a separate "instruction", and I want to allow merges across the whole instruction (including whitespaces). My first attempt at this is:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Sequence as NormalizerSequence, Replace
from tokenizers.pre_tokenizers import Sequence as PreTokenizerSequence, Split
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
tokenizer.normalizer = NormalizerSequence([Replace(",", "")])
tokenizer.pre_tokenizer = PreTokenizerSequence([Split("\n", behavior="removed")])

trainer = BpeTrainer(vocab_size=10000, min_frequency=2, special_tokens=["<s>", "<pad>", "</s>", "<unk>"])
tokenizer.train(files=["experimenting/tokenizer_tmp/full_corpus_merged_split_0.txt"], trainer=trainer)

This basically seems to work, although there are a few issues. One is that the merge list in the saved json file is hard to interpret - which I suspect is related to the main issue: I can't load the tokenizer after training it.

When I run:

from transformers import PreTrainedTokenizerFast
loaded_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

I get an error: Exception: data did not match any variant of untagged enum ModelWrapper at line 11182 column 3

It seems related to https://github.com/huggingface/tokenizers/issues/566, which led me to https://github.com/huggingface/tokenizers/pull/909. That PR hasn't been merged, and it suggests that merges containing spaces probably aren't supported yet (without it). But things largely work up until the point where I try to load the tokenizer, which is strange.
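As a sanity check, here is a rough sketch (assuming the file was written with tokenizer.save("tokenizer.json")) that just prints the first few merges stored in the saved file; if they contain literal spaces, that would line up with the deserialization failure discussed in #909:

import json

# Peek at the merges stored in the saved tokenizer.json.
# Depending on the tokenizers version, each merge is either a single
# "left right" string or a [left, right] pair.
with open("tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

for merge in data["model"]["merges"][:20]:
    print(repr(merge))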

Does anyone have any suggestions for getting around this, or an alternative approach to do the same thing?

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

wrongbad commented 4 months ago

I was running into a similar issue. I also wanted to model the text whitespace/formatting, without "losing" it to the pre-tokenizer.

I think I found a workaround:

from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
tokenizer.decoder = decoders.Metaspace()

The Metaspace pre-tokenizer replaces spaces with a rare Unicode character (▁ by default), so the merges written to the json file never contain a literal space, which avoids the loading issue.
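Combining this with the setup from the first post, a rough, untested end-to-end sketch might look like the following (the corpus path and trainer settings are copied from that post; the exact Metaspace defaults, e.g. prefix handling, may need tweaking for the instruction format):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Replace
from tokenizers.pre_tokenizers import Metaspace, Sequence, Split
from tokenizers.decoders import Metaspace as MetaspaceDecoder
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

tokenizer = Tokenizer(BPE())
tokenizer.normalizer = Replace(",", "")
# Split on newlines first, then let Metaspace turn the remaining spaces
# into its placeholder character so merges can span "words".
tokenizer.pre_tokenizer = Sequence([Split("\n", behavior="removed"), Metaspace()])
tokenizer.decoder = MetaspaceDecoder()

trainer = BpeTrainer(vocab_size=10000, min_frequency=2, special_tokens=["<s>", "<pad>", "</s>", "<unk>"])
tokenizer.train(files=["experimenting/tokenizer_tmp/full_corpus_merged_split_0.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")

# Reloading should now work, since no merge contains a literal space.
loaded = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
print(loaded.tokenize("push rbp <eoi>"))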