Closed: henrycharlesworth closed this issue 5 months ago
I was running into a similar issue. I also wanted to model the text whitespace/formatting, without "losing" it to the pre-tokenizer.
I think I found a workaround:
```python
from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
tokenizer.decoder = decoders.Metaspace()
```
The Metaspace class uses a rare Unicode character (▁, U+2581) as a placeholder for spaces, which solves the JSON file reading issue.
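For illustration, a minimal end-to-end sketch of this setup (the corpus file name and trainer settings are placeholder assumptions):

```python
from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
tokenizer.decoder = decoders.Metaspace()

# Placeholder trainer settings and corpus file
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
tokenizer.train(files=["instructions.txt"], trainer=trainer)

# Round trip: the serialized merges contain "▁" instead of raw spaces
tokenizer.save("metaspace_tokenizer.json")
reloaded = Tokenizer.from_file("metaspace_tokenizer.json")

encoding = reloaded.encode("pick up the red block")
print(encoding.tokens)                # whitespace survives as "▁" markers
print(reloaded.decode(encoding.ids))  # the Metaspace decoder restores spaces
```

Because the serialized merges contain ▁ rather than literal spaces, the saved file loads back without the ModelWrapper error.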
My use case is that I have a corpus of instructions; e.g., here is a small snapshot:
Each line is a separate "instruction", and I want to allow merges across the whole instruction (including whitespace). My first attempt at this is:
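Roughly the following (a sketch: training BPE with no pre-tokenizer at all, so merges are free to span spaces; the vocab size, special tokens, and file name are placeholder assumptions):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# No pre-tokenizer is set, so merges can cross whitespace
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Placeholder settings and corpus file
trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])
tokenizer.train(files=["instructions.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```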
This basically seems to work, although there are a few issues. One is that the merge list in the saved JSON file is hard to interpret, which I suspect is related to the main issue: I can't load the tokenizer after training it.
When I run:
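```python
from tokenizers import Tokenizer

# file name assumed to match the path passed to tokenizer.save(...)
tokenizer = Tokenizer.from_file("tokenizer.json")
```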
I get an error:

```
Exception: data did not match any variant of untagged enum ModelWrapper at line 11182 column 3
```
This seems related to https://github.com/huggingface/tokenizers/issues/566, which led me to https://github.com/huggingface/tokenizers/pull/909. That PR hasn't been merged, and it suggests that merges containing spaces probably aren't supported yet without it. What's strange is that everything largely works up until the point where I try to load the tokenizer.
Does anyone have any suggestions for getting around this, or an alternative approach to do the same thing?