Feature request
As the vocabulary of newer models, like Llama 3 or Gemma, increases in size, so does the size of the tokenizer file, which stores the vocabulary as JSON (and merges, for BPE tokenizers). Pretty-printing these files during serialization introduces significant overhead, since whitespace is added around every vocabulary entry and merge.

This issue is even worse after the new BPE serialization update, which replaces merges like "s1 s2" with ["s1", "s2"], now formatted across separate lines.

From quick testing, not pretty-printing tokenizer.json reduces the file size from 17 MB to 7 MB.

Understandably, pretty-printing the file can help with debugging, but for those cases it would be better for the default output to be unformatted (with a flag for pretty-printed output).

cc @ArthurZucker (PS: I can move this to huggingface/tokenizers if it is more applicable there.)

Motivation

To reduce the file sizes (and bandwidth) of downloading, serializing, and uploading these files. In particular, this will greatly benefit Transformers.js users, where bandwidth is important.
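For illustration, a minimal sketch of the overhead in question using Python's json module (the toy vocabulary and merges below are hypothetical stand-ins for a real tokenizer.json; exact savings depend on the tokenizer, but the pattern is the same):

```python
import json

# Toy "vocab" and "merges" standing in for a real tokenizer.json.
# (Hypothetical data; a real Llama 3 / Gemma vocabulary has 128k+ entries.)
vocab = {f"token_{i}": i for i in range(10_000)}
merges = [[f"s{i}", f"s{i + 1}"] for i in range(10_000)]
tokenizer = {"vocab": vocab, "merges": merges}

# Pretty-printed: every vocab entry and each half of every merge pair
# lands on its own indented line.
pretty = json.dumps(tokenizer, indent=2)

# Compact: no indentation, no spaces after separators.
compact = json.dumps(tokenizer, separators=(",", ":"))

print(f"pretty:  {len(pretty):>9,} bytes")
print(f"compact: {len(compact):>9,} bytes")
```

The two outputs parse back to the identical object; only the whitespace differs, which is why a "compact by default, pretty behind a flag" scheme loses nothing.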
Your contribution
-