huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Disable pretty-print when saving tokenizer.json files #1656

Open xenova opened 1 month ago

xenova commented 1 month ago

Feature request

As the vocabularies of newer models like Llama 3 or Gemma grow, so does the size of tokenizer.json, which stores the vocabulary (and, for BPE tokenizers, the merges) as JSON. Pretty-printing these files during serialization adds significant overhead, since whitespace is inserted around every vocabulary and merge entry.

This issue is even worse after the new BPE serialization update, which replaces merges like "s1 s2" with ["s1", "s2"]; when pretty-printed, each pair is now spread across separate lines:

[Screenshot: pretty-printed merges section of tokenizer.json, with each ["s1", "s2"] pair spread across multiple lines]
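For reference, a rough sketch of the shape change in the merges entry (plain Python literals, values purely illustrative):

```python
# Illustrative only: how a merges entry in tokenizer.json changed shape.
# Before the update, each merge was a single space-separated string:
old_merges = ["s1 s2", "s3 s4"]

# After the update, each merge is a pair of strings, which pretty-printing
# then spreads over several lines per merge:
new_merges = [["s1", "s2"], ["s3", "s4"]]
```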

From quick testing, not pretty-printing the tokenizer.json reduces the file size from 17MB to 7MB.
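A rough way to reproduce such a comparison, as a minimal sketch assuming the Python tokenizers bindings and their to_str(pretty=...) method (the model identifier below is only a placeholder):

```python
from tokenizers import Tokenizer

# Any large-vocabulary tokenizer will do; the identifier is a placeholder.
tokenizer = Tokenizer.from_pretrained("some-org/some-large-vocab-model")

pretty = tokenizer.to_str(pretty=True)    # indented, whitespace-heavy JSON
compact = tokenizer.to_str(pretty=False)  # single-line JSON

# Approximate sizes in MB (bytes ~= chars for mostly-ASCII JSON).
print(f"pretty : {len(pretty) / 1e6:.1f} MB")
print(f"compact: {len(compact) / 1e6:.1f} MB")
```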

Understandably, pretty-printing the file can help with debugging, but for those cases it would probably be better to make unformatted output the default and provide a flag for pretty-printed output.

cc @ArthurZucker (PS: I can move this to huggingface/tokenizers if it is more applicable there.)

Motivation

To reduce the size of these files and the bandwidth needed to download, serialize, and upload them. In particular, this will greatly benefit Transformers.js users, for whom bandwidth is especially important.

Your contribution

-

ArthurZucker commented 1 month ago

We already have a pretty argument in tokenizers, but we should give it a bit more granularity.
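For context, that flag is already exposed through the Python bindings; a hedged sketch of how it can be used today (keyword name taken from the current save signature, defaults may differ across versions):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

# Re-save without pretty-printing to get a compact, single-line JSON file.
tokenizer.save("tokenizer.json", pretty=False)
```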