Closed · blester125 closed this 4 months ago
One weird aspect is that it uses prefixes like giga or tera "tokens" instead of a billion, a trillion, etc.
That's just what tqdm defaults to if you use `unit_scale=True`, iirc.
You're right, it's not really configurable in dolma though: https://github.com/allenai/dolma/blob/64886d9db15bd99acea9e28740ae20a510875dfb/python/dolma/core/parallel.py#L268
I think it's fine to leave it as is lol; plus `unit_divisor` defaults to 1000 (https://github.com/tqdm/tqdm?tab=readme-ov-file#parameters), so it'll give us the right values :woman_shrugging:
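For reference, the scaling tqdm applies when `unit_scale=True` can be checked directly via its `format_sizeof` helper; with the default `unit_divisor=1000` it produces SI prefixes (k, M, G, T, ...), which is where the "giga tokens" display comes from. A quick sketch:

```python
from tqdm import tqdm

# format_sizeof is the static helper tqdm uses to render scaled counts.
# With the default divisor of 1000, 1_234_000_000 scales to the "G" prefix.
print(tqdm.format_sizeof(1_234_000_000))           # giga-scale, e.g. "1.23G"
print(tqdm.format_sizeof(5_600_000, suffix="tok")) # mega-scale, with a unit suffix
```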
Very cool, thank you for doing this. Since "token" is sort of an overloaded term, and since I'm suggesting you also report the decompressed text-only (non-json-overhead) byte count, perhaps we can call this something about "size statistics" instead of tokens.
This PR adds a new dolma processor that can be used to count the number of (whitespace delimited) tokens in a data source.
You can point it at a directory and it will find all the `.jsonl.gz` files in it or its subdirectories. It can be used from the root dir via `python -m licensed_pile.count` or from anywhere with `count-tokens-dolma`. One weird aspect is that it uses prefixes like giga or tera "tokens" instead of a billion, a trillion, etc.
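The core of what the processor computes can be sketched without dolma's parallelism: walk a directory for `.jsonl.gz` files, decompress, and count whitespace-delimited tokens per record (plus the decompressed text-only byte count suggested above). This is a minimal standalone sketch, not the actual processor; the `"text"` field name is an assumption about the record schema.

```python
import gzip
import json
from pathlib import Path

def count_tokens(root: str, text_field: str = "text") -> tuple[int, int]:
    """Count whitespace-delimited tokens and decompressed text-only bytes
    across every .jsonl.gz file under `root`, recursively.

    `text_field` is assumed to be the key holding the document text.
    """
    tokens = 0
    text_bytes = 0
    for path in Path(root).rglob("*.jsonl.gz"):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                text = json.loads(line).get(text_field, "")
                # str.split() with no argument splits on any run of whitespace.
                tokens += len(text.split())
                # Bytes of the text alone, excluding JSON key/quote overhead.
                text_bytes += len(text.encode("utf-8"))
    return tokens, text_bytes
```

The real processor fans this work out over files via dolma's parallel machinery; the per-record logic is the same idea.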