r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License

Add Dolma Counter script #85

Closed blester125 closed 4 months ago

blester125 commented 4 months ago

This PR adds a new dolma processor that can be used to count the number of (whitespace-delimited) tokens in a data source.

You can point it at a directory and it will find all the .jsonl.gz files in it or its subdirectories.
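For readers who want a feel for what such a count involves, here is a minimal standalone sketch. It is not the processor added in this PR and does not use dolma's parallel machinery; the function name count_tokens and the text_key parameter are hypothetical, and it assumes each line of a .jsonl.gz file is a JSON object whose document body lives under a "text" key.

```python
import gzip
import json
from pathlib import Path


def count_tokens(root, text_key="text"):
    """Sum whitespace-delimited tokens over every .jsonl.gz file under root.

    Hypothetical helper for illustration; assumes one JSON object per line
    with the document body stored under `text_key`.
    """
    total = 0
    # rglob matches files in root itself and in all of its subdirectories.
    for path in sorted(Path(root).rglob("*.jsonl.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)
                # str.split() with no arguments splits on any run of whitespace.
                total += len(record.get(text_key, "").split())
    return total
```

Calling count_tokens("path/to/data") would return a single integer for the whole directory tree; records without the text key contribute zero.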

It can be used from the root dir via python -m licensed_pile.count or from anywhere with count-tokens-dolma.

One weird aspect is that it uses prefixes like giga or tera "tokens" instead of a billion, a trillion, etc.

baberabb commented 4 months ago

One weird aspect is that it uses prefixes like giga or tera "tokens" instead of a billion, a trillion, etc.

That's just what tqdm defaults to if you use unit_scale=True iirc

blester125 commented 4 months ago

You're right, it's not really configurable in dolma tho https://github.com/allenai/dolma/blob/64886d9db15bd99acea9e28740ae20a510875dfb/python/dolma/core/parallel.py#L268

I think it's fine to leave it as is lol, plus the unit_divisor defaults to 1000 https://github.com/tqdm/tqdm?tab=readme-ov-file#parameters so it'll give us the right values :woman_shrugging:
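To see why a divisor of 1000 makes "giga" line up with "billion", here is a rough sketch of SI-prefix scaling. This is not tqdm's code and its exact output formatting may differ; si_format is a hypothetical helper written only to illustrate the arithmetic.

```python
def si_format(n, divisor=1000):
    """Format n with an SI prefix by repeatedly dividing by `divisor`.

    Hypothetical illustration only, not tqdm's implementation.
    """
    for prefix in ["", "k", "M", "G", "T", "P"]:
        if abs(n) < divisor:
            return f"{n:.2f}{prefix}"
        n /= divisor
    return f"{n:.2f}E"


print(si_format(1_500_000_000))      # 1.50G, i.e. 1.5 billion
print(si_format(2_000_000_000_000))  # 2.00T, i.e. 2 trillion
print(si_format(10**9, 1024))        # 953.67M, a billion no longer reads as 1G
```

With a divisor of 1024 the prefixes would stop matching the decimal names, which is why the default of 1000 is the convenient one here.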

craffel commented 4 months ago

Very cool, thank you for doing this. Since "token" is sort of an overloaded term, and since I'm suggesting you also report the decompressed text-only (non-json-overhead) byte count, perhaps we can call this something about "size statistics" instead of tokens.
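One possible shape for such "size statistics", sketched under assumptions: the function name size_stats, the returned field names, and the "text" key are all hypothetical and not taken from this PR. It reports the UTF-8 byte count of the text alone next to the byte count of the full decompressed JSON line, so the JSON overhead is visible.

```python
import json


def size_stats(lines, text_key="text"):
    """Collect document, token, and byte counts from decompressed JSON lines.

    Hypothetical sketch; `lines` is any iterable of JSON strings, one
    document per string, with the body stored under `text_key`.
    """
    stats = {"documents": 0, "tokens": 0, "text_bytes": 0, "json_bytes": 0}
    for line in lines:
        record = json.loads(line)
        text = record.get(text_key, "")
        stats["documents"] += 1
        # Whitespace-delimited tokens, as in the counter described above.
        stats["tokens"] += len(text.split())
        # Bytes of the text alone, excluding keys, quotes, and other fields.
        stats["text_bytes"] += len(text.encode("utf-8"))
        # Bytes of the whole decompressed JSON line, overhead included.
        stats["json_bytes"] += len(line.encode("utf-8"))
    return stats
```

Reporting bytes alongside whitespace tokens sidesteps the ambiguity of "token", since byte counts do not depend on any tokenizer.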