huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0
8.92k stars, 776 forks

❓Get stats (e.g. counts) about the merged pairs #1523

Closed: pietrolesci closed this issue 3 months ago

pietrolesci commented 4 months ago

Hi there,

I was wondering whether there is an easy way to ask the tokeniser trainer to return the count (or frequency) of each pair at the moment the merge decision is made. The ideal output would be a new column in the merges.txt file holding the frequency of the pair.

Tagging the line below, as it seems to be the relevant part of the code that has this info (i.e., pair_counts):

https://github.com/huggingface/tokenizers/blob/25aee8b88c8de3c5a52e2f9cb6281d6df00ad516/tokenizers/src/models/bpe/trainer.rs#L465

Thanks a lot for your help!
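To make the request concrete, here is a minimal sketch of a toy BPE training loop in plain Python that records each pair's count at the moment it is merged. It is not the tokenizers trainer (which lives in Rust) and does not use its API. The names `train_bpe_with_counts` and `format_merges` are hypothetical, and the tie-breaking rule (lexicographically smallest pair) is an assumption chosen only to keep the output deterministic.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` in a symbol tuple with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe_with_counts(word_freqs, num_merges):
    """Toy BPE loop: returns (left, right, count_at_merge_time) per merge."""
    words = Counter()
    for word, freq in word_freqs.items():
        words[tuple(word)] += freq  # start from single characters

    merges = []
    for _ in range(num_merges):
        counts = count_pairs(words)
        if not counts:
            break
        # Highest count wins; ties go to the lexicographically smallest pair
        # (an assumption for determinism, not the library's rule).
        best = min(counts, key=lambda p: (-counts[p], p))
        merges.append((best[0], best[1], counts[best]))

        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def format_merges(merges):
    """Render merges as merges.txt-style lines with a third count column."""
    return "\n".join(f"{a} {b} {c}" for a, b, c in merges)
```

On the toy corpus `{"low": 5, "lower": 2, "newest": 6, "widest": 3}` with three merges, `format_merges` yields the lines `e s 9`, `es t 9`, and `l o 7`, i.e. the usual two merge columns plus the frequency column described above.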

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

ArthurZucker commented 3 months ago

Would something like a log do the trick for you, @pietrolesci? I plan on adding a log helper to help non-Rust users debug the internals.

pietrolesci commented 3 months ago

Hi @ArthurZucker, thanks for replying in the thread!

In the end, I managed to hack my way into logging (i) all the possible merges and (ii) the merges actually applied. In my case, this is useful for studying the effect of different merges. It would be great if there were a simple flag to log this information directly to a text file. I don't think anything more complicated is required for my use case.
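As a rough illustration of the kind of log described here (not the actual hack, which is not shown in this thread, and not a tokenizers feature), the sketch below runs a toy BPE loop in plain Python and writes one tab-separated line per candidate pair and one per applied merge. The function name `log_bpe_training`, the line format, and the tie-breaking rule are all assumptions made for the example.

```python
import io
from collections import Counter


def log_bpe_training(word_freqs, num_merges, log):
    """Toy BPE loop that writes candidate pairs and applied merges to `log`.

    `log` is any writable text stream (an open file, io.StringIO, ...).
    Each line is: step <TAB> kind <TAB> "left right" <TAB> count.
    """
    words = Counter({tuple(w): f for w, f in word_freqs.items()})
    for step in range(num_merges):
        # Count adjacent pairs, weighted by word frequency.
        counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                counts[pair] += freq
        if not counts:
            break

        # Rank by descending count; ties broken lexicographically (assumption).
        ranked = sorted(counts, key=lambda p: (-counts[p], p))
        for a, b in ranked:
            log.write(f"{step}\tcandidate\t{a} {b}\t{counts[(a, b)]}\n")
        best = ranked[0]
        log.write(f"{step}\tapplied\t{best[0]} {best[1]}\t{counts[best]}\n")

        # Apply the chosen merge to every word.
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(best[0] + best[1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged


if __name__ == "__main__":
    buffer = io.StringIO()  # swap for open("merge_log.tsv", "w") to log to disk
    log_bpe_training({"aab": 2, "ab": 1}, 1, buffer)
    print(buffer.getvalue(), end="")
```

Writing to a generic text stream keeps the sketch independent of where the log ends up; a "simple flag" as suggested above would only need to decide whether such a stream is opened.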