Closed pietrolesci closed 3 months ago
Would something like a log do the trick for you, @pietrolesci? I plan on adding a log helper so that non-Rust users can debug the internals.
Hi @ArthurZucker, thanks for replying in the thread!
In the end, I managed to hack my way into logging (i) all the possible merges and (ii) the merges actually applied. In my case, this is useful for studying the effect of different merges. It would be great if there were a simple flag to log this information directly to a text file. For my use case, I don't think anything more complicated is required.
Hi there,
I was wondering whether there is an easy way to ask the tokeniser trainer to return the count (or frequency) of each pair at the moment the merge decision is made. The ideal output would be a new column in the `merges.txt` file holding the frequency of the pair.

Tagging the line below, as it seems to be the relevant part of the code that has this info (i.e., `pair_counts`):
https://github.com/huggingface/tokenizers/blob/25aee8b88c8de3c5a52e2f9cb6281d6df00ad516/tokenizers/src/models/bpe/trainer.rs#L465
Thanks a lot for your help!
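To make the request concrete, here is a minimal pure-Python sketch of the kind of output I have in mind. It is a toy BPE trainer, not the `tokenizers` Rust implementation: the names `train_bpe_with_counts` and `format_merges` are hypothetical, and the lexicographic tie-breaking is my own assumption and may not match what the library does.

```python
from collections import Counter


def train_bpe_with_counts(word_freqs, num_merges):
    """Toy BPE trainer: returns (left, right, count) for each merge,
    where count is the pair frequency at the moment the merge is chosen."""
    # Each word starts as a tuple of single-character symbols.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []

    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break

        # Most frequent pair wins; ties are broken lexicographically
        # (an assumption made here only to keep the toy deterministic).
        (left, right), count = min(
            pair_counts.items(), key=lambda kv: (-kv[1], kv[0])
        )
        merges.append((left, right, count))

        # Apply the chosen merge to every word.
        new_words = {}
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if (
                    i + 1 < len(symbols)
                    and symbols[i] == left
                    and symbols[i + 1] == right
                ):
                    merged.append(left + right)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_words[key] = new_words.get(key, 0) + freq
        words = new_words

    return merges


def format_merges(merges):
    """Render merges one per line, with the count as an extra last column."""
    return "\n".join(f"{left} {right} {count}" for left, right, count in merges)


if __name__ == "__main__":
    corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    print(format_merges(train_bpe_with_counts(corpus, num_merges=3)))
```

Each line of the result holds the two merged symbols followed by the pair count, which is the extra column I would like to see next to each merge.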