Closed zamazan4ik closed 7 months ago
Thanks for opening this.
If I read correctly, the improvements are in the 5-10% range, correct? Overall that's nice, but those benchmarks are not really representative enough to be used for this currently.
The reason is that `tokenizers` is made super modular (in order to support many different kinds of tokenizers, pretty much all of ML), and performance is highly dependent on the chosen combination of normalizers/pre_tokenizers/models. Therefore I wouldn't use PGO just yet.
If you care about tokenizer performance that much (in ML, tokenization is now a mostly negligible part of the runtime since it's not in Python anymore), I encourage you to look at https://github.com/microsoft/BlingFire, which claims even faster tokenization (the fastest claim I'm aware of). There are also other libraries out there which claim faster performance.
`tokenizers`, being very general, cannot be the fastest library compared to highly specialized code for a given tokenizer. In the realm of LMs, though, it shouldn't matter that much anymore.
> If I read correctly, the improvements are in the 5-10% range, correct?
In general, yes, you are right. However, in some tests, such as "BPE GPT2 encode, no cache", the improvements are up to 20%.
> but those benchmarks are not really representative enough to be used currently.
Hmm, that's interesting. What is the current purpose of these benchmarks?
> Therefore I wouldn't use PGO just yet.
Fair point. Even if you don't want to integrate PGO into the Tokenizers build pipeline with some predefined PGO workload, that's completely fine; I understand the difficulty of that path. At the very least, the numbers above could be interesting to Tokenizers users who care about performance and have no way/time/money to switch to another tokenizer implementation. I hope the results are visible enough in this issue :)
Thanks a lot for the links to other tokenizers - I will try to optimize them with PGO as well.
> Hmm, that's interesting. What is the current purpose of these benchmarks?
Well, to get an idea of how `tokenizers` performs on a particularly useful task; not enough to guide PGO :) And yes, performance is most likely biased toward that particular tokenizer (it is, in general, biased toward space-separated tokenizers, which are less and less used).
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Hi!
Writing this for the history. Maybe these results will be interesting to someone who is trying to achieve better performance with `tokenizers`, since the project cares about performance. I test Profile-Guided Optimization (PGO) on different kinds of software; the current results are available here (with a lot of other PGO-related information). That's why I tried to optimize `tokenizers` with PGO too.

Test environment
I performed the tests on my Linux-based machine.

- OS: Linux
- Tokenizers: `main` branch on commit `f1c23b868006ee27acdd31796677f82fa10d6bd7`
Benchmarks
As a benchmark, I use the built-in benchmarks via the `cargo bench -- --verbose` command from the Makefile (if you want to reproduce my results, please check https://github.com/huggingface/tokenizers/issues/1425 first). For the PGO training phase, I use the same benchmark with `cargo pgo bench -- --verbose`. For the PGO optimization phase, I use cargo-pgo with `cargo pgo optimize bench -- --verbose`.
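For reference, the whole workflow can be sketched roughly as follows. This is a sketch under the assumption that a rustup-based Rust toolchain, cargo-pgo, and network access are available; the checkout path reflects the repository layout at the time of writing and may differ:

```shell
# One-time setup for cargo-pgo (needs the LLVM profiling tools).
cargo install cargo-pgo
rustup component add llvm-tools-preview

# Get the exact revision benchmarked in this report.
git clone https://github.com/huggingface/tokenizers
cd tokenizers/tokenizers   # the Rust crate lives in this subdirectory
git checkout f1c23b868006ee27acdd31796677f82fa10d6bd7

# Training phase: build an instrumented binary, run the built-in
# benchmarks, and collect runtime profiles.
cargo pgo bench -- --verbose

# Optimization phase: rebuild using the collected profiles and re-run
# the benchmarks to compare against the non-PGO numbers.
cargo pgo optimize bench -- --verbose
```

Note that the training workload here is the benchmark suite itself, which is exactly the "benchmarks are not representative" caveat raised above: a production PGO setup would want a workload matching the actual normalizer/pre_tokenizer/model combination in use.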
Results
I got the following results:
As you can see, in general the Tokenizers performance can be improved with PGO. I think this information could be added somewhere in the documentation, so users are aware of PGO's effects on Tokenizers performance and can decide whether to apply PGO to their own Tokenizers builds.
I already see some PGO mentions in the CI scripts, but it's not clear whether the Tokenizers packages are PGO-optimized or not. As far as I can tell from the build scripts, they are not (but I could be wrong - please correct me if so).
Please treat this issue just as a benchmark report - it's not an actual error, crash, or anything like that.