huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Profile-Guided Optimization (PGO) benchmark results #1426

Closed zamazan4ik closed 7 months ago

zamazan4ik commented 8 months ago

Hi!

Writing this for the record. Maybe these results will be interesting to someone who is trying to achieve better performance with tokenizers, since the project cares about performance.

I test Profile-Guided Optimization (PGO) on different kinds of software - the current results are available here (along with a lot of other PGO-related information). That's why I tried to optimize tokenizers with PGO too.

Test environment

I performed tests on my Linux-based machine.

Linux:

Benchmarks

As a benchmark, I use the built-in benchmarks via the `cargo bench -- --verbose` command from the Makefile (if you want to reproduce my results, please check https://github.com/huggingface/tokenizers/issues/1425 first). I use cargo-pgo for both phases: for the PGO training phase, I run the same benchmark with `cargo pgo bench -- --verbose`, and for the PGO optimization phase, `cargo pgo optimize bench -- --verbose`.

Results

I got the following results:

As you can see, in general the Tokenizers' performance can be improved with PGO. I think this information could be added to the documentation somewhere, so users are aware of PGO's effect on Tokenizers' performance and can decide whether to apply PGO to their own Tokenizers builds.

I already see some PGO mentions in the CI scripts, but it's not clear whether the released Tokenizers packages are PGO-optimized. As far as I can tell from the build scripts, they are not (but I could be wrong - please correct me if so).

Please treat this issue just as a benchmark report - it's not an actual error, crash, or anything like that.

Narsil commented 8 months ago

Thanks for opening this.

If I read correctly, the improvements are in the 5-10% range, correct? Overall that's nice, but those benchmarks are not really representative enough to be used for this currently.

The reason is that tokenizers is made to be super modular (in order to support many different kinds of tokenizers - pretty much all of those used in ML). Performance is highly dependent on the chosen combination of normalizers/pre_tokenizers/models. Therefore I wouldn't use PGO just yet.
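To illustrate the modularity being described: in the library's Python bindings, a tokenizer is assembled from interchangeable normalizer/pre-tokenizer/model components, so two tokenizers can exercise completely different code paths. A minimal sketch (the tiny vocabulary here is made up for illustration):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical toy vocabulary, just to show the pipeline assembly.
vocab = {"hello": 0, "world": 1, "[UNK]": 2}

# Each component below could be swapped (e.g. BPE or WordPiece as the
# model, NFKC as the normalizer), changing which code paths run.
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.normalizer = Lowercase()
tok.pre_tokenizer = Whitespace()

enc = tok.encode("Hello WORLD")
print(enc.tokens)  # ['hello', 'world']
print(enc.ids)     # [0, 1]
```

Since a PGO profile collected with one such combination may not match another, a single training workload is hard to make representative.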

If you care about tokenizer performance that much (in ML the tokenizer's runtime is now mostly negligible, since it's not Python anymore), I encourage you to look at https://github.com/microsoft/BlingFire, which claims even faster tokenization (the fastest claim I'm aware of). There are also other libraries out there which claim faster performance.

tokenizers, being very general, cannot be the fastest library compared to highly specialized code for a given tokenizer. In the realm of LMs, though, it shouldn't matter that much anymore.

zamazan4ik commented 8 months ago

If I read correctly, the improvements are in the 5-10% range, correct ?

In general - yes, you are right. However, in some tests, like "BPE GPT2 encode, no cache", the improvements are up to 20%.

but those benchmarks are not really representative enough to be used currently.

Hmm, it's interesting. What is the current purpose of these benchmarks?

Therefore I wouldn't use PGO just yet.

Fair point. Even if you don't want to integrate PGO into the Tokenizers build pipeline with some predefined PGO workload, that's completely fine - I understand the difficulty of that path. At least the numbers above could be interesting for Tokenizers users who care about performance and have no way/time/money to switch to another tokenizer implementation. I hope the results are visible enough in this issue :)

Thanks a lot for the links to other tokenizers - I will try to optimize them with PGO as well.

Narsil commented 8 months ago

Hmm, it's interesting. What is the current purpose of these benchmarks?

Well, to get an idea of how tokenizers performs on a particularly useful task - not enough to guide PGO :) And yes, the measured performance is most likely biased towards that particular tokenizer (in general it is biased towards space-separated tokenizers, which are less and less used).

github-actions[bot] commented 7 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.