noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

Any suggestions to speed up? #44

Closed · pfZhu closed this issue 7 months ago

pfZhu commented 9 months ago

In the scenario of schemaless JSON generation, each call to the prefix_allowed_fn function takes 0.03 seconds on average in my tests, which significantly slows down the whole generation process. Which part is the most time-consuming? Do you have any suggestions to make prefix_allowed_fn at least 10x faster, short of re-implementing it in C++?
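(Editorial sketch: the 0.03 s figure above can be reproduced with a simple timing harness around the callback. Everything below is a hypothetical stand-in — slow_prefix_allowed_fn is not the library's real callback, just a placeholder with the same (batch_id, input_ids) shape that HuggingFace-style prefix_allowed_tokens_fn callbacks use.)

```python
import time

def slow_prefix_allowed_fn(batch_id, input_ids):
    # Hypothetical stand-in for the real token-enforcer callback.
    return list(range(10))

def average_call_time(fn, n_calls=100):
    """Time n_calls invocations and return the mean seconds per call."""
    start = time.perf_counter()
    for step in range(n_calls):
        # Simulate a growing prompt, as during autoregressive generation.
        fn(0, list(range(step)))
    return (time.perf_counter() - start) / n_calls

mean = average_call_time(slow_prefix_allowed_fn)
print(f"average prefix_allowed_fn time: {mean * 1000:.3f} ms")
```

Measuring the mean this way makes it easy to compare library versions on the same workload.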

noamgat commented 9 months ago

I just merged this PR: https://github.com/noamgat/lm-format-enforcer/pull/45, which runs the full token enforcer flow in all unit tests. This makes the tests much slower, but it allows profiling performance in a real-world scenario (llama2 tokenizer + tree traversal) in order to improve it. I hope there are things that can be done short of re-implementing in C; I will have a look at the unit tests' performance soon.

noamgat commented 9 months ago

[profiling screenshot]

noamgat commented 9 months ago

Quite surprisingly, for long responses, the decoder is the most expensive operation. I will investigate how this can be mitigated. If you want to take a stab at it, see test_long_json_object() in the unit tests.
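(Editorial sketch: the finding above is the kind of result cProfile surfaces. The snippet below shows the profiling pattern on stand-in functions — decode_tokens and long_generation are hypothetical placeholders, not the library's code or its test.)

```python
import cProfile
import io
import pstats

def decode_tokens(token_ids):
    # Stand-in for an expensive tokenizer.decode() call.
    return "".join(chr(97 + (t % 26)) for t in token_ids)

def long_generation():
    # Stand-in for generating a long response token by token,
    # decoding the growing sequence at every step.
    text = ""
    for step in range(1, 300):
        text = decode_tokens(list(range(step)))
    return text

profiler = cProfile.Profile()
profiler.enable()
long_generation()
profiler.disable()

# Sort by cumulative time to see which call dominates the run.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time is what makes a repeated-decode hotspot stand out at the top of the report.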

noamgat commented 9 months ago

I pushed a change (currently still in a feature branch; it needs some cleanup).

Can you try it with:

pip install git+https://github.com/noamgat/lm-format-enforcer.git@feature/decoding_optimization

And tell me if you see an improvement?

noamgat commented 9 months ago

[profiling screenshot] For the same unit test, this is the current result on the branch: almost 10x faster (24s -> 2.8s).

noamgat commented 9 months ago

v0.8.0 was just released with several performance improvements; see the changelog. Can you check whether updating (and using the TokenEnforcerTokenizerData concept) improves performance for your use case?
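(Editorial sketch: the idea behind a shared tokenizer-data object is to do the expensive per-vocabulary work once and reuse it across enforcer instances. The classes below are simplified stand-ins illustrating that build-once/reuse pattern — they are not the library's actual TokenEnforcerTokenizerData implementation.)

```python
class TokenizerData:
    """Precomputes per-vocabulary structures once so they can be shared."""

    def __init__(self, vocab):
        self.vocab = vocab
        # Expensive one-time work, e.g. decoding every token in the
        # vocabulary up front instead of on every generation.
        self.decoded = {token_id: token for token_id, token in enumerate(vocab)}

class Enforcer:
    """Per-generation object that reuses the shared precomputed data."""

    def __init__(self, tokenizer_data):
        self.data = tokenizer_data

# Build the heavy structure once...
vocab = [f"tok{i}" for i in range(1000)]
shared = TokenizerData(vocab)

# ...then reuse it for every generation instead of rebuilding it.
enforcers = [Enforcer(shared) for _ in range(3)]
print(all(e.data is shared for e in enforcers))  # → True
```

With a 200,000-token vocabulary the one-time build cost is amortized over all subsequent generations rather than paid on each request.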

pfZhu commented 9 months ago

@noamgat Thank you for all the effort! I am testing the latest version's performance now and will give you feedback soon 😄

pfZhu commented 9 months ago

@noamgat Hi, in my tests the latest version is only slightly faster, about 23 ms per call. The first call, and sometimes calls in the middle of the generation process, take more than 100 ms, which significantly inflates the average call time. It is worth noting that my vocabulary has more than 200,000 tokens.
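(Editorial sketch: spikes like the >100 ms first call above are invisible in a plain average, so it helps to record per-call latencies and report the mean alongside the maximum. The harness below is generic illustration code, not tied to the library.)

```python
import statistics
import time

def timed_calls(fn, n_calls=200):
    """Collect per-call latencies so spikes are visible, not just the mean."""
    samples = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples

samples = timed_calls(lambda: sum(range(1000)))
print(f"mean: {statistics.mean(samples) * 1000:.3f} ms, "
      f"max: {max(samples) * 1000:.3f} ms")
```

If the maximum dwarfs the mean, the cost is concentrated in a few calls (e.g. one-time cache building) rather than spread evenly, which points at different optimizations.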

noamgat commented 9 months ago

Is this schemaless? Is there a way I can reproduce your results?

pfZhu commented 9 months ago

@noamgat Yes, it is schemaless. Also, my multilingual vocabulary contains many Chinese and English tokens. I will try to provide sample code to reproduce it.

noamgat commented 7 months ago

Closing the issue due to inactivity. If you attach a reproduction, please reopen.