noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License
1.42k stars 65 forks source link

Fix slowdown after generating backslash escape sequences #91

Closed AriX closed 5 months ago

AriX commented 5 months ago

Addresses an issue I've seen which I believe is the same as what's reported in #90: an edge case where JSONSchemaParser's shortcut_key implementation did not consider the correct parser while finishing parsing an escape sequence. When the UnionParser that StringParsingState pops onto the stack after getting a BACKSLASH is finished but not yet removed from the stack, shortcut_key fails to return json_freetext.

noamgat commented 5 months ago

OK, now I think I understand the problem you reported and the solution you proposed. However I think your specific solution may create illegal outputs, as if the topmost parser still has potential characters that can be inserted, we are not necessarily returning the correct results. I think the safe thing to do, which will also fix the performance problem, is to check (iteratively) if the topmost parser is can_end=True and get_allowed_characters() is empty - because that means that it is effectively done and will be ignored.

Better yet, this check should happen not inside the shortcut key, but during the jsonschemaparser itself - if a member in the stack is in can_end=True and get_allowed_characters()=empty, it shouldn't be there anymore because we know what will happen next time. I think this is the correct solution for this problem. If you want, you can implement it that way. If not, I will try to get to it.

AriX commented 5 months ago

I think the safe thing to do, which will also fix the performance problem, is to check (iteratively) if the topmost parser is can_end=True and get_allowed_characters() is empty - because that means that it is effectively done and will be ignored.

Good point! I just implemented this.

Better yet, this check should happen not inside the shortcut key, but during the jsonschemaparser itself - if a member in the stack is in can_end=True and get_allowed_characters()=empty, it shouldn't be there anymore because we know what will happen next time. I think this is the correct solution for this problem.

That sounds very reasonable but I'm not sure exactly where in the lifecycle of JSONSchemaParser you had mind for this check to happen.

For now, I implemented the former change. If you want to give more specific guidance I'm happy to try the latter.

AriX commented 5 months ago

However I think your specific solution may create illegal outputs, as if the topmost parser still has potential characters that can be inserted, we are not necessarily returning the correct results.

On 2nd thought, I'm not totally sure this is true. If the topmost parser's "can end" is true, then the JSONSchemaParser's get_allowed_characters() already includes the allowed characters of the next parser, without regard for whether or not the topmost parser has allowed characters left. It seems correct for the logic in shortcut_key() should match the logic in get_allowed_characters().

Let me know if I'm missing something.

noamgat commented 5 months ago

get_allowed_characters() not of the JSONSchemaParser, but of a specific object in its stack. If the specific parser is "finished but not removed", it will have can_end=True and get_allowed_character=''

I've implemented it in this branch: https://github.com/noamgat/lm-format-enforcer/tree/feature/escaping_performance

can you check if it solves the performance issue you are encountering?

AriX commented 5 months ago

get_allowed_characters() not of the JSONSchemaParser, but of a specific object in its stack. If the specific parser is "finished but not removed", it will have can_end=True and get_allowed_character=''

Yeah, I got that. I guess I understand you original point now, in that if the topmost parser has some characters that are allowed but wouldn't be allowed by json_freetext, we might prevent those from being generated.

can you check if it solves the performance issue you are encountering?

Yes, it will take me a few minutes as I don't use lm-format-enforcer directly but maintain a port of it on the side

noamgat commented 5 months ago

You can also see the diff here: https://github.com/noamgat/lm-format-enforcer/commit/d5dcdc54354d75de4b7f1ffcacbf02516fd9fe0a and apply it locally

AriX commented 5 months ago

You can also see the diff here: https://github.com/noamgat/lm-format-enforcer/commit/d5dcdc54354d75de4b7f1ffcacbf02516fd9fe0a and apply it locally

My port is in Swift so applying the patches takes some work :) May open-source it some day.

You can also see the diff here: https://github.com/noamgat/lm-format-enforcer/commit/d5dcdc54354d75de4b7f1ffcacbf02516fd9fe0a and apply it locally

This change doesn't work for me; in my tests it strips StringParserState off of the stack too early, preventing its parsedString from being read.

I can't reproduce this in the Python implementation. I suspect the reason is because my version has a stricter whitespace policy, which means after reading a string my string parser has 0 allowed characters where this implementation has whitespace in its allowed characters.

So, for my use case I think I need to go with the version that alters the shortcut_key() implementation to check for can_end() and empty get_allowed_characters().

But, of course I defer to you if you'd prefer to use an implementation like what you have in feature/escaping_performance here.

noamgat commented 5 months ago

You're right, added handling of the StringParser in this commit: https://github.com/noamgat/lm-format-enforcer/commit/34032f6b34c433545c54690223752a4bd21fd058 Unit tests now pass

noamgat commented 5 months ago

I merged the holistic solution together with the string parser fix to master. Thank you so much for spotting this issue, the unit test suite now runs twice as fast in the CI! This is a great find. I will be closing this PR soon.

AriX commented 5 months ago

Thank you so much for spotting this issue, the unit test suite now runs twice as fast in the CI! This is a great find.

That's great - glad I could help!