noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

WHITESPACE_CHARACTERS customisation is incompatible with list parser logic. #107

Closed JoshC8C7 closed 4 months ago

JoshC8C7 commented 4 months ago

Overriding jsonschemaparser.WHITESPACE_CHARACTERS to get more concise generation is a highly desirable feature, especially when using models that like outputting newlines. This is basically already possible by monkeypatching, but it can't be used in production because of one obstacle, which could potentially be removed. The issue is that the block below:

def get_allowed_control_characters(self):
    num_items = self.num_items_seen
    is_on_top = self.root.context.active_parser.object_stack[-1] == self
    if (not is_on_top) and self.root.context.active_parser.last_non_whitespace_character != "[":
        # If there is an active parser above us, and the last character is not [,
        # there is an active item parser on the stack that we did not count yet.
        num_items += 1
    control_characters = ""
    has_enough_items = self.min_items is None or num_items >= self.min_items
    can_add_another_item = self.max_items is None or num_items < self.max_items

    if num_items > 0 and can_add_another_item:
        control_characters += ","
    if has_enough_items:
        control_characters += "]"
    return control_characters

uses self.root.context.active_parser.last_non_whitespace_character, which in turn relies on jsonschemaparser.WHITESPACE_CHARACTERS. When that set excludes, for example, tab, this block allows a comma to follow a tab, so [ ,] can be generated even though it is invalid JSON. In practice this rarely occurs, because it also requires is_on_top to be False (which usually happens only with schemas containing a list that does not require a minimum number of items).
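To make the failure mode concrete, here is a minimal standalone sketch of the decision logic quoted above. It does not import the library: last_non_whitespace_character and allowed_control_characters are hypothetical free functions that mirror the quoted method, and the "default" whitespace set of space, tab, newline and carriage return is an assumption rather than a value taken from the library.

```python
from typing import Optional


def last_non_whitespace_character(text: str, whitespace_chars: str) -> str:
    """Return the last character of text that is not in whitespace_chars ('' if none)."""
    for ch in reversed(text):
        if ch not in whitespace_chars:
            return ch
    return ""


def allowed_control_characters(
    num_items_seen: int,
    is_on_top: bool,
    last_char: str,
    min_items: Optional[int] = None,
    max_items: Optional[int] = None,
) -> str:
    """Standalone copy of the comma / closing-bracket decision quoted in the issue."""
    num_items = num_items_seen
    if (not is_on_top) and last_char != "[":
        # Same heuristic as the quoted code: assume an uncounted item is in progress.
        num_items += 1
    control_characters = ""
    has_enough_items = min_items is None or num_items >= min_items
    can_add_another_item = max_items is None or num_items < max_items
    if num_items > 0 and can_add_another_item:
        control_characters += ","
    if has_enough_items:
        control_characters += "]"
    return control_characters


generated = "[\t"          # an opened list followed by a tab, no items yet
assumed_default_ws = " \t\n\r"  # assumption: tab is treated as whitespace
customised_ws = " "        # tab removed from the whitespace set

default_result = allowed_control_characters(
    0, False, last_non_whitespace_character(generated, assumed_default_ws)
)
customised_result = allowed_control_characters(
    0, False, last_non_whitespace_character(generated, customised_ws)
)

print(repr(default_result))     # ']'  : only closing the empty list is allowed
print(repr(customised_result))  # ',]' : a comma is wrongly allowed after "[\t"
```

With the customised set, the tab itself is reported as the last non-whitespace character, so the heuristic counts a phantom item and permits the comma.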

The solution is to change the self.root.context.active_parser.last_non_whitespace_character != "[" check so that it tests whether anything that would count as a valid value for the list's item type appears between the latest '[' and the end of the generation. Alternatively, a separate list of whitespace characters could be kept for this check, independent of the customisable list that determines the valid tokens at each step.
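The second alternative could look roughly like the sketch below. It is an illustration under stated assumptions, not a patch against the library: JSON_WHITESPACE, last_structural_character and list_is_still_empty are hypothetical names, and the fixed set is the four whitespace characters permitted by the JSON grammar (space, tab, line feed, carriage return).

```python
# Fixed set used only for structural checks; it is never customised.
# These are the whitespace characters allowed by the JSON grammar.
JSON_WHITESPACE = " \t\n\r"


def last_structural_character(text: str) -> str:
    """Return the last character of text that is not JSON whitespace ('' if none)."""
    for ch in reversed(text):
        if ch not in JSON_WHITESPACE:
            return ch
    return ""


def list_is_still_empty(text: str) -> bool:
    """True if nothing but JSON whitespace follows the most recent '['."""
    return last_structural_character(text) == "["


# The check no longer depends on which whitespace characters generation may emit.
print(list_is_still_empty("[\t"))       # True
print(list_is_still_empty("[ \n\r\t"))  # True
print(list_is_still_empty("[1"))        # False
```

Because the structural check uses its own constant, narrowing the customisable whitespace set would only restrict which tokens may be generated and would not change how already generated text is interpreted.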

noamgat commented 4 months ago

If what you are trying to achieve is concise generation, the best way to do it is via the max_consecutive_whitespaces field of CharacterLevelParserConfig. If you are using code, you can do it by modifying the config object, and if you are using a vLLM server or any other code-limiting workflow, you can do it via the LMFE_MAX_CONSECUTIVE_WHITESPACES environment variable. See the configuration options section of the README for more information. Closing the issue; please reopen if this is not a valid solution for you.
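As a conceptual illustration of what a cap on consecutive whitespace does, here is a small standalone sketch. It is not the library's implementation: trailing_whitespace_count and whitespace_allowed are hypothetical helpers, the whitespace set is an assumption, and the exact boundary behaviour of the real max_consecutive_whitespaces setting may differ.

```python
def trailing_whitespace_count(text: str, whitespace_chars: str = " \t\n\r") -> int:
    """Count how many whitespace characters sit at the very end of text."""
    count = 0
    for ch in reversed(text):
        if ch not in whitespace_chars:
            break
        count += 1
    return count


def whitespace_allowed(text: str, max_consecutive_whitespaces: int) -> bool:
    """True if one more whitespace character would stay within the cap."""
    return trailing_whitespace_count(text) < max_consecutive_whitespaces


cap = 2  # illustrative value, not the library default
print(whitespace_allowed('{"a":', cap))    # True: no trailing whitespace yet
print(whitespace_allowed('{"a": ', cap))   # True: one trailing space
print(whitespace_allowed('{"a":  ', cap))  # False: cap of two already reached
```

A low cap keeps output compact without changing which characters count as whitespace, so the list parsing logic discussed above is unaffected.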