WHITESPACE_CHARACTERS customisation is incompatible with list parser logic.

Overriding jsonschemaparser.WHITESPACE_CHARACTERS to get more concise generation is a highly desirable feature, especially with models that tend to output newlines. This is already possible by monkeypatching, but it can't be used in production because of one obstacle, which could potentially be removed. The issue is in the block below:

def get_allowed_control_characters(self):
    num_items = self.num_items_seen
    is_on_top = self.root.context.active_parser.object_stack[-1] == self
    if (not is_on_top) and self.root.context.active_parser.last_non_whitespace_character != "[":
        # If there is an active parser above us, and the last character is not [,
        # there is an active item parser on the stack that we did not count yet.
        num_items += 1
    control_characters = ""
    has_enough_items = self.min_items is None or num_items >= self.min_items
    can_add_another_item = self.max_items is None or num_items < self.max_items

    if num_items > 0 and can_add_another_item:
        control_characters += ","
    if has_enough_items:
        control_characters += "]"
    return control_characters

This block uses self.root.context.active_parser.last_non_whitespace_character, which in turn relies on jsonschemaparser.WHITESPACE_CHARACTERS. When that set excludes (e.g.) tab, a tab following [ is treated as the last non-whitespace character, so the != "[" check passes and num_items is incremented even though no item has started. The block then allows a comma to follow the tab, so [ ,] can be generated, despite this being invalid JSON. In practice this rarely occurs, as it also requires is_on_top to be False (which usually happens only with schemas containing a list that does not require a minimum number of items).
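To make the failure mode concrete, here is a minimal standalone sketch that mimics the counting logic quoted above. It does not import lm-format-enforcer: the helper names last_non_whitespace and allowed_control_characters, and their parameters, are hypothetical stand-ins for the parser state, so treat it as an illustration of the reasoning rather than a reproduction against the real library.

```python
def last_non_whitespace(text, whitespace):
    """Return the last character of text that is not in whitespace, or None."""
    for ch in reversed(text):
        if ch not in whitespace:
            return ch
    return None


def allowed_control_characters(generated, whitespace, num_items_seen=0,
                               is_on_top=False, min_items=None, max_items=None):
    """Toy version of the list parser check, driven by a plain string."""
    num_items = num_items_seen
    if (not is_on_top) and last_non_whitespace(generated, whitespace) != "[":
        # Assumed to mean: an item parser is active that was not counted yet.
        num_items += 1
    control_characters = ""
    has_enough_items = min_items is None or num_items >= min_items
    can_add_another_item = max_items is None or num_items < max_items
    if num_items > 0 and can_add_another_item:
        control_characters += ","
    if has_enough_items:
        control_characters += "]"
    return control_characters


# Full JSON whitespace set: the tab is skipped, "[" is seen, no comma allowed.
default_result = allowed_control_characters("[\t", whitespace=" \t\n\r")

# Customised set without tab: the tab counts as content, so "," becomes legal.
patched_result = allowed_control_characters("[\t", whitespace=" ")

print(repr(default_result), repr(patched_result))
```

With the full whitespace set only "]" is offered after "[" plus a tab; with the reduced set the sketch also offers ",", which is the invalid continuation described above.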

One solution is to replace the self.root.context.active_parser.last_non_whitespace_character != "[" check with a check for whether anything between the latest '[' and the end of the generation counts as a valid value of the list's item type. Alternatively, a fixed set of whitespace characters could be kept for this check, separate from the customisable set that determines valid tokens at each step.
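The second option could look roughly like the sketch below. It is not a patch against the library: CharacterTracker, JSON_STRUCTURAL_WHITESPACE and generation_whitespace are hypothetical names used only to show the idea of tracking structure with the fixed JSON whitespace set (space, tab, line feed, carriage return) while leaving the generation set customisable.

```python
# Fixed set used only for structural bookkeeping (the four JSON whitespace chars).
JSON_STRUCTURAL_WHITESPACE = frozenset(" \t\n\r")


class CharacterTracker:
    """Hypothetical stand-in for the parser state that records the last character."""

    def __init__(self, generation_whitespace):
        # Customisable set: which whitespace the model may emit as padding.
        self.generation_whitespace = generation_whitespace
        self.last_non_whitespace_character = None

    def is_allowed_whitespace(self, ch):
        # Token filtering consults the customisable set only.
        return ch in self.generation_whitespace

    def add_character(self, ch):
        # Structural tracking always uses the fixed set, so a tab never
        # overwrites "[" even when tab is excluded from generation.
        if ch not in JSON_STRUCTURAL_WHITESPACE:
            self.last_non_whitespace_character = ch


tracker = CharacterTracker(generation_whitespace=" ")
for ch in "[\t":
    tracker.add_character(ch)

print(repr(tracker.last_non_whitespace_character))
```

The design choice here is that customising the generation set can only narrow what the model is allowed to emit, while the "[" check keeps seeing the same characters as whitespace regardless of the override.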

noamgat / lm-format-enforcer

WHITESPACE_CHARACTERS customisation is incompatible with list parser logic. #107