Remove need for JSON validator for consts and enums

hudson-ai commented 5 days ago

Re-writes consts as other types, e.g. arrays with specific prefixItems, ints with minimum==maximum, etc. Enums are then written as anyOf(consts).

This lets us use our internal validation logic rather than relying on jsonschema.

Note: I also fixed the grammar-builder's handling of UnsatisfiableSchemaError in anyOf -- it now appropriately propagates only when there are no valid schemas.

mmoskal commented 5 days ago

Did you check that we get sensible grammars for enum: ["foo", "bar", "baz"] and enum: [1, 2, 3] and also const: 1 and const: "foo" ? in particular the regex for 1 should be /1/

I think we can live with select("foo", "bar", "baz") instead of lexeme("foo|bar|baz") for enum for now but would be good to fix at some point (compiler can detect all branches that are regexes and combine them)

hudson-ai commented 5 days ago

Did you check that we get sensible grammars for enum: ["foo", "bar", "baz"] and enum: [1, 2, 3] and also const: 1 and const: "foo" ? in particular the regex for 1 should be /1/

I think we can live with select("foo", "bar", "baz") instead of lexeme("foo|bar|baz") for enum for now but would be good to fix at some point (compiler can detect all branches that are regexes and combine them)

Currently just calling select for enums, so we get this for enum: ["foo", "bar", "baz"] (1,2,3 looks the same except they aren't treated as json_string lexemes)

[{'Join': {'sequence': [4],
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'Lexeme': {'rx': 'foo',
      'contextual': None,
      'temperature': None,
      'json_string': True,
      'json_allowed_escapes': None,
      'json_raw': None,
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'Lexeme': {'rx': 'bar',
      'contextual': None,
      'temperature': None,
      'json_string': True,
      'json_allowed_escapes': None,
      'json_raw': None,
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'Lexeme': {'rx': 'baz',
      'contextual': None,
      'temperature': None,
      'json_string': True,
      'json_allowed_escapes': None,
      'json_raw': None,
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'Select': {'among': [1, 2, 3],
      'max_tokens': None,
      'name': None,
      'capture_name': None}}]

hudson-ai commented 5 days ago

And consts will look the same but without the select. E.g. const: 1

[{'Join': {'sequence': [1],
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'Lexeme': {'rx': '(1)',
      'contextual': None,
      'temperature': None,
      'json_string': False,
      'json_allowed_escapes': None,
      'json_raw': None,
      'max_tokens': None,
      'name': None,
      'capture_name': None}}]

hudson-ai commented 5 days ago

For more complex types, e.g. {"const": {"key": "value"}}, we are getting a full grammar out of it, even though we could in principle rewrite it as a regex...

[{'Join': {'sequence': [9],
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'Lexeme': {'rx': 'value',
      'contextual': None,
      'temperature': None,
      'json_string': True,
      'json_allowed_escapes': None,
      'json_raw': None,
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'String': {'literal': '"key"',
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'String': {'literal': ': ',
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'String': {'literal': '',
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'Join': {'sequence': [2, 3, 1],
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'String': {'literal': '{',
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'String': {'literal': ', ',
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'String': {'literal': '}',
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'Join': {'sequence': [6, 5, 8],
      'max_tokens': None,
      'name': None,
      'capture_name': None}}]

mmoskal commented 5 days ago

wonderful! good to merge!

mmoskal commented 5 days ago

I don't think there's much point rewriting general consts as regexes. The one I worry about is a very large enum with just strings in it (I think it's relatively common) - it's easy to do regex and it will avoid parser limits.

hudson-ai commented 5 days ago

I don't think there's much point rewriting general consts as regexes. The one I worry about is a very large enum with just strings in it (I think it's relatively common) - it's easy to do regex and it will avoid parser limits.

Totally. And I like the idea of detecting this kind of situation in the builder itself.

mmoskal commented 5 days ago

you mean GrammarBuilder? I think it's not always valid (though it would be valid for JSON)

microsoft / llguidance

Remove need for JSON validator for consts and enums #59