jcrist / msgspec

A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
https://jcristharif.com/msgspec/
BSD 3-Clause "New" or "Revised" License

Investigate constrained mapping keys, JSON Schema patternProperties #576

Closed bollwyvl closed 7 months ago

bollwyvl commented 8 months ago

Elevator Pitch

This PR explores the current support for constraining the keys of dict-like objects, where using a single attribute as the key of a mapping enforces some level of "uniqueness" on that attribute.

Changes

Motivation

This mostly revolves around the following kind of extension to the JSON schema example:

#: A string constrained to a Global Trade Item Number (GTIN) 
GTINString = Annotated[str, Meta(pattern=r"^\d{14}$")]

class Catalog(Struct):
    """A catalog of products"""
    skus: dict[GTINString, Product]

Which yields something like:

b'{"$ref":"#/$defs/Catalog","$defs":{"Catalog":{"title":"Catalog","description":"A catalog of products","type":"object","properties":{"skus":{"type":"object","patternProperties":{"^\\\\d{14}$":{"$ref":"#/$defs/Product"}}}},"required":["skus"]},"Product":{"title":"Product","description":"A product in a catalog","type":"object","properties":{"id":{"type":"integer"},"name":{"type":"string"},"price":{"type":"number","exclusiveMinimum":0},"tags":{"type":"array","items":{"type":"string"},"default":[]},"dimensions":{"anyOf":[{"type":"null"},{"$ref":"#/$defs/Dimensions"}],"default":null}},"required":["id","name","price"]},"Dimensions":{"title":"Dimensions","description":"Dimensions for a product, all measurements in centimeters","type":"object","properties":{"length":{"type":"number","exclusiveMinimum":0},"width":{"type":"number","exclusiveMinimum":0},"height":{"type":"number","exclusiveMinimum":0}},"required":["length","width","height"]}}}'

I've been looking into applying msgspec to e.g. the Jupyter notebook format. It uses a number of patterns, such as MIME types, as the keys of mappings for constrained but extensible metadata. I'm not entirely certain what this pattern (ha!) looks like yet, from an annotated-types(+/-msgspec.Meta) perspective.

      "mimebundle": {
        "description": "A mime-type keyed dictionary of data",
        "type": "object",
        "additionalProperties": {
          "description": "mimetype output (e.g. text/plain), represented as either an array of strings or a string.",
          "$ref": "#/definitions/misc/multiline_string"
        },
        "patternProperties": {
          "^application/(.*\\+)?json$": {
            "description": "Mimetypes with JSON output, can be any type"
          }
        }
      },

jcrist commented 8 months ago

Adding a schema for constraining the keys makes sense to me! I wonder if we should use propertyNames (https://json-schema.org/draft/2020-12/json-schema-core#section-10.3.2.4) instead of patternProperties though? This would work with all of the constraints available on strings (min_length, max_length, and pattern), not just pattern.
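
To make the suggestion concrete, here is a minimal standard-library sketch of what a `propertyNames` subschema asserts: every key of the object must satisfy the given string constraints. The helper name `check_property_names` is hypothetical and not part of msgspec or any JSON Schema library, and it only handles the three constraints mentioned above.

```python
import re


def check_property_names(obj: dict, names_schema: dict) -> list[str]:
    """Return the keys of `obj` that violate a `propertyNames` subschema.

    Only `minLength`, `maxLength`, and `pattern` are handled in this sketch.
    """
    bad = []
    for key in obj:
        if len(key) < names_schema.get("minLength", 0):
            bad.append(key)
        elif "maxLength" in names_schema and len(key) > names_schema["maxLength"]:
            bad.append(key)
        elif "pattern" in names_schema and not re.search(names_schema["pattern"], key):
            # JSON Schema patterns are unanchored searches, hence re.search
            bad.append(key)
    return bad


# The GTIN pattern from the PR description, applied to the keys only
names_schema = {"pattern": r"^\d{14}$"}
print(check_property_names({"01234567890123": {}, "foo": {}}, names_schema))
```

Note that the values are never inspected here: `propertyNames` only restricts which keys are allowed, which is why it composes with length constraints as well as `pattern`.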

bollwyvl commented 8 months ago

No worries at all about any delays, and indeed, thanks for the quick turn on the fix! Again, this PR is just investigating how close I can get to some existing schema patterns, where msgspec (or some other declarative system) might provide some more ergonomic ways to build portable representations that preserve validation and documentation intent. My real dream is to go the other way, from JSON schema to a performant parser/validator, but codegen can go a long way, once the edge cases are possible.

propertyNames instead of patternProperties

Perhaps in addition to: while propertyNames would provide a better validation experience for a downstream user supplying a disallowed key (e.g. key 'foo' did not match '...'), I don't believe it is intended to supplant the intent of "if this pattern is found, require this schema", particularly when multiple patterns are present.
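
As a rough illustration of that intent, the sketch below mimics how `patternProperties` and `additionalProperties` interact in the `mimebundle` fragment quoted earlier in this thread. It uses only the standard library; `check_object` and `is_multiline_string` are hypothetical names, and the latter is a stand-in for the referenced `multiline_string` definition (a string or an array of strings).

```python
import re
from typing import Any, Callable

Check = Callable[[Any], bool]


def is_multiline_string(value: Any) -> bool:
    """Stand-in for the `multiline_string` definition: a str or a list of str."""
    if isinstance(value, str):
        return True
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def check_object(obj: dict, pattern_props: dict[str, Check], additional: Check) -> list[str]:
    """Return the keys whose values fail validation.

    A key matching one or more patterns must satisfy every matching check;
    `additional` applies only to keys that match no pattern at all.
    """
    bad = []
    for key, value in obj.items():
        matched = [check for pat, check in pattern_props.items() if re.search(pat, key)]
        checks = matched or [additional]
        if not all(check(value) for check in checks):
            bad.append(key)
    return bad


# JSON-ish MIME types accept any value; everything else must be stringish
pattern_props = {r"^application/(.*\+)?json$": lambda value: True}

bundle = {
    "application/json": {"fizz": "buzz"},
    "text/plain": "some text",
    "text/html": 123,
}
print(check_object(bundle, pattern_props, is_multiline_string))
```

The point of the sketch is that the value schema is chosen per key, which `propertyNames` alone cannot express.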

This works against main:

StringishContent = Annotated[str | list[str], Meta(description="mimetype output (e.g. text/plain), represented as either an array of strings or a string.")]
StringishMimeBundle = dict[str, StringishContent]
msgspec.json.schema(StringishMimeBundle)

{'type': 'object',
 'additionalProperties': {'description': 'mimetype output (e.g. text/plain), represented as either an array of strings or a string.',
  'anyOf': [{'type': 'string'},
   {'type': 'array', 'items': {'type': 'string'}}]}}

In this PR, this works:

JsonishMimeType = Annotated[str, Meta(pattern="^application/(.*\\+)?json$")]
JsonishContent = Annotated[dict[str, Any], Meta(description="Mimetypes with JSON output, can be any type")]
JsonishMimeBundle = dict[JsonishMimeType, JsonishContent]
msgspec.json.schema(JsonishMimeBundle)

{'type': 'object',
 'patternProperties': {'^application/(.*\\+)?json$': {'description': 'Mimetypes with JSON output, can be any type',
   'type': 'object'}}}

But trying the union of the two:

MimeBundle = StringishMimeBundle | JsonishMimeBundle
msgspec.json.schema(MimeBundle)

# ...

File ~/projects/msgspec_/msgspec/msgspec/inspect.py:725, in _Translator.run(self)
    721 def run(self):
    722     # First construct a decoder to validate the types are valid
    723     from ._core import MsgpackDecoder
--> 725     MsgpackDecoder(Tuple[self.types])
    726     return tuple(self.translate(t) for t in self.types)

TypeError: Type unions may not contain more than one dict type - type `dict[str, typing.Annotated[str | list[str], msgspec.Meta(description='mimetype output (e.g. text/plain), represented as either an array of strings or a string.')]] | dict[typing.Annotated[str, msgspec.Meta(pattern='^application/(.*\\+)?json$')], typing.Annotated[dict[str, typing.Any], msgspec.Meta(description='Mimetypes with JSON output, can be any type')]]` is not supported

jcrist commented 7 months ago

I've added support for forwarding string key constraints as propertyNames in #604.

the intent of if this pattern is found, require this schema, particularly when multiple patterns are present.

I see. The pattern you're asking for isn't really something that can be spelled with standard python type annotations, since a union applied outside the dict types means "this dict type OR the other dict type", not a union of their key/value pairs.

Using your example above:

MimeBundle = StringishMimeBundle | JsonishMimeBundle

# This would accept a dict of stringish mime types, OR a dict of jsonish mimetypes, but NOT one that mixes them
valid1 = {"application/json": jsonish_content1, "application/foo+json": jsonish_content2}
valid2 = {"text/plain": stringish_content1, "other": stringish_content2}

invalid = {"application/json": jsonish_content1, "text/plain": stringish_content1}  # IIUC you want to support this

The best way to support this in msgspec today would be to define a custom type representing the MimeType container, then encode/decode the contents using extension hooks.

Here's a hacked together complete example:

import re
from typing import ClassVar, Any
from collections.abc import MutableMapping, Iterator

import msgspec

class MimeBundle(MutableMapping):
    patterns: ClassVar[list[tuple[str, Any]]] = [
        ("^application/(.*\\+)?json$", dict[str, Any]),
        ("^.*$", str | list[str]),
    ]

    def __init__(self, data: dict[str, Any] | None = None, **kwargs: Any):
        self._data = dict(data) if data else {}
        self._data.update(kwargs)

    def __repr__(self):
        return f"MimeBundle({self._data})"

    def __getitem__(self, key: str) -> Any:
        return self._data[key]

    def __setitem__(self, key: str, value: Any) -> None:
        self._data[key] = value

    def __delitem__(self, key: str) -> None:
        del self._data[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)

    def __msgspec_encode__(self):
        return self._data

    @classmethod
    def __msgspec_decode__(cls, obj: Any):
        if not isinstance(obj, dict):
            raise ValueError("Expected an object")
        res = cls()
        for key, value in obj.items():
            for pattern, schema in cls.patterns:
                if re.search(pattern, key):
                    try:
                        res[key] = msgspec.convert(value, schema)
                    except msgspec.ValidationError as exc:
                        raise ValueError(f"Invalid value for {key!r}: {exc}")
                    break
            else:
                raise ValueError(f"{key} is not a valid mimetype")
        return res

    @classmethod
    def __msgspec_json_schema__(cls):
        return {
            "type": "object",
            "patternProperties": {
                pattern: msgspec.json.schema(schema)
                for pattern, schema in cls.patterns
            }
        }

# Define some hooks to support the custom type.
#
# The implementation of these hooks is completely up to you - here I've opted
# to have them dispatch to methods on the type to keep the implementations
# local to the `MimeBundle` type, but you could just as easily inline the
# implementations here. Currently msgspec doesn't look for any custom methods
# on the types themselves, so the method names used below are arbitrary and
# unique to this example.
def enc_hook(value):
    try:
        return value.__msgspec_encode__()
    except AttributeError:
        raise NotImplementedError

def dec_hook(type, obj):
    try:
        return type.__msgspec_decode__(obj)
    except AttributeError:
        raise NotImplementedError

def schema_hook(type):
    try:
        return type.__msgspec_json_schema__()
    except AttributeError:
        raise NotImplementedError

# --------------------------------------------
# A demo using the functionality defined above
# --------------------------------------------

valid = """
{
    "application/json": {"fizz": "buzz"},
    "application/foo+json": {"hello": "world"},
    "text/plain": "some text",
    "other": ["some", "more", "text"]
}
"""

invalid = """
{
    "text/plain": "some text",
    "other": ["a string", "another string", 123]
}
"""

# Decode into a MimeBundle type
bundle = msgspec.json.decode(valid, type=MimeBundle, dec_hook=dec_hook)
print(bundle)
#> MimeBundle(
#>     {
#>         'application/json': {'fizz': 'buzz'},
#>         'application/foo+json': {'hello': 'world'},
#>         'text/plain': 'some text',
#>         'other': ['some', 'more', 'text']
#>     }
#> )

# Raise a nice error on an invalid MimeBundle
try:
    msgspec.json.decode(invalid, type=MimeBundle, dec_hook=dec_hook)
except Exception as exc:
    print(repr(exc))
#> ValidationError("Invalid value for 'other': Expected `str`, got `int` - at `$[2]`")

# Encode a MimeBundle type
encoded = msgspec.json.encode(bundle, enc_hook=enc_hook)
print(encoded)
#> b'{"application/json":{"fizz":"buzz"},"application/foo+json":{"hello":"world"},"text/plain":"some text","other":["some","more","text"]}'

# Generate the JSON Schema for the MimeBundle type
schema = msgspec.json.schema(MimeBundle, schema_hook=schema_hook)
print(schema)
#> {
#>     'type': 'object',
#>     'patternProperties': {
#>         '^application/(.*\\+)?json$': {'type': 'object'},
#>         '^.*$': {'anyOf': [{'type': 'string'}, {'type': 'array', 'items': {'type': 'string'}}]}
#>     }
#> }

Hopefully that's enough to get you going. If you have further questions on how to implement patterns like this in msgspec, please don't hesitate to open an issue and ask. For now though I'm going to close this PR.