ER/RFC: Schema Inference

pkoppstein commented 9 years ago

This is partly an enhancement request, and partly a request for comments.

When confronted with a collection of JSON entities, it is often helpful to know whether there is an implicit schema, and if so, what it is. Even in the case of a single JSON document, it is often useful to have a structural overview, e.g. for navigation.

Spark SQL can infer a schema from a collection of JSON entities. It can be printed using printSchema().

The example given in the O'Reilly Spark book on p. 172 is the pair of records:

{"name": "Holden"}

{"name": "Sparky the Bear", "lovesPandas": true, "knows": {"friends": ["holden"]} }

Using the proposed schema.jq below, we find:

$ jq -s -f schema.jq spark.json
{
  "knows": {
    "friends": [
      "string"
    ]
  },
  "lovesPandas": "boolean",
  "name": "string"
}

This is equivalent to the schema inferred by Spark SQL except that:

the schema inference engine proposed here always produces a JSON entity; this facilitates the use of the inferred schemas, e.g. for integrity checking, while obviating the need for a special pretty-printer, since the results produced by the jq pretty-printer are eminently readable.
the Spark SQL schema adds "nullable" and "containsNull" annotations, whereas the proposed schema inference engine regards nulls as placeholders without any particular structural significance.

As illustrated by the above example, the absence of a key in an object also has no particular structural significance for either the Spark SQL inference engine or the one proposed here.

Three noteworthy features of the proposed schema inference engine are:

a) the introduction of "scalar" as an extended type, e.g. ["scalar"] is the extended type signifying an array of 0 or more elements of scalar type;

b) the introduction of "JSON" as an extended type, e.g. ["JSON"] is the extended type signifying an array of 0 or more elements of any type;

c) arrays are only characterized by the extended type of their elements.

Thus, the following JSON object conforms to the above-mentioned schema:

{"name": "Paddington", "lovesPandas": null, "knows": {"friends": ["holden", "Sparky"]}}

schema.jq

# Version 0.1

# This module defines three filters:
#   typeof/0 returns the extended-type of its input;
#   typeUnion(a;b) returns the union of the two specified extended-type values;
#   schema/0 returns the typeUnion of the extended-type values of the entities in the input array, if the input is an array;
#   otherwise it simply returns the "typeof" value of its input.

# Each extended type can be thought of as a set of JSON entities,
# e.g. "number" for the set of JSON numbers, and ["number"] for the
# set of JSON number-valued arrays including [].

# The extended-type values are always JSON entities.
# The possible values are:
# "null", "boolean", "string", "number";
# "scalar" for any combination of non-null scalars;
# [T] where T is an extended type;
# an object all of whose values are extended types;
# "JSON" signifying that no other extended-type value is applicable.

# The extended-type values are defined recursively:
# The extended-type of a scalar value is its JSON type.
# The extended-type of a non-empty array of values all of which have the same JSON type, t, is [t], and similarly for ["scalar"], and ["JSON"].
# The extended-type of [] is ["null"], signifying that the type of the array elements is indeterminate.
# The extended-type of an object is an object with the same keys, but the values of which are the extended-types of the corresponding values.

# typeUnion(a;b) returns the least extended-type value that subsumes both a and b.
# For example:
#  typeUnion("number"; "string") yields "scalar";
#  typeUnion({"a": "number"}; {"b": "string"}) yields {"a": "number", "b": "string"};
#  typeUnion("null", t) yields t for any valid extended type, t.

def typeUnion(a;b):
  def scalarp: . == "boolean" or . == "string" or . == "number" or . == "scalar";
  a as $a | b as $b
  | if $a == $b then $a
    elif ($a | scalarp) and ($b | scalarp) then "scalar"
    elif $a == "JSON" or $b == "JSON" then "JSON"
    elif ($a|type) == "array" and ($b|type) == "array" then [ typeUnion($a[0]; $b[0]) ]
    elif ($a|type) == "object" and ($b|type) == "object" then
      ((($a|keys) + ($b|keys)) | unique) as $keys
      | reduce $keys[] as $key ( {} ; .[$key] = typeUnion( $a[$key]; $b[$key]) )
    elif $a == "null" or $a == null then $b
    elif $b == "null" or $b == null then $a
    else "JSON"
    end ;

def typeof:
  def typeofArray:
    if length == 0 then ["null"]
    else [reduce .[] as $item (null; typeUnion(.; $item|typeof))]
    end ;
  def typeofObject:
    reduce keys[] as $key ( . ; .[$key] = (.[$key] | typeof) ) ;

  . as $in
  | type
  | if . == "string" or . == "number" or . == "null" or . == "boolean" then .
    elif . == "object" then $in | typeofObject
    else $in | typeofArray
    end ;

def schema:
  if type == "array" then reduce .[] as $x ("null";  typeUnion(.; $x|typeof))
  else typeof
  end ;

nicowilliams commented 9 years ago

I must say, this is very nice!

Here's a handy and simple schema inference program I use:

[path(..) | ["",(.[]|if type=="number" then "[]" else . end)]] |
 sort | unique | .[] | join(".") | sub("\\.\\[";"[") | sub("^\\[";".[")

What's handy about this is that it outputs jq path expressions. It needs a bit of work (to deal with object key names that need quoting because they aren't ident-like). What should we call this?

Should your and my schema utils be in 1.5, or in a module? I think this is almost a killer app for jq...

EDIT: formatting.

pkoppstein commented 9 years ago

Using your program (with \ properly escaped) on armor.json at https://github.com/CleverRaven/Cataclysm-DDA/blob/master/data/json/items/armor.json:

.[]
.[].//
.[].ammo
.[].bashing
.[].bashing_protection
.[].category
.[].charges_per_use
.[].color
.[].coverage
.[].covers
.[].covers.[]
.[].cut
.[].cutting
.[].description
.[].encumbrance
.[].environmental_protection
.[].flags
.[].flags.[]
.[].id
.[].initial_charges
.[].material
.[].material.[]
.[].material_thickness
.[].max_charges
.[].name
.[].name_plural
.[].note
.[].phase
.[].power_armor
.[].price
.[].properties
.[].properties.[]
.[].properties.[].[]
.[].qualities
.[].qualities.[]
.[].qualities.[].[]
.[].revert_to
.[].snippet_category
.[].snippet_category.[]
.[].snippet_category.[].id
.[].snippet_category.[].text
.[].storage
.[].symbol
.[].techniques
.[].techniques.[]
.[].to_hit
.[].turns_per_charge
.[].type
.[].use_action
.[].use_action.activate_msg
.[].use_action.deactive_msg
.[].use_action.need_sunlight
.[].use_action.type
.[].volume
.[].warmth
.[].weight

Using the 'schema' def in schema.jq at https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed#file-schema-jq the result is:

$ jq -r -f schema.jq /tmp/armor.json
{
  "//": "string",
  "ammo": "string",
  "bashing": "number",
  "bashing_protection": "number",
  "category": "string",
  "charges_per_use": "number",
  "color": "string",
  "coverage": "number",
  "covers": [
    "string"
  ],
  "cut": "number",
  "cutting": "number",
  "description": "string",
  "encumbrance": "number",
  "environmental_protection": "number",
  "flags": [
    "string"
  ],
  "id": "string",
  "initial_charges": "number",
  "material": "JSON",
  "material_thickness": "number",
  "max_charges": "number",
  "name": "string",
  "name_plural": "string",
  "note": "string",
  "phase": "string",
  "power_armor": "boolean",
  "price": "number",
  "properties": [
    [
      "string"
    ]
  ],
  "qualities": [
    [
      "scalar"
    ]
  ],
  "revert_to": "string",
  "snippet_category": [
    {
      "id": "string",
      "text": "string"
    }
  ],
  "storage": "number",
  "symbol": "string",
  "techniques": [
    "string"
  ],
  "to_hit": "number",
  "turns_per_charge": "number",
  "type": "string",
  "use_action": "JSON",
  "volume": "number",
  "warmth": "number",
  "weight": "number"
}

fadado commented 7 years ago

This is partly an enhancement request, and partly a request for comments.

When confronted with a collection of JSON entities, it is often helpful to know whether there is an implicit schema, and if so, what it is.

The following code generates a simple JSON Schema according to http://json-schema.org/latest/json-schema-validation.html:

def isobject:
    type == "object"
;
def isarray:
    type == "array"
;
def isscalar:
    type| . == "null" or . == "boolean" or . == "number" or . == "string"
;

def schema:
    { "type": type } +
    if isobject then
        if length == 0 then null
        else
            . as $object |
            { "properties": (
                reduce keys_unsorted[] as $name (
                    {};
                    . + {($name): ($object[$name] | schema)}
                )
              )
            }
        end
    elif isarray then
        if length == 0 then null
        else
            { "items": (
                if all(isscalar) and (map(type) | unique | length) == 1 then
                    { "type": (.[0] | type) }
                elif length == 1 then
                   .[0] | schema 
                else
                    reduce .[] as $item (
                        [];
                        .[length] = ($item | schema)
                    )
                end
              )
            }
        end
    else null end # scalar
;

For example, for this input:

{
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York"
  },
  "phoneNumber": [
    {
      "location": "home",
      "code": 44
    }
  ]
}

the generated schema is:

{
  "type": "object",
  "properties": {
    "address": {
      "type": "object",
      "properties": {
        "streetAddress": {
          "type": "string"
        },
        "city": {
          "type": "string"
        }
      }
    },
    "phoneNumber": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string"
          },
          "code": {
            "type": "number"
          }
        }
      }
    }
  }
}

nicowilliams commented 7 years ago

@fadado Beautiful! Can I borrow that for jq?

fadado commented 7 years ago

@fadado Beautiful! Can I borrow that for jq?

Of course! I have also written a validator, but it depends entirely on the modules of jq I have written; it is impossible to separate it. In any case, generator and validator are in my GitHub JBOL repository.

jqlang / jq

ER/RFC: Schema Inference #748

schema.jq