jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/
Other
30.23k stars 1.57k forks source link

Static typing for jq, in a jq way #1952

Open nicowilliams opened 5 years ago

nicowilliams commented 5 years ago

[This is just a note so I don't forget, and a place for discussion.]

One of the things I dislike about jq is that it is dynamically-typed. It's felt like jq must be dynamically typed. But what if we can add static typing in a jq-ish way? What would that look like?

Simple Data Types

Let's start simple. Suppose we have filters like Object or Thing which are the same as ., but which cause it to be an error for the input to not be an {} or Thing (whatever a Thing might be)? Then we could instantly declare the input and output data types of expressions:

# A Thing constructor -- takes an object and outputs a Thing:
def makeThing: Object | ... | Thing;

Initially we could prototype this as run-time typing: just define functions Object, Array, and so on which raise an error if the input isn't an object, array, ... Now, the real trick would be to make the compiler aware of these semantics so it can check them at compile time -- then we'd have static typing. Besides more safety, static typing would allow us to have more monomorphization at compile-time, thus better performance (though to get really good performance we'd have to compile to object code). Similarly, we could have performant non-object record types that behave like [type-constrained] objects. We could also add more polymorphism without causing more confusion.

What else can we do? Well, if an expression ends in a conditional, and the arms of the conditional end in type assertions, then we can haz algebraic types:

def makeFooOrBaz: Object | if make_a_foo then ... | Foo else ... | Baz end;

We could say that makeFooOrBaz has type Object->Foo//Baz!

We haven't discussed yet how to define types, but that would be a lot like a normal zero-argument function def whose body consists of nothing but assertions about the type/form/contents of .. To make it easier to define types we might have the convention that a Type/1 function is one that takes a path expression and outputs just ., and which asserts that all the values at paths matched by the path expression are of type Type:

# Define a type Thing, which defines Thing/0 and Thing/1
def Type Thing: Object | Number(.n) | String(.thing_name);

What about Function Types?

We already saw that we could say that a function has type InputType -> OutputType, and that these types can be algebraic, and that we could say that a function has type InputType -> Thing1 // Thing2.

Now, in jq function values are second-class values. Also, functions of more than zero arguments are not even second-class values. So function types should be second-class types, in some sense.

Let's think of empty. We need a function type for empty. Obviously we should call the function type for empty... Empty! A function that outputs a Thing or an Empty might be said to output a Thing?!

What about a function that outputs a stream of Things? We could say it's a SomeInput -> Thing[]!

What about a function that outputs an array of Things? We could say it's a SomeInput -> [Thing]!

What about a function that outputs two Thing values? We could say it's a SomeInput -> Thing[2]!

Now, if we want the language to understand these types, then we've definitely left the realm of what can be implemented as run-time typing, and we now need new syntax and compiler support for it.

Getting Fancy

What more could we say about functions?

For zero-argument functions I suppose we could say that they must be path expressions. I'm not sure how to do that. Maybe Path(ValueTypeHere)?

Functions of more than zero arguments can never be even second-class values, so there's not much value to having language support for typing them unless we want polymorphism for them (i.e., multiple definitions of functions of the same name and arity, dispatching on input value and argument expression types). But that would be useful, so it's worth thinking about. And this also demonstrates that function types for zero-argument functions would be useful for that reason.

Function types for zero-argument functions would also be useful as a way to declare the types of expressions passed to functions of one or more arguments. Now we're getting fancy. I'm not sure what this would look like. Maybe:

def f((Foo->Bar)foo): # here foo is an expression of type `Foo -> Bar`
    ...;

Of course, to get polymorphism we'd need a new def syntax, since today def replaces (obscures, really) any previous def of the same name and arity.

Getting Really Fancy

So how to define types?

def Type TypeNameHere: <type expression>;

A type expression can be very simple: expressions that use only type functions, e.g.,

def Type Foo: Bar | Thing1 or Thing2;  # consumes Bars, outputs Thing1s or Thing2s

Type variables, if we have them at all, would be helpful for expanding polymorphism:

def Function f: a | b; # defines f as an a -> b
def Function f: a | b;
def Type Thing: ...;
def Type Foo: ...;
def f: Thing | ... | Foo; # defines an f as an a -> b where a is a Thing and b is a Foo

We might want a Not/1 type expression so we can say what type something isn't.

We might want a way to say that a type totally lists the paths in it and their value types:

# A thing is an object with only a .name string and a .n numeric
def Type Thing: Object | total(String(.name) | Number(.n));

We might be able to define behaviors for existing operators as applied to values of particular types:

def Type TwoTuple: Object | Any(.a, .b);
def Operator TwoTuple[]: .a, .b;
def Operator TwoTuple[i]: [.a,.b][i];

What about main expressions?

Since the input to the main expression can be of any type, we might need to have an Any or Undetermined or Unknown type. Or the ability to make a type assertion a run-time assertion so it can be caught:

try Thing catch ("Warning: ignoring non-Thing inputs"|warn|empty) | ...

Here the compiler would have to understand that the intent is to have Thing be a run-time assertion. We might need syntax for that:

try Thing? ...

Conclusion

Type assertions that look like and behave like named aliases of . would give us a great deal of power without sacrificing the simplicity of the jq programming language or greatly complicating jq programs. We wouldn't need special syntax for declaring types. Simply peppering type assertions into expressions would allow the compiler to back-propagate type information and statically check that expressions' input/output value types agree with their surrounding expression contexts.

This is just a sketch. No implementation work has occurred. The compiler would need a significant revamp to implement static typing, and complete rewrite to support compilation to object code (so we're likely never to go there).

We can flesh out this concept in the comments here on this issue. And maybe begin to think about how to implement.

nicowilliams commented 5 years ago

On IRC the first comment, by @Taneb, I got is that having type variables and inference would be nice. I agree!

We can have inference without type variables, but certainly type variables would be very nice. E.g., we could say that . is an a -> a.

With type variables, if we know that + is a -> a -> a then we can type-check an expression like .[]|(.+.) as applied to an expression whose outputs have values like [[0,1,2],3] and raise a compile-time error.

So in fact we'd need type variables, and we'd need syntax for function type declarations.

EDIT: I've updated the issue text to discuss type variables.

nicowilliams commented 5 years ago

I think we probably shouldn't have typeclasses -- jq is a DSL.

Taneb commented 5 years ago

If we're tracking which fields an object has and what their types are in the type system, then + for objects does not have type a -> a -> a, but something more complicated (we need to track all fields and types on the right hand side, and all fields and types that aren't in the right hand side on the left hand side, and represent them in the result type).

On the other hand, if we don't track what fields and object has and what their types are, then we lose a lot of power. The type of .foo goes from {foo: a, ...} -> a to Object -> Any or something similar, which is a lot less helpful.

leonid-s-usov commented 5 years ago

Although this is a fun topic to discuss, before diving into details and syntax and whatnot I would like to clarify one thing: Why?

Why does JQ need static (or strict runtime) typing? What benefits does it give to JQ users? What existing problems does it solve?

nicowilliams commented 5 years ago

@leonid-s-usov because:

One can define a schema like so today without any changes to jq:

def isa(p; kind): . as $dot|path(p) as $path|getpath($path)|kind|$dot;
def Object: (type=="object") or ("not an object"|error);
def Object(p): isa(p; Object);
def Array: (type=="array") or ("not an array"|error);
def Array(p): isa(p; Array);
def True: (.==true) or ("not true"|error);
def True(p): isa(p; True);
def String: (type=="string") or ("not a string"|error);
def String(p): isa(p; String);
def Number: (type=="number") or ("not a number"|error);
def Number(p): isa(p; Number);
# ...
def Thing:
    Object
  | True(.kind|String == "Thing")
  | String(.name)
  | Number(.n)
  | MatchAll(.name, .kind, .n);

def Thing(p) : isa(p; Thing);
def Things: Array | Thing(.[]);
def Things(p): isa(p; Things);
# ...

With just some syntactic sugar we can make it easier to follow this pattern by adding Type definition syntactic sugar. Then with some compiler smarts we can turn run-time type assertions into compile-time type checking. That's how I came up with this.

We could start with just the above, then add def Type Name: ...; syntax, then add compile-time type checking, then add def Function Name: ...; syntax and add more compile-time type checking. The idea is that we can progressively improve the type system as needed but in a way that is elegant and that doesn't paint us into a corner.

In any case, I opened this issue to record this idea and have a place to discuss it, but I have zero energy for implementing it. I still owe you your code review...

nicowilliams commented 5 years ago

@Taneb Yeah, and this might help us reason about what backwards-incompatible changes to make in a 2.0 release. We'd want all polymorphic operators (and functions, once we have function polymorphism as a side-effect of adding a type system) to have the same type signature.

BTW, the "arguments" to a jq-coded function of arity larger than zero are not data-value arguments, but second-class zero-arity-function-value arguments. And all jq functions take exactly one input non-function (data) value argument (.), and output exactly one such value as well (though they could instead empty). Of course, we can still use Haskell arrow notation to describe the types of all jq functions. So jq-coded functions can have signatures like a->a, a->b, or a->(x->y)->b (with a "function" argument for each jq function argument), and so on, while C-coded builtin jq functions cannot have function arguments.

pkoppstein commented 5 years ago

@nicowilliams -

Since the discussion here has touched on JSON schemas, I'd like to mention the "JSON Extended Structural Schema language" (JESS), which now has some (jq) tools to go with it (see the "Modules" page on the jq wiki).

The main points I'd like to make about JESS as a schema language for JSON are:

1) Each JESS schema consists of one or more JSON texts. No exceptions.

2) JESS extends a very simple structural schema language (SSS) in which schemas mirror their target documents exactly. In other words, simple schemas are visually simple.

(E.g. the SSS schema for an array of integers is simply [ "integer"], and for objects with string-valued keys, "a" and "b": { "a":"string", "b": "string"})

3) JESS supports complex constraints, e.g. recursive constraints, in-document referential integrity constraints, etc, while never escaping the bounds of the JSON box.

4) The JESS language support for complex constraints piggy-backs off jq's built-in functions (including .[foo], .[N], .[], most of the arity-0 filters, the regex functions, range/N, etc) and jq's concept of pipelines. In fact, the JESS verification engine includes what amounts to an interpreter for a large subset of jq.

5) JESS has both a "nullable" and a "non-nullable" mode, to determine, for example, if types such as "string" include null.

6) A schema can be spread amongst multiple files (each containing a stream of valid JSON texts).

7) Users can add type names without adding any jq defs, thus avoiding "polluting" the function name-space. This is done via "preludes", which are JSON objects. A special form of "addition" is used so that these preludes can also be written in a modular way while allowing overlap. E.g. ZipCode could be defined by a regex, and different prelude files might share its definition.

JESS has been applied "in the wild". An example of a JESS schema showing a couple of non-trivial constraints is at https://stackoverflow.com/questions/57291339/validate-each-object-key-in-the-object-array-with-json-schema-in-ajv

The JESS verification engine has been used on large datasets, e.g. the well-known citylots.json file mentioned on the jq wiki.

Since the JESS verification engine is written in jq, it exposes a useful API for the jq programmer (e.g. conforms_to(SCHEMA)).

One of the main reasons I mention all this here is that the utility of JESS and similar schema languages would be significantly enhanced if there were certain enhancements to jq. Most of these potential enhancements are already "well-known", so for now I'll just make a plug for two:

[EDIT: Point (7) was added.]

fadado commented 5 years ago

Although this is a fun topic to discuss, before diving into details and syntax and whatnot I would like to clarify one thing: Why?

What existing problems does it solve?

My main problem is to understand at a glance what a filter does. For my own use I designed a pseudo type-declaration for JQ. I put the declarations in comments, and this my design description:

Proposed grammar for filter type signatures

Type anotation                          Parameter              Value
    :: Places                               Value                  null
Places                                      Value->Stream¹         boolean
    Output                              Output                     number
    => Output                               Stream                 string
    Input => Output                         !²                     array
    (Parameters) => Output              Stream                     object
    Input|(Parameters) => Output            @³                     [Value]
Parameters                                  Value                  {Value}
    Parameter                               ?Value⁴                <Value>⁶
    Parameter; Parameters                   *Value                 Value^Value⁷
Input                                       +Value                 Letter⁸
    Value                                   Stream!⁵               Name⁹

Notes:

  1. Parameters passed by name are like parameterless filters.
  2. The character ! is the display symbol for non-terminating filters type.
  3. The character @ denotes the empty stream. Use only when results are never expected.
  4. Occurrence indicators (?, *, +) have the usual meaning.
  5. Streams output type always have an implicit union with !. To add only when non-termination is expected.
  6. Indistinct array or object: <a> ≡ [a]^{a}.
  7. Union of two value types.
  8. Single lowercase letters are type variables representing indeterminate JSON value types.
  9. Named object (construct only with the underscore character and uppercase letters).

Type signature examples

Those are the type signatures for some jq builtins:

    # empty      :: a => @
    # .          :: a => a
    # error      :: a => !
    # first(g)   :: a|(a->*b) => ?b
    # isempty(g) :: a|(a->*b) => boolean
    # select(p)  :: a|(a->boolean) => ?a
    # recurse(f) :: a|(a->?a) => +a
    # while(p;f) :: a|(a->boolean;a->?a) => *a
    # until(p;f) :: a|(a->boolean;a->?a) => a!
    # map(f)     :: [a]|(a->*b) => [b]

Real code example

# Generate ℕ
def seq: #:: => *number
    0|recurse(.+1)
;
def seq($a): #:: (number) => *number
    $a|recurse(.+1)
;
def seq($a; $d): #:: (number;$number) => *number
    $a|recurse(.+$d)
;
leonid-s-usov commented 5 years ago

The issue with data errors is that they are almost solely caused by the incoming JSON data. Since JQ is a “sed for JSON” it’s having very little if at all statically defined structures which would give benefit of static type checking

As a user of JQ I was rather suffering from the debugging complexity and syntax induced support-ability issues with non trivial scripts

I would appreciate examples of when you think such static checking would actually help.

I’m afraid static type information as you describe it will give little usage benefit while complicating many things.

I’d try to proceed with the solutions first describing the benefits/solved issues from the user perspective. Some of those could be solved with type info, and some could just be solved by enhancing features we have today already

I believe there are places where type information could be added transparently, like for example for some or all of the builtins, potentially allowing to optimize internal logic, speeding up execution or allowing for better errors. However, I’m still not sure if such refactoring will be worth the net outcome.

Sent from mobile device, typos and false corrections are possible

Regards, Leonid S. Usov

On 1 Aug 2019, at 10:22, pkoppstein notifications@github.com wrote:

@nicowilliams -

Since the discussion here has touched on JSON schemas, I'd like to mention the "JSON Extended Structural Schema language" (JESS), which now has some (jq) tools to go with it (see the "Modules" page on the jq wiki).

The main points I'd like to make about JESS as a schema language for JSON are:

Each JESS schema consists of one or more JSON texts. No exceptions.

JESS extends a very simple structural schema language (SSS) in which schemas mirror their target documents exactly. In other words, simple schemas are visually simple.

(E.g. the SSS schema for an array of integers is simply [ "integer"], and for objects with string-valued keys, "a" and "b": { "a":"string", "b": "string"})

JESS supports complex constraints, e.g. recursive constraints, in-document referential integrity constraints, etc, while never escaping the bounds of the JSON box.

The JESS language support for complex constraints piggy-backs off jq's built-in functions (including .[foo], .[N], .[], most of the arity-0 filters, the regex functions, range/N, etc) and jq's concept of pipelines. In fact, the JESS verification engine includes what amounts to an interpreter for a large subset of jq.

JESS has both a "nullable" and a "non-nullable" mode, to determine, for example, if types such as "string" include null.

A schema can be spread amongst multiple files (each containing a stream of valid JSON texts).

JESS has been applied "in the wild". An example of a JESS schema showing a couple of non-trivial constraints is at https://stackoverflow.com/questions/57291339/validate-each-object-key-in-the-object-array-with-json-schema-in-ajv

The JESS verification engine has been used on large datasets, e.g. the well-known citylots.json file mentioned on the jq wiki.

Since the JESS verifcation engine is written in jq, it exposes a useful API for the jq programmer (e.g. conforms_to(SCHEMA)).

One of the main reasons I mention all this here is that the utility of JESS and similar schema languages would be significantly enhanced if there were certain enhancements to jq. Most of these potential enhancements are already "well-known", so for now I'll just make a plug for one: currently, the "search" metadata for a module must be expressed statically.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

pkoppstein commented 5 years ago

@leonid-s-usov - Your most recent posting includes a copy of my previous posting, so let me give an example that stems from Point 7 of my updated post.

Suppose one of my functions expects a certain field to contain a "Date" value according to some specification that I would like to keep in a common data repository (shared with other users, other projects, whatever). JESS as it already exists allows me to include in my jq program something like:

assert(.date | conforms_to( "Date"); "(\.) does not conform with the Date specification")

without requiring the addition of any functions besides the fixed, small set of functions already defined in the JESS module.

nicowilliams commented 5 years ago

@leonid-s-usov for input data you want validation, and these type assertions would do that (since the compiler could not determine the type of an input, the assertion would be deferred to run-time).

Once you have input validation, however, you may still want to ensure that the rest of your code is type-safe.

leonid-s-usov commented 5 years ago

OK, so we are talking runtime here. And what it is going to add is early assertion when types arent' what we expect them to be? How is the static data type helping here?

Question: will there be an additional incentive for the users to actually annotate the script apart from receiving an error that d is not a date at the parsing stage (?) rather than at the execute stage when a date conversion function fails?

pkoppstein commented 5 years ago

OK, so we are talking runtime here.

@leonid-s-usov - I don't see why static type analysis could not benefit the jq programmer as much as with other languages. E.g., with the Date example, if the compiler can determine from the type declarations that on entry to foo, D :: Date must be true, then it can SKIP the check for D. For this to be really useful, though, the compiler would have to understand declarations such as X :: {"date": "Date"} so that X.date | foo can be optimized.

Of course there are other advantages to static type checking as well, but this is well-ploughed terrain, so my suggestion would be to focus this discussion on jq specifics, or perhaps better yet to defer discussion until either some of the backlog of PRs and other issues is cleared, or until someone is ready to make the deep dive.

christian-elsee commented 3 months ago

lol, fun read guys, but why not take it a step further and make the same arguments, for static typing, in bash shell? If you want static typing, why not use one of the many languages that support it, and preferably one that is explicitly designed around that intent. I get it though: I mean who would want to use golang, when they could add a linter to javascript and call it static typing?

wader commented 3 months ago

In my opinion i think any type system or type checking should require no extra or as little extra syntax as possible for the casual end user to preserve to neat and terse-ness of jq which i really appreciate.

Some weeks ago i talked about this with someone and they mentioned that maybe what Erlang has around type checking could be interesting for jq: A soft-typing system for Erlang and Practical Type Inference Based on Success Typings ...funny enough i think i had both Sven and Konstantinos as lecturer when i studied in Uppsala :)

christian-elsee commented 3 months ago

Ahh great, the Swedes are here. I hope someone brought a pole and/or some kanelbulle, because otherwise this 4+ year-old conversation may take a slightly unexpected turn. If the case, just make sure to thank him for curl and everything should be ok.

to preserve to neat and terse-ness of jq which i really appreciate.

Amen @wader!

To be serious for a second, curl is amazing, so thank you um.. Sweden. hej da

wader commented 3 months ago

wave-hello

😄 i've meet Daniel a few times but i used to be colleague with his brother Björn! will forward the thanks next time i see him

Glad midsommar!