Open nicowilliams opened 5 years ago
On IRC the first comment, by @Taneb, I got is that having type variables and inference would be nice. I agree!
We can have inference without type variables, but certainly type variables would be very nice. E.g., we could say that .
is an a -> a
.
With type variables, if we know that +
is a -> a -> a
then we can type-check an expression like .[]|(.+.)
as applied to an expression whose outputs have values like [[0,1,2],3]
and raise a compile-time error.
So in fact we'd need type variables, and we'd need syntax for function type declarations.
EDIT: I've updated the issue text to discuss type variables.
I think we probably shouldn't have typeclasses -- jq is a DSL.
If we're tracking which fields an object has and what their types are in the type system, then +
for objects does not have type a -> a -> a
, but something more complicated (we need to track all fields and types on the right hand side, and all fields and types that aren't in the right hand side on the left hand side, and represent them in the result type).
On the other hand, if we don't track what fields and object has and what their types are, then we lose a lot of power. The type of .foo
goes from {foo: a, ...} -> a
to Object -> Any
or something similar, which is a lot less helpful.
Although this is a fun topic to discuss, before diving into details and syntax and whatnot I would like to clarify one thing: Why?
Why does JQ need static (or strict runtime) typing? What benefits does it give to JQ users? What existing problems does it solve?
@leonid-s-usov because:
many users (including myself) get tripped up by the ease with which one can end up with run-time errors due to iterating a scalar and so on -- some amount of static type safety would help a great deal, but it probably has to be minimally intrusive, and jq-ish (which this sketch is)
Often this happens because the input data is not clean, and a static type checker might not be too helpful if you don't want to throw that unclean input data out. But in many cases the input data is regular and the problem is bugs in the jq program. As usual, static type safety helps the latter case.
static types also provide a way to describe schema and validate that values adhere to it -- this is useful on its own, but being able to make the same syntax part of (optional) static typing is nice.
One can define a schema like so today without any changes to jq:
def isa(p; kind): . as $dot|path(p) as $path|getpath($path)|kind|$dot;
def Object: (type=="object") or ("not an object"|error);
def Object(p): isa(p; Object);
def Array: (type=="array") or ("not an array"|error);
def Array(p): isa(p; Array);
def True: (.==true) or ("not true"|error);
def True(p): isa(p; True);
def String: (type=="string") or ("not a string"|error);
def String(p): isa(p; String);
def Number: (type=="number") or ("not a number"|error);
def Number(p): isa(p; Number);
# ...
def Thing:
Object
| True(.kind|String == "Thing")
| String(.name)
| Number(.n)
| MatchAll(.name, .kind, .n);
def Thing(p) : isa(p; Thing);
def Things: Array | Thing(.[]);
def Things(p): isa(p; Things);
# ...
With just some syntactic sugar we can make it easier to follow this pattern by adding Type
definition syntactic sugar. Then with some compiler smarts we can turn run-time type assertions into compile-time type checking. That's how I came up with this.
We could start with just the above, then add def Type Name: ...;
syntax, then add compile-time type checking, then add def Function Name: ...;
syntax and add more compile-time type checking. The idea is that we can progressively improve the type system as needed but in a way that is elegant and that doesn't paint us into a corner.
In any case, I opened this issue to record this idea and have a place to discuss it, but I have zero energy for implementing it. I still owe you your code review...
@Taneb Yeah, and this might help us reason about what backwards-incompatible changes to make in a 2.0 release. We'd want all polymorphic operators (and functions, once we have function polymorphism as a side-effect of adding a type system) to have the same type signature.
BTW, the "arguments" to a jq-coded function of arity larger than zero are not data-value arguments, but second-class zero-arity-function-value arguments. And all jq functions take exactly one input non-function (data) value argument (.
), and output exactly one such value as well (though they could instead empty
). Of course, we can still use Haskell arrow notation to describe the types of all jq functions. So jq-coded functions can have signatures like a->a
, a->b
, or a->(x->y)->b
(with a "function" argument for each jq function argument), and so on, while C-coded builtin jq functions cannot have function arguments.
@nicowilliams -
Since the discussion here has touched on JSON schemas, I'd like to mention the "JSON Extended Structural Schema language" (JESS), which now has some (jq) tools to go with it (see the "Modules" page on the jq wiki).
The main points I'd like to make about JESS as a schema language for JSON are:
1) Each JESS schema consists of one or more JSON texts. No exceptions.
2) JESS extends a very simple structural schema language (SSS) in which schemas mirror their target documents exactly. In other words, simple schemas are visually simple.
(E.g. the SSS schema for an array of integers is simply [ "integer"]
, and for objects with string-valued keys, "a" and "b": { "a":"string", "b": "string"}
)
3) JESS supports complex constraints, e.g. recursive constraints, in-document referential integrity constraints, etc, while never escaping the bounds of the JSON box.
4) The JESS language support for complex constraints piggy-backs off jq's built-in functions (including .[foo], .[N], .[], most of the arity-0 filters, the regex functions, range/N, etc) and jq's concept of pipelines. In fact, the JESS verification engine includes what amounts to an interpreter for a large subset of jq.
5) JESS has both a "nullable" and a "non-nullable" mode, to determine, for example, if types such as "string" include null
.
6) A schema can be spread amongst multiple files (each containing a stream of valid JSON texts).
7) Users can add type names without adding any jq defs, thus avoiding "polluting" the function name-space. This is done via "preludes", which are JSON objects. A special form of "addition" is used so that these preludes can also be written in a modular way while allowing overlap. E.g. ZipCode could be defined by a regex, and different prelude files might share its definition.
JESS has been applied "in the wild". An example of a JESS schema showing a couple of non-trivial constraints is at https://stackoverflow.com/questions/57291339/validate-each-object-key-in-the-object-array-with-json-schema-in-ajv
The JESS verification engine has been used on large datasets, e.g. the well-known citylots.json file mentioned on the jq wiki.
Since the JESS verification engine is written in jq, it exposes a useful API for the jq programmer (e.g. conforms_to(SCHEMA)
).
One of the main reasons I mention all this here is that the utility of JESS and similar schema languages would be significantly enhanced if there were certain enhancements to jq. Most of these potential enhancements are already "well-known", so for now I'll just make a plug for two:
[EDIT: Point (7) was added.]
Although this is a fun topic to discuss, before diving into details and syntax and whatnot I would like to clarify one thing: Why?
What existing problems does it solve?
My main problem is to understand at a glance what a filter does. For my own use I designed a pseudo type-declaration for JQ. I put the declarations in comments, and this my design description:
Type anotation Parameter Value
:: Places Value null
Places Value->Stream¹ boolean
Output Output number
=> Output Stream string
Input => Output !² array
(Parameters) => Output Stream object
Input|(Parameters) => Output @³ [Value]
Parameters Value {Value}
Parameter ?Value⁴ <Value>⁶
Parameter; Parameters *Value Value^Value⁷
Input +Value Letter⁸
Value Stream!⁵ Name⁹
Notes:
!
is the display symbol for non-terminating filters type.@
denotes the empty stream. Use only when results are never expected.?
, *
, +
) have the usual meaning.!
. To add only
when non-termination is expected.<a> ≡ [a]^{a}
.Those are the type signatures for some jq builtins:
# empty :: a => @
# . :: a => a
# error :: a => !
# first(g) :: a|(a->*b) => ?b
# isempty(g) :: a|(a->*b) => boolean
# select(p) :: a|(a->boolean) => ?a
# recurse(f) :: a|(a->?a) => +a
# while(p;f) :: a|(a->boolean;a->?a) => *a
# until(p;f) :: a|(a->boolean;a->?a) => a!
# map(f) :: [a]|(a->*b) => [b]
# Generate ℕ
def seq: #:: => *number
0|recurse(.+1)
;
def seq($a): #:: (number) => *number
$a|recurse(.+1)
;
def seq($a; $d): #:: (number;$number) => *number
$a|recurse(.+$d)
;
The issue with data errors is that they are almost solely caused by the incoming JSON data. Since JQ is a “sed for JSON” it’s having very little if at all statically defined structures which would give benefit of static type checking
As a user of JQ I was rather suffering from the debugging complexity and syntax induced support-ability issues with non trivial scripts
I would appreciate examples of when you think such static checking would actually help.
I’m afraid static type information as you describe it will give little usage benefit while complicating many things.
I’d try to proceed with the solutions first describing the benefits/solved issues from the user perspective. Some of those could be solved with type info, and some could just be solved by enhancing features we have today already
I believe there are places where type information could be added transparently, like for example for some or all of the builtins, potentially allowing to optimize internal logic, speeding up execution or allowing for better errors. However, I’m still not sure if such refactoring will be worth the net outcome.
Sent from mobile device, typos and false corrections are possible
Regards, Leonid S. Usov
On 1 Aug 2019, at 10:22, pkoppstein notifications@github.com wrote:
@nicowilliams -
Since the discussion here has touched on JSON schemas, I'd like to mention the "JSON Extended Structural Schema language" (JESS), which now has some (jq) tools to go with it (see the "Modules" page on the jq wiki).
The main points I'd like to make about JESS as a schema language for JSON are:
Each JESS schema consists of one or more JSON texts. No exceptions.
JESS extends a very simple structural schema language (SSS) in which schemas mirror their target documents exactly. In other words, simple schemas are visually simple.
(E.g. the SSS schema for an array of integers is simply [ "integer"], and for objects with string-valued keys, "a" and "b": { "a":"string", "b": "string"})
JESS supports complex constraints, e.g. recursive constraints, in-document referential integrity constraints, etc, while never escaping the bounds of the JSON box.
The JESS language support for complex constraints piggy-backs off jq's built-in functions (including .[foo], .[N], .[], most of the arity-0 filters, the regex functions, range/N, etc) and jq's concept of pipelines. In fact, the JESS verification engine includes what amounts to an interpreter for a large subset of jq.
JESS has both a "nullable" and a "non-nullable" mode, to determine, for example, if types such as "string" include null.
A schema can be spread amongst multiple files (each containing a stream of valid JSON texts).
JESS has been applied "in the wild". An example of a JESS schema showing a couple of non-trivial constraints is at https://stackoverflow.com/questions/57291339/validate-each-object-key-in-the-object-array-with-json-schema-in-ajv
The JESS verification engine has been used on large datasets, e.g. the well-known citylots.json file mentioned on the jq wiki.
Since the JESS verifcation engine is written in jq, it exposes a useful API for the jq programmer (e.g. conforms_to(SCHEMA)).
One of the main reasons I mention all this here is that the utility of JESS and similar schema languages would be significantly enhanced if there were certain enhancements to jq. Most of these potential enhancements are already "well-known", so for now I'll just make a plug for one: currently, the "search" metadata for a module must be expressed statically.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
@leonid-s-usov - Your most recent posting includes a copy of my previous posting, so let me give an example that stems from Point 7 of my updated post.
Suppose one of my functions expects a certain field to contain a "Date" value according to some specification that I would like to keep in a common data repository (shared with other users, other projects, whatever). JESS as it already exists allows me to include in my jq program something like:
assert(.date | conforms_to( "Date"); "(\.) does not conform with the Date specification")
without requiring the addition of any functions besides the fixed, small set of functions already defined in the JESS module.
@leonid-s-usov for input data you want validation, and these type assertions would do that (since the compiler could not determine the type of an input, the assertion would be deferred to run-time).
Once you have input validation, however, you may still want to ensure that the rest of your code is type-safe.
OK, so we are talking runtime here. And what it is going to add is early assertion when types arent' what we expect them to be? How is the static data type helping here?
Question: will there be an additional incentive for the users to actually annotate the script apart from receiving an error that d
is not a date at the parsing stage (?) rather than at the execute stage when a date conversion function fails?
OK, so we are talking runtime here.
@leonid-s-usov - I don't see why static type analysis could not benefit the jq programmer as much as with other languages. E.g., with the Date example, if the compiler can determine from the type declarations that on entry to foo, D :: Date must be true, then it can SKIP the check for D. For this to be really useful, though, the compiler would have to understand declarations such as X :: {"date": "Date"}
so that X.date | foo
can be optimized.
Of course there are other advantages to static type checking as well, but this is well-ploughed terrain, so my suggestion would be to focus this discussion on jq specifics, or perhaps better yet to defer discussion until either some of the backlog of PRs and other issues is cleared, or until someone is ready to make the deep dive.
lol, fun read guys, but why not take it a step further and make the same arguments, for static typing, in bash shell? If you want static typing, why not use one of the many languages that support it, and preferably one that is explicitly designed around that intent. I get it though: I mean who would want to use golang, when they could add a linter to javascript and call it static typing?
In my opinion i think any type system or type checking should require no extra or as little extra syntax as possible for the casual end user to preserve to neat and terse-ness of jq which i really appreciate.
Some weeks ago i talked about this with someone and they mentioned that maybe what Erlang has around type checking could be interesting for jq: A soft-typing system for Erlang and Practical Type Inference Based on Success Typings ...funny enough i think i had both Sven and Konstantinos as lecturer when i studied in Uppsala :)
Ahh great, the Swedes are here. I hope someone brought a pole and/or some kanelbulle, because otherwise this 4+ year-old conversation may take a slightly unexpected turn. If the case, just make sure to thank him for curl and everything should be ok.
to preserve to neat and terse-ness of jq which i really appreciate.
Amen @wader!
To be serious for a second, curl is amazing, so thank you um.. Sweden. hej da
😄 i've meet Daniel a few times but i used to be colleague with his brother Björn! will forward the thanks next time i see him
Glad midsommar!
[This is just a note so I don't forget, and a place for discussion.]
One of the things I dislike about jq is that it is dynamically-typed. It's felt like jq must be dynamically typed. But what if we can add static typing in a jq-ish way? What would that look like?
Simple Data Types
Let's start simple. Suppose we have filters like
Object
orThing
which are the same as.
, but which cause it to be an error for the input to not be an{}
orThing
(whatever aThing
might be)? Then we could instantly declare the input and output data types of expressions:Initially we could prototype this as run-time typing: just define functions
Object
,Array
, and so on which raise anerror
if the input isn't an object, array, ... Now, the real trick would be to make the compiler aware of these semantics so it can check them at compile time -- then we'd have static typing. Besides more safety, static typing would allow us to have more monomorphization at compile-time, thus better performance (though to get really good performance we'd have to compile to object code). Similarly, we could have performant non-object record types that behave like [type-constrained] objects. We could also add more polymorphism without causing more confusion.What else can we do? Well, if an expression ends in a conditional, and the arms of the conditional end in type assertions, then we can haz algebraic types:
We could say that
makeFooOrBaz
has typeObject->Foo//Baz
!We haven't discussed yet how to define types, but that would be a lot like a normal zero-argument function
def
whose body consists of nothing but assertions about the type/form/contents of.
. To make it easier to define types we might have the convention that aType/1
function is one that takes a path expression and outputs just.
, and which asserts that all the values at paths matched by the path expression are of typeType
:What about Function Types?
We already saw that we could say that a function has type
InputType -> OutputType
, and that these types can be algebraic, and that we could say that a function has typeInputType -> Thing1 // Thing2
.Now, in jq function values are second-class values. Also, functions of more than zero arguments are not even second-class values. So function types should be second-class types, in some sense.
Let's think of
empty
. We need a function type forempty
. Obviously we should call the function type forempty
...Empty
! A function that outputs aThing
or anEmpty
might be said to output aThing?
!What about a function that outputs a stream of
Thing
s? We could say it's aSomeInput -> Thing[]
!What about a function that outputs an array of
Thing
s? We could say it's aSomeInput -> [Thing]
!What about a function that outputs two
Thing
values? We could say it's aSomeInput -> Thing[2]
!Now, if we want the language to understand these types, then we've definitely left the realm of what can be implemented as run-time typing, and we now need new syntax and compiler support for it.
Getting Fancy
What more could we say about functions?
For zero-argument functions I suppose we could say that they must be path expressions. I'm not sure how to do that. Maybe
Path(ValueTypeHere)
?Functions of more than zero arguments can never be even second-class values, so there's not much value to having language support for typing them unless we want polymorphism for them (i.e., multiple definitions of functions of the same name and arity, dispatching on input value and argument expression types). But that would be useful, so it's worth thinking about. And this also demonstrates that function types for zero-argument functions would be useful for that reason.
Function types for zero-argument functions would also be useful as a way to declare the types of expressions passed to functions of one or more arguments. Now we're getting fancy. I'm not sure what this would look like. Maybe:
Of course, to get polymorphism we'd need a new
def
syntax, since todaydef
replaces (obscures, really) any previous def of the same name and arity.Getting Really Fancy
So how to define types?
A type expression can be very simple: expressions that use only type functions, e.g.,
Type variables, if we have them at all, would be helpful for expanding polymorphism:
We might want a
Not/1
type expression so we can say what type something isn't.We might want a way to say that a type totally lists the paths in it and their value types:
We might be able to define behaviors for existing operators as applied to values of particular types:
What about main expressions?
Since the input to the main expression can be of any type, we might need to have an
Any
orUndetermined
orUnknown
type. Or the ability to make a type assertion a run-time assertion so it can be caught:Here the compiler would have to understand that the intent is to have
Thing
be a run-time assertion. We might need syntax for that:Conclusion
Type assertions that look like and behave like named aliases of
.
would give us a great deal of power without sacrificing the simplicity of the jq programming language or greatly complicating jq programs. We wouldn't need special syntax for declaring types. Simply peppering type assertions into expressions would allow the compiler to back-propagate type information and statically check that expressions' input/output value types agree with their surrounding expression contexts.This is just a sketch. No implementation work has occurred. The compiler would need a significant revamp to implement static typing, and complete rewrite to support compilation to object code (so we're likely never to go there).
We can flesh out this concept in the comments here on this issue. And maybe begin to think about how to implement.