influxdata / flux

Flux is a lightweight scripting language for querying databases (like InfluxDB) and working with data. It's part of InfluxDB 1.7 and 2.0, but can be run independently of those.
https://influxdata.com
MIT License

Add "dynamic" type for gradual typing #1121

Closed: nathanielc closed this 1 week ago

nathanielc commented 5 years ago

I propose that we add a dynamic type to Flux that represents a value whose type is only known at runtime. This kind of type allows for gradual typing, meaning that Flux can be both dynamically and statically typed. Where only static types are used, the Flux compiler can compile efficient, type-unaware code. Where a dynamic type is used, the Flux compiler will need to inject type checks into the runtime code.

See https://github.com/tomprimozic/type-systems/tree/master/gradual_typing for an explanation of gradual typing.

See http://siek.blogspot.com/2012/10/is-typescript-gradually-typed-part-2.html for a discussion of gradual typing in TypeScript and different levels of gradual typing.
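To make the static/dynamic split concrete, here is a minimal Python sketch (Python is itself gradually typed via annotations; the function names are hypothetical illustrations, not part of Flux): on the statically typed path no check is needed, while on the dynamic path a type check is injected before the value is used.

```python
from typing import Any

def double_static(x: float) -> float:
    # Statically typed path: the checker already knows x is a float,
    # so no runtime check is needed.
    return x * 2.0

def double_dynamic(x: Any) -> float:
    # Dynamic path: the type is only known at runtime, so a check is
    # injected before the value is used as a float.
    if not isinstance(x, float):
        raise TypeError(f"expected float, got {type(x).__name__}")
    return x * 2.0

print(double_dynamic(21.0))  # 42.0
```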

A dynamic type allows Flux to describe data that has unknown structure; then, during the data-cleaning process, type information is added, allowing for type-safe handling of the data.

For example given this script:

import "csv"

csv.from(file:"/tmp/path/to/data.csv")
    |> map(fn: (r) => ({r with _value2: r._value * 2.0})) // _value2 is a float

Initially the type of the data in the CSV file is unknown, so a dynamic type is used to represent it. Once the map operation is applied, it is known that the data has a _value column of type float. So if the _value column is later used as a string, a type error can be produced.

import "csv"

csv.from(file:"/tmp/path/to/data.csv")
    |> map(fn: (r) => ({r with _value2: r._value * 2.0})) // _value2 is a float
    |> map(fn: (r) => ({r with str: r._value + "word"})) // type error float != string

A dynamic type is only useful if we can learn enough about it to recover its static type. If dynamic types "infect" the rest of the type system such that the entire type system is reduced to being dynamically typed, then we have gained nothing.

For example all sources will have type signatures that return tables of dynamic records, since the types will not be known until the data is read from the source. (As an aside it might be possible to allow sources to inform the compiler of the types in a pseudo "static" manner but let's ignore that for now). This means that all data coming from sources starts as dynamic, meaning we know nothing about its type. The goal is to be able to apply constraints to the type of the data once we have seen how the data is used and then provide good error messages to the user about the types of their data.

For example, in the above Flux script no explicit cast was made to ensure the data was a float. Rather, we see that the data is being used as a float, and so we conclude that the data must be a float. If at runtime we discover the data is not a float, a type error is produced. If the user wants to defend against inconsistently typed input data, then an explicit cast step can be added:

For example, the float function can be used to explicitly accept any input type and convert it to a float.

import "csv"

csv.from(file:"/tmp/path/to/data.csv")
    // _value can be any type that the function `float` accepts
    // _value2 is still statically known to be a float.
    |> map(fn: (r) => ({r with _value2: float(v:r._value) * 2.0})) 
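The infer-from-use behavior described above can be sketched in Python (all names here are hypothetical illustrations, not Flux internals): the checker sees the expression `r["_value"] * 2.0`, infers that _value must be a float, and the runtime then enforces that inferred constraint on each record read from the untyped source.

```python
def map_value2(rows):
    # rows come from an untyped source (e.g. a CSV file); the constraint
    # "_value must be a float" was inferred from how _value is used below.
    out = []
    for r in rows:
        v = r["_value"]
        if not isinstance(v, float):  # injected check from the inferred constraint
            raise TypeError(f"_value: expected float, got {type(v).__name__}")
        out.append({**r, "_value2": v * 2.0})
    return out

print(map_value2([{"_value": 1.5}]))  # [{'_value': 1.5, '_value2': 3.0}]
```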

This is a rough draft of a proposal at this point and needs to be worked out before we know whether it is possible.

nathanielc commented 5 years ago

Adding a discussion about type errors: I think we should treat all type errors the same, regardless of whether we hit them at runtime or at compile time.

If a type error is encountered the compiler fails and reports the error, no Flux code is executed.

This would imply that a type error encountered at runtime, from a check on a dynamic value, results in the current runtime exiting with the type error.

Exiting a query on a runtime type error is valid behavior because the user can specifically code their Flux scripts to handle these kinds of type errors when needed, i.e., by using an explicit type conversion function and checking the result.
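The "convert explicitly and check the result" pattern might look like the following Python sketch (hypothetical helper names; in Flux this would be the `float` conversion function plus user-side handling): inconsistent input becomes a handled condition rather than an uncaught runtime type error that aborts the query.

```python
def to_float(v):
    # Explicit conversion step: accepts several input types, returning
    # None when conversion fails so the caller can decide what to do.
    try:
        return float(v)
    except (TypeError, ValueError):
        return None

rows = [{"_value": "3.5"}, {"_value": 2}, {"_value": "oops"}]
cleaned = [{**r, "_value2": to_float(r["_value"])} for r in rows]
# Rows whose value could not be converted carry None and can be filtered out.
good = [r for r in cleaned if r["_value2"] is not None]
```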

aanthony1243 commented 5 years ago

Parts of this may still cause friction with users. If we fix the type as static after first observing a record, then the typing becomes not only dynamic but non-deterministic when a column can change types between tables.

I can't believe I'm suggesting this, but I think we need something more like Perl, which to me is untyped in the sense that whatever is stored in a record, Perl will do its best to treat the value how you ask it. That is, instead of type errors, we would have conversion errors. So if we are treating a field as a string but the source sends us an int, then we encode the int as a string. Likewise, if the value is a string and we want an int, we will try to decode it as an int, but if we fail, then that's a runtime error.

With this approach, the type system once again becomes static (by context of the expression) and yet we retain the best possible flexibility for unknown record types.
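A minimal Python sketch of this Perl-style coercion idea (the `coerce` helper is hypothetical): the context supplies the demanded type, the runtime tries to encode the value as that type, and failure surfaces as a conversion error rather than a type error.

```python
def coerce(value, want):
    # Try to encode the value as the type the expression context demands;
    # a failed conversion is the runtime error, not a type mismatch.
    try:
        return want(value)
    except (TypeError, ValueError):
        raise ValueError(f"cannot convert {value!r} to {want.__name__}")

print(coerce(42, str))    # int used as a string -> "42"
print(coerce("17", int))  # string used as an int -> 17
# coerce("oops", int) would raise a conversion (runtime) error
```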

The outlier here is what to do if a record is not present, but we have discussed this separately and there's a good plan to have an exists function that can be applied to a record.

aanthony1243 commented 5 years ago

on top of this, I want to do away with the int/float distinction, again encoding those values appropriately as context dictates (e.g. bit shift is intended for integers, division is for floats, modulo is for ints, writes to a data sink can have explicit type demands, etc.)

nathanielc commented 5 years ago

Also see this https://en.wikipedia.org/wiki/Flow-sensitive_typing

aanthony1243 commented 5 years ago

I think flow typing might satisfy what I'm asking for, basically, that the type of a variable is static only within its current scope. The type is carried over from the parent scope, but the context of the new scope may change the type if conversion is possible. I guess instead of our type system determining that _value2 "is-a" float, I want the system to determine that _value2 "can-be" a number or a string.

similarly, an object {a:1.0, b:2} "can be" of type {a:float, b: float} or {a:string, b: int} among other options.

maybe the correct terms are "was-a", "can-be" and "must-be" instead of "is-a". then "was-a ???" can be deemed a valid response, giving us the concept of dynamics in a consistent way. Finally, type inference is the process of tightening, as much as possible, cases where we can say "was-a" == "can-be" == "must-be"
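Flow-sensitive typing of the kind linked above can be illustrated in Python (a hypothetical sketch; static checkers such as mypy perform this narrowing on annotations): a value "can-be" any member of a union, but inside a scope guarded by a type test it "must-be" the narrowed type.

```python
from typing import Union

def describe(v: Union[int, str]) -> str:
    # v "can-be" an int or a str here.
    if isinstance(v, str):
        # In this branch v "must-be" a str, so str methods are safe.
        return v.upper()
    # Here v "must-be" an int.
    return str(v * 2)

print(describe("ab"))  # AB
print(describe(3))     # 6
```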

jpacik commented 5 years ago

I don't really see what a dynamic type gets us, to be honest. The examples given can all be statically type inferred. For instance, the generalized type of csv.from can be given by:

forall ['r] where 'r:Rec (file: str) -> ['r]

In other words, csv.from takes a string and returns a list of records. Similarly the type of map is:

forall ['r, 's] where 'r:Rec, 's:Rec (tables: ['r], fn: (r: 'r) -> 's) -> ['s]

The following query would fail type unification (at compile time) with the exact same type error:

import "csv"

csv.from(file:"/tmp/path/to/data.csv")
    |> map(fn: (r) => ({r with _value2: r._value * 2.0})) // _value2 is a float
    |> map(fn: (r) => ({r with str: r._value + "word"})) // type error float != string

For this query:

import "csv"

csv.from(file:"/tmp/path/to/data.csv")
    |> map(fn: (r) => ({r with _value2: r._value * 2.0})) // _value2 is a float

r._value would be inferred to be a float, which means csv.from would be instantiated to the following monotype at the call site:

(file: string) -> { _value: float | r }

This information could be passed to the function itself so that csv.from could perform the runtime type check that _value must be a float.
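A hypothetical Python sketch of that last idea (not Flux's actual implementation): the monotype inferred at the call site, e.g. `{_value: float}`, is handed to the source, which checks each record as it is read, instead of scattering checks at every use site.

```python
import csv
import io

def csv_from(text, schema):
    # schema is the inferred monotype passed down from the call site,
    # mapping column name -> expected type. CSV fields arrive as strings,
    # so the check doubles as a conversion; failure is a type error.
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        out = {}
        for col, typ in schema.items():
            try:
                out[col] = typ(row[col])
            except (KeyError, ValueError):
                raise TypeError(f"column {col!r}: expected {typ.__name__}")
        yield out

data = "name,_value\na,1.5\nb,2.5\n"
rows = list(csv_from(data, {"_value": float}))
print(rows)  # [{'_value': 1.5}, {'_value': 2.5}]
```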


github-actions[bot] commented 2 weeks ago

This issue has had no recent activity and will be closed soon.