Proposal: Unit type checking

michaeldickens commented 6 days ago

(DO NOT MERGE)

This is a demo PR to illustrate how unit type checking might work (see #3309). If Squiggle maintainers like my proposal, I can finish implementing it. I'm also fine with tabling this PR if maintainers think it would take too long to agree on a good implementation. I do think it would be really nice to have unit typing, it could prevent a lot of bugs.

This PR is only a demo, it has various missing parts and the code quality isn't consistent. See "Open questions" below.

Syntax

Squiggle unit types are declared using approximately the same syntax as TypeScript. Example:

x = 1  // implicit type
y: kg = 1  // type is `kg`

A type signature consists of one or more identifiers separated by zero or more * symbols, followed by zero or more identifiers separated by zero or more / symbols. In other words, type signatures must match the pseudo-regex \identifier(*\identifier)*(/\identifier)*.

Some more examples of type signatures:

speed: m/s = 1
force: kg*m/s/s = 2
work: kg*m*m/s/s = 3

Semantics

The parser type-checks variable definitions and expressions.

For this demo, the type checking rules are fairly simple:

Any variable without a declared type is simply assumed to be the correct type, even if that's not logically possible.
An expression that multiplies or divides two variables with known types infers the type of the result. For example, if x has type kg and y has type liter, x/y has type kg/liter.
If two values with known types are combined using an infix operator other than * or /, the type checker enforces that the two values have the same type. For example, you can only add/subtract two variables if they have the same type (or if at least one of them does not have a declared type).
If a typed variable is assigned to a value with a known type, the type checker enforces that the value has the correct type, and raises a type error if not.

Implementation details

I implemented type checking in the way that looked best to me, but Squiggle maintainers may have a better idea of how to do it.

In parse.ts:parse(), after generating the AST, I added a call to the typeCheck() function, which is defined in a new file typeChecker.ts.

Open questions

There's definitely more that needs to be implemented to make this PR viable, but I didn't want to make too many design decisions without consulting the Squiggle team.

Should the type checker do type inference, or should it simply ignore any variable without a type (like TypeScript's any)? Type inference is more work but it looks doable to me. But it's not obvious that type inference is actually desirable from a design standpoint. (I lean toward yes.)
How to treat numeric literals? I see two reasonable possibilities: (1) let them be any type, or (2) treat them as unitless unless directly assigned to a variable (e.g. in x: kg = 1, the 1 should be treated as having type kg, not unitless). Treating them as unitless would help catch more type errors, but it also means you'd need a syntax for type conversion. For example, if you have a distance literal and a time variable, converting them to velocity requires declaring the type of the literal, such as usainBoltSpeed: m/s = (100 as m) / raceTime.
Is my proposed syntax ok? There are ways it might be improved, for example allowing parentheses and/or exponents: work: (kg*m^2)/(s^2) = 1
Do probability distributions require any special treatment? I believe they can be treated just like numeric values.
What syntax should function definitions use? I'm thinking it could use basically the same syntax as TypeScript: func(x: type): type = ...
How to type arrays and dicts? Should Squiggle require that the values in an array/dict all have the same type? What should the type signature syntax be? Again, it can probably be the same as TypeScript.
It would be good for SquiggleHub to have a good way of showing types. How should squiggle-lang supply type info make that work? (SquiggleHub does some stuff with GraphQL that I don't understand but I haven't looked into it much.)
How to handle type errors? Should they be a subclass of ICompileError, or should they be handled separately? Does the frontend need some special way to consume them?

changeset-bot[bot] commented 6 days ago

⚠️ No Changeset found

Latest commit: e836e4d721a6757809cd9b339a319498bb30041a

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

vercel[bot] commented 6 days ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Updated (UTC)
quri-ui	✅ Ready (Inspect)	Visit Preview	Jun 24, 2024 8:04pm
squiggle-components	✅ Ready (Inspect)	Visit Preview	Jun 24, 2024 8:04pm
squiggle-website	✅ Ready (Inspect)	Visit Preview	Jun 24, 2024 8:04pm

vercel[bot] commented 6 days ago

@michaeldickens is attempting to deploy a commit to the Quantified Uncertainty Team on Vercel.

A member of the Team first needs to authorize it.

berekuk commented 3 days ago

Quick comment based on the overall idea, I haven't looked at the implementation carefully yet.

Thanks for doing this! This is definitely something we discussed and want to support.

My main problem with the syntax that you picked is that besides unit type checking, we'll also want to implement other type checks in the future.

For this reason I'd strongly prefer if we used something else than : to provide the unit portion of the type.

This is easy to fix if we go with :: instead of :, for now:

speed :: m/s = 1

The design for actual types (that would allow to express other value types, e.g. functions or lists or whatever) is still not finalized and will probably take a while, but note that we already have annotations in function signatures.

Units are obviously not types, because:

they can't express any types other than numbers
their names are too free-form, so you can't refer to type or variable names in type signatures

Btw, this is tangential, but my current plan is to eventually support domains everywhere:

domains are values (can be constructed in runtime) that can be used in type signatures
domains are generics from which compile-time types can be inferred
domains also do runtime checks when compile-time checks are too difficult
we could also have syntax sugar because full domain definitions are too clunky; similar to how in function signatures we already upgrade [5, 10] to Number.rangeDomain(5, 10); units could also be extended in the similar way, x :: m/s -> x: Number.unit("m") / Number.unit("s")

With domains, we could start with completely runtime type-checks and gradually push the boundary towards compile-time checks where it's possible; but it won't be possible in all cases.

This also means that for now I don't really care if unit type checking or any other type checking is done in the compile-time. It's fine to prototype type checking in runtime, and then optimize later (if we pay attention and avoid features that are too powerful and can't be optimized later).

berekuk commented 3 days ago

Now on these questions:

Should the type checker do type inference, or should it simply ignore any variable without a type (like TypeScript's any)? Type inference is more work but it looks doable to me. But it's not obvious that type inference is actually desirable from a design standpoint. (I lean toward yes.)

Type inference sounds very useful, and as I said above, its lack is dangerous because of backward compatibility problems.

I'm leaning towards "let's just infer in runtime", though.

How to treat numeric literals? I see two reasonable possibilities: (1) let them be any type, or (2) treat them as unitless unless directly assigned to a variable (e.g. in x: kg = 1, the 1 should be treated as having type kg, not unitless). Treating them as unitless would help catch more type errors, but it also means you'd need a syntax for type conversion. For example, if you have a distance literal and a time variable, converting them to velocity requires declaring the type of the literal, such as usainBoltSpeed: m/s = (100 as m) / raceTime.

Is my proposed syntax ok? There are ways it might be improved, for example allowing parentheses and/or exponents: work: (kg*m^2)/(s^2) = 1

It's fine if we don't support these for now.

Do probability distributions require any special treatment? I believe they can be treated just like numeric values.

I'm not sure. I think units syntax is fine for dists if we store types separately (x : type :: units = ..., as I said in my previous comment), but I haven't thought about this too much.

What syntax should function definitions use? I'm thinking it could use basically the same syntax as TypeScript: func(x: type): type = ...

How to type arrays and dicts? Should Squiggle require that the values in an array/dict all have the same type? What should the type signature syntax be? Again, it can probably be the same as TypeScript.

I don't consider units to be full types, and doing both at the time would be too complicated; see my comment on domains above.

It would be good for SquiggleHub to have a good way of showing types. How should squiggle-lang supply type info make that work? (SquiggleHub does some stuff with GraphQL that I don't understand but I haven't looked into it much.)

GraphQL doesn't seem relevant here; runtime inference based on tags would be very helpful.

How to handle type errors? Should they be a subclass of ICompileError, or should they be handled separately? Does the frontend need some special way to consume them?

ICompileError seems fine for this, can be improved later. But, again, if we do runtime inference, then it's going to be just a normal runtime error, for now.

michaeldickens commented 3 days ago

@berekuk Based on your comments, here's my to-do list. Let me know if I should add anything.

Update unit type syntax to not conflict with (future) normal type syntax.
Implement type inference.
Treat numeric literals as any unit type. (Note: This will create backward compatibility issues if we decide later to make literals unitless.)
Add unit type syntax to the prettier plugin.
Unit type checking on distributions.
Minor fixes e.g. delete unused code.

Also: Can you explain what you mean by "domain"? And relatedly, why do you think it's too complicated to do unit types for function definitions and arrays/dicts?

berekuk commented 39 minutes ago

Update unit type syntax to not conflict with (future) normal type syntax. Implement type inference.

I think that tracking unit info in runtime, through tags, would be easier for now, and would give us same benefits as full type inference at the cost of a slight performance overhead. But either approach is fine, and we'll want compile-time type checks and inference eventually. I just think it's more work for you.

Full type inference is especially tough, so I'll explain my priorities for it:

some solution for units in nested blocks is a blocker for merging this PR (worst case scenario, just ban unit syntax in blocks entirely)
type inference on trivial assignments and arithmetic operations is also nice for the future backward compatibility, but I would agree to merge if (1) is fixed
type inference on generic functions is

For clarity, in (1), I mean this:

x = {
  y :: kg = 1
  z :: m = y
  y + z
}

This would be quite unexpected for users, the feature looks like it should work but it doesn't do anything, and it's a footgun. So I'd like this to be supported before we merge.

In (2) I mean this:

x :: kg = 1
y = x

And this:

x :: m = 1
y :: s = 2
z = x / y    /* m/s? */

And in (3) I mean this:

f(a) = a
x :: kg = 1
y = f(x)

This one is the most difficult, and you definitely shouldn't attempt it now; some languages (Haskell and OCaml) are smart enough to infer that f maps any generic 'a to 'a, so it would infer that y has the same type as x. But this would require something like full Hindley–Milner.

Or, you could get all of (1), (2) and (3) for free, and also easier access to the unit information during render, if we do runtime tracking with tags.

berekuk commented 23 minutes ago

Also: Can you explain what you mean by "domain"?

The term "domain" is briefly mentioned in the parameter annotation documentation, but not very well explained.

The idea is that any check ("is this a valid value to use here?") can include runtime checks ("does the value belong to the [10, 20] range?") and compile-time checks ("is this a number?"). We plan to bundle both of these checks in a single "domain": a meta-value that describes both components.

Unlike in Typescript, domains are values, not a separate namespace of "type names". They can be constructed in runtime, e.g. with functions, passed around as parameters, and so on.

xDomain = Number.rangeDomain({ min: 5, max: 10 })
x: xDomain = 7

Here, xDomain type would be, let's say, Domain<Number>, and when you annotate x: xDomain, Squiggle would infer that x type would be Number. So it would do the compile-time check that x is a number, by checking that 7 is also a number.

In addition to that, Squiggle would also do the runtime check to make sure that 7 is in [5, 10] range.

The currently supported f(x: [5, 10]) = ... syntax for parameters annotation is a syntax sugar on top of domains: [5, 10] annotation used after : gets upgraded to the domain. This would work in the same way for variable declarations, x: [5, 10] = 7.

We will support other annotation shortcuts in the future, because full domain constructors are too verbose. Also, common types will also become domain constructors, e.g. x: Number = 7. Number is a function, but the "annotation -> domain" syntax sugar layer will upgrade it to the domain, so it's the same as, let's say, Number.anyValueDomain().

This is the very preliminary plan for the future type system in Squiggle. It's going to be somewhat complicated to implement, because it's heavy on generics, but it seems doable. I'm not sure when we'll get around to implementing all this, tbh.

quantified-uncertainty / squiggle