vapourlang / vapour

Typed superset of R
http://vapour.run
Apache License 2.0

Redefine the role of `NA` #62

Open lhdjung opened 6 days ago

lhdjung commented 6 days ago

Hi, this is an amazing project! Here is an idea that might go a bit against your new na type, so feel free to close if out of scope.

Vapour might be an opportunity to more precisely define the scope of NA by (only) having it represent unknown values. In R, NA moonlights as a value that signifies absence, much like NULL. I think this is quite unfortunate because the interpretation is very different – a real but unknown value versus one that is known not to exist. For example:

# Real but unknown
c(1, 3, NA, 10)
#> [1]  1  3 NA 10

# Known not to exist
letters[27]
#> [1] NA

Created on 2024-09-16 with reprex v2.1.1

Eliminating the second case would have obvious benefits, but doing so while keeping the first case would also disambiguate the interpretation of NA as an unknown value.

Maybe your na is more like NULL, but I think native representation of unknown values is a great asset of R. It makes sense to model uncertainty about data in the language. However, this requires missing value placeholders to have the same type as the known values in the same (atomic) vector.

Somewhat related to https://github.com/vapourlang/vapour/issues/12 – if NA is part of an atomic vector, it is an unknown value of the same type as that vector. Therefore, if this part is meant to work as in R (which it might not be), an ideal type inference scheme would not depend on whether a vector contains NA.
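As a small illustration of NA taking the type of its vector, here is a base R sketch. The variable names are only illustrative.

```r
# NA is typed: each atomic vector type has its own missing value
ints <- c(1L, 3L, NA, 10L)
dbls <- c(1.5, NA)
chrs <- c("a", NA)

typeof(ints)      # "integer": the NA does not change the vector's type
typeof(ints[3])   # "integer": the missing element is NA_integer_
typeof(dbls[2])   # "double"
typeof(chrs[2])   # "character"

# A bare NA is logical, but it is coerced to the type of the surrounding vector
typeof(NA)                       # "logical"
identical(ints[3], NA_integer_)  # TRUE
```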

JohnCoene commented 6 days ago

Thank you for the kind words!

Vapour is very young and will fall short of a lot of things. What you describe there is something I admittedly did not consider.

I don't think a superset of a language could really handle that in the manner you describe, but I might be wrong, or I might simply be misunderstanding your comment.

The first vector you define would be defined in Vapour with

let x: int | na = (1, 3, NA, 10)

The transpiled code would work as expected, where NA indeed defines a "known" missing value. I honestly think the na and null types in Vapour are fine.

However, your second point is indeed something the language should aim to tackle. Vapour can't prevent such a case in the resulting R code, but it could prevent writing code that results in the error.

The only thing that comes to mind right now would be to define the length (or max length) of a type, like we define "array of size x" in other languages. Vapour could then check that we do not try to access a value outside of the specified range. It would make for quite a change in the syntax but may be worth it.

This might link to #60, where we'd probably need to generate sensible default values for native types, which is doable (Go does this, for example).

However I'm not sure how practical this would be given we rarely know the length of vectors, lists, etc. at build time. When using letters we do but when reading data from a database or elsewhere we seldom do. If you have any ideas I'm totally open.

Your point raises another problem I had not foreseen, though: technically, every time we access an item at a certain index we should actually expect a different type.

E.g.: Vapour will not flag anything wrong with the code below

let x: int = (1, 2, 3)

x = x[4]

But technically every time we access something like this we should expect either NA or NULL

let vec: int = (1, 2, 3)
let vec2: int | na = vec[4]

type lst: list { int | na }

let theList: lst = lst(1, NA , 2)
let theList2: lst | na = theList[4]

With the current type logic and syntax we should write the above and it definitely isn't right.

lhdjung commented 5 days ago

Thanks for your thoughtful response.

It makes perfect sense that a superset of R might not be the right place to solve the issues I was trying to get at (see below, "More on NA in R") because it will need to accommodate existing R code.

One possible way to sneak in improvements to transpiled R code might be adding attributes and defining their behavior, as in the following. Then again, that might not fit a language superset, it would add runtime overhead, and this particular code is obviously just a rough sketch.

x <- 1:5

# Problematic
x[10]
#> [1] NA

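`[.vapour_object` <- function(x, index) {
  if (is.numeric(index) && index > length(x)) {
    stop(paste0(
      "Index out of bounds. The index is ",
      index,
      " but the object to be subset only has length ",
      length(x),
      "."
    ))
  }
  # Caveat: `x` still has class "vapour_object" here, so this call
  # dispatches to `[.vapour_object` again rather than to the built-in `[`
  x[index]
}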

x <- structure(1:5, class = c("vapour_object", "integer"))

# Better
x[10]
#> Error in `[.vapour_object`(x, 10): Index out of bounds. The index is 10 but the object to be subset only has length 5.

# Equivalent to the above definition of `x`
x <- 1:5
class(x) <- c("vapour_object", class(x))

x[10]
#> Error in `[.vapour_object`(x, 10): Index out of bounds. The index is 10 but the object to be subset only has length 5.

Created on 2024-09-17 with reprex v2.1.1

(Edit: added x[index] as return value if the index is within bounds, but this can fail in practice. Maybe NextMethod() would work better?)
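On the NextMethod() question, here is a minimal sketch of how that variant might look, assuming plain base R S3 dispatch. The argument name i, the vectorised bounds check, and restoring the class on the result are choices made for illustration and are not part of the sketch above.

```r
# Hypothetical variant: NextMethod() hands the in-bounds case to the
# built-in `[` instead of dispatching back to this method
`[.vapour_object` <- function(x, i) {
  if (!missing(i) && is.numeric(i) && any(i > length(x), na.rm = TRUE)) {
    stop(
      "Index out of bounds: the largest index is ", max(i, na.rm = TRUE),
      " but the object to be subset only has length ", length(x), ".",
      call. = FALSE
    )
  }
  out <- NextMethod()
  # Restore the class so that later subsetting of the result is checked too
  class(out) <- oldClass(x)
  out
}

x <- structure(1:5, class = c("vapour_object", "integer"))

# In-bounds access returns the selected elements
as.integer(x[2:3])

# Out-of-bounds access raises an error; capture the message for inspection
res <- tryCatch(x[10], error = function(e) conditionMessage(e))
res
```

Only positive numeric indices are checked here. Negative and logical indices fall through to the default method unchanged.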

More on NA in R

To clarify, I thought it would be ideal for a language like R to use NA only as an "explicit missing value", as in c(1, 3, NA, 10), and not as a nullish value, as returned by letters[27]. The first case is an unknown presence (we don't know which number NA conceptually represents), but the second is a known absence (we do know that there is no 27th letter).

Some languages complain at runtime that something is amiss when indexing out of bounds. Python throws an error, and Rust panics. I think this kind of behavior is better per se because it's much safer and clearer. Subsequent code might not be prepared to handle NA.

However, using NA as a return value of letters[27] is arguably even worse than returning NULL would be, because it creates needless ambiguity when encountering NA: is it an empirically unknown value because, e.g., we could not measure the third value in the series of tests represented by c(1, 3, NA, 10); or is it the result of a programming issue like indexing out of bounds?

As I understand your example with int | na, Vapour defines na as an alternative to the other types. Yet what is great about NA in R is that it is baked into the definition of all the basic atomic vector types. For example, any logical value could be TRUE, FALSE, or NA. This type actually implements Kleene's three-valued logic, so NA is inherent in the type definition.

Note that the type is called "logical", not "Boolean"; it never implemented two-valued logic to begin with, and there is no separate Boolean type. The inherence of NA enables users to encode a lack of knowledge about certain values – the main purpose of NA.
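The three-valued behavior can be checked directly in base R. This is only a short illustration using the built-in logical operators.

```r
# NA behaves as "unknown" in logical operations (three-valued logic)
NA & FALSE   # FALSE: the result is FALSE whatever the unknown value is
NA & TRUE    # NA: the result depends on the unknown value
NA | TRUE    # TRUE: the result is TRUE whatever the unknown value is
NA | FALSE   # NA
!NA          # NA

# A bare NA is itself of type logical
typeof(NA)   # "logical"
```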

JohnCoene commented 5 days ago

Apologies if I expressed myself poorly, but I did understand your point originally; I understand the distinction between NULL and NA.

I think @jonocarroll nails it in #67: NA essentially should not be a type, it's just a value that should be accepted in many places. Indeed, come to think of it, in R it isn't a type but a value of a type: NA_character_, NA_integer_, ... I'm not sure whether this is what Jonathan says or my interpretation, so feel free to correct me :)

null though should probably remain a type.

jonocarroll commented 5 days ago

That was the point I was making, yes - if you do want to handle the case of "integer argument which is not NA" then I think you end up going down the 'dependent types' or 'structural types' path where types can depend on values. I explored that a bit in https://github.com/jonocarroll/nonempty but I suspect it significantly increases the complexity. Adding to this is the fact that an argument is a vector, so the concept of "is not NA" is blurred by the any/all distinction.
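To make the any/all distinction concrete, here is a small base R sketch. The two predicate functions are hypothetical names, standing for two possible readings of "integer argument which is not NA".

```r
x <- c(1L, NA, 3L)

anyNA(x)        # TRUE: at least one element is missing
all(is.na(x))   # FALSE: not every element is missing

# Reading 1 (hypothetical): no element may be NA
no_na_anywhere <- function(v) is.integer(v) && !anyNA(v)

# Reading 2 (hypothetical): the vector must not consist only of NA
not_entirely_na <- function(v) is.integer(v) && !all(is.na(v))

no_na_anywhere(x)    # FALSE
not_entirely_na(x)   # TRUE
```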

I think I'm on the side of (1:3)[4] being entirely valid type-wise (it produces NA_integer_), so I don't think int | na is necessary anywhere. int | null, however, is getting close to a Maybe (or Option) definition and can be extremely useful, and being able to construct a newtype with that definition could be powerful.
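A rough sketch of the Maybe-like idea in plain R, assuming a hypothetical helper maybe_at() for a single positive index. It returns NULL for "no such element" and leaves NA to mean "element exists, value unknown".

```r
# Hypothetical helper: NULL signals absence, NA stays an unknown value
maybe_at <- function(x, i) {
  if (i >= 1 && i <= length(x)) x[[i]] else NULL
}

vals <- c(1L, NA, 3L)

maybe_at(vals, 1)            # 1
maybe_at(vals, 2)            # NA: the element exists but its value is unknown
is.null(maybe_at(vals, 4))   # TRUE: there is no fourth element
```

The caller then has to handle the NULL case explicitly, which is the part an int | null type could enforce at build time.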

lhdjung commented 3 days ago

@jonocarroll If (1:3)[4] silently returns NA, this can easily lead to problems at runtime. I think throwing an error or a warning would be more desirable in principle, but it's unclear whether a superset of R is the right place for introducing this.

Entirely agree that an option type would be awesome!

jonocarroll commented 3 days ago

But type-wise it's correct. I suspect it's impossible to introduce a BoundsError at the Vapour level because of runtime things like

x <- 1:10
y <- read.csv("data.csv")
z <- x[nrow(y)]

Type-wise, all of that checks out, but whether or not the external data has a sufficient number of rows is extremely runtime-dependent.