nuprl / MultiPL-E

A multi-programming language benchmark for LLMs
https://nuprl.github.io/MultiPL-E/

R unit test comparison between integer and double #55

Closed PootieT closed 1 year ago

PootieT commented 1 year ago

This is a subtle difference, but it causes at least 10% of the R unit test failures I have seen in HumanEval:

Example: HumanEval_60_sum_to_n:

sum_to_n <- function(n) {
  return(sum(0:n))

}
test_humaneval <- function() {
candidate <- sum_to_n
    if(!identical(candidate(1), 1)){quit('no', 1)}
    if(!identical(candidate(6), 21)){quit('no', 1)}
    if(!identical(candidate(11), 66)){quit('no', 1)}
    if(!identical(candidate(30), 465)){quit('no', 1)}
    if(!identical(candidate(100), 5050)){quit('no', 1)}
}
test_humaneval()

This fails because sum(0:n) returns an integer (the : operator produces an integer vector), while the expected value in the test, e.g. 1, is a double literal. The identical comparator requires both values to have the same type, but in these cases the type should not matter.

My suggestion is to replace identical with the == comparator, but only for these single numeric value comparisons. In this case:

test_humaneval <- function() {
candidate <- sum_to_n
    if(!(candidate(1) == 1)){quit('no', 1)}
    if(!(candidate(6) == 21)){quit('no', 1)}
    if(!(candidate(11) == 66)){quit('no', 1)}
    if(!(candidate(30) == 465)){quit('no', 1)}
    if(!(candidate(100) == 5050)){quit('no', 1)}
}

@mhyee Maybe you have an idea how to fix this in the transpiler code? CC @arjunguha

mhyee commented 1 year ago

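One hacky approach might be to check whether right (which is a string at this point) contains c( or list(, and to keep identical only in those cases. This is the line that currently generates the comparison: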

return "    if(!identical({}, {}))".format(left, right) + "{quit('no', 1)}"

https://github.com/nuprl/MultiPL-E/blob/main/dataset_builder/humaneval_to_r.py#L108

A completely different approach is to translate literals differently: in Python, numeric literals default to integers (so 1 is an integer and 1.0 is a float), but in R, numeric literals default to floats (so 1 is a double and 1L is an integer). But this would change all the tests completely.
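To make the literal-translation idea concrete, here is a minimal Python sketch. The helper name to_r_literal is hypothetical and is not taken from the MultiPL-E code; it only illustrates how a Python int could be emitted as an R integer literal with the L suffix while floats stay doubles.

```python
def to_r_literal(value):
    """Render a Python scalar as an R literal (hypothetical helper, not MultiPL-E code)."""
    # bool must be checked before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, int):
        # The L suffix makes R parse the literal as an integer instead of a double.
        return f"{value}L"
    if isinstance(value, float):
        # Finite floats only; inf and nan would need special handling for R.
        return repr(value)
    raise TypeError(f"unsupported literal type: {type(value).__name__}")


print(to_r_literal(21))    # 21L
print(to_r_literal(21.0))  # 21.0
```

With this kind of translation, a Python test expecting 21 would compare against 21L in R, which is why every generated test would change.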

PootieT commented 1 year ago

Thanks @mhyee ! I will do the hacky approach for now on my local branch!

FYI, for the hacky check condition, one would need to check

if not any([right.startswith(w) for w in ["c(", "list(", "NULL"]]):

since NULL comparison requires identical().
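As a rough sketch of how this check could sit around the format line quoted earlier: the function name r_assertion and the constant STRUCTURED_PREFIXES are hypothetical, and this is an illustration of the hacky approach under those assumptions, not the code on any branch.

```python
# Prefixes for which the generated test keeps identical(); taken from the check above.
STRUCTURED_PREFIXES = ("c(", "list(", "NULL")


def r_assertion(left, right):
    """Build one R test line from two R expression strings (hypothetical helper)."""
    if right.startswith(STRUCTURED_PREFIXES):
        # Vectors, lists, and NULL: keep the strict identical() comparison.
        return "    if(!identical({}, {}))".format(left, right) + "{quit('no', 1)}"
    # Single values: == does not distinguish integer from double.
    return "    if(!({} == {}))".format(left, right) + "{quit('no', 1)}"


print(r_assertion("candidate(6)", "21"))
print(r_assertion("candidate(3)", "c(1, 2, 3)"))
```

The first call produces an == comparison and the second keeps identical, so only single-value expectations are relaxed.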

arjunguha commented 1 year ago

@mhyee should we change all tests completely? It's not a big deal to do so.

arjunguha commented 1 year ago

@PootieT any reason not to use == everywhere like you suggest? I guess it's more lax about types, but in these benchmarks we know that we shouldn't be doing heterogeneous comparisons.

mhyee commented 1 year ago

I'm actually not sure we should change the tests to use integers. I think most people just use doubles by default, e.g. nobody writes c(1L, 2L, 3L). However, the : operator produces integer vectors.

We can't use == for everything, because if (x == y) is an error when x or y has more than one element.

But I'm looking at the docs, and we could use some combination of ==, all, and isTRUE. There's also all.equal, which does "near equality" (useful for doubles). Let me think about it a bit more.

PootieT commented 1 year ago

Yeah, so far the hacky solution works for me locally, and I don't see any additional weird cases where the program looks correct but fails the unit test. BUT, one case where it can fail is when two lists mix integers and doubles: identical would fail for list(1L, 2) and list(1, 2), for instance.

One additional reason we can't use == everywhere is that NULL == NULL yields logical(0) rather than a single boolean (and == is not reliable between lists).

I think we can probably use all.equal, since it seems to disregard the type of each element when comparing numerical values and also handles NULL:

all.equal(list(NULL,1), list(NULL,1))    # returns TRUE
all.equal(list(1L,2), list(1,2))    # returns TRUE
all.equal(c(1L,2), c(1,2))    # returns TRUE

mhyee commented 1 year ago

I think we can use isTRUE(all.equal(x, y)) for all comparisons. That keeps things simpler, instead of trying to detect what the values are and then changing the comparison function.

> isTRUE(all.equal("abc", "def"))
[1] FALSE
> isTRUE(all.equal("abc", "abc"))
[1] TRUE
> isTRUE(all.equal(1:2, c(1,2)))
[1] TRUE
> isTRUE(all.equal(1L, 1.0))
[1] TRUE

I'll push a fix to the PR.
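For illustration only, the generator line quoted earlier in the thread could be adapted along these lines. The function name r_assertion_all_equal is hypothetical, and this sketch is not the code from the PR.

```python
def r_assertion_all_equal(left, right):
    """Build one R test line that compares with isTRUE(all.equal(...)).

    Hypothetical helper: left and right are R expressions already rendered as strings.
    """
    # all.equal returns TRUE or a description of the differences, so isTRUE is
    # needed to turn the result into a single logical value for if().
    return "    if(!isTRUE(all.equal({}, {})))".format(left, right) + "{quit('no', 1)}"


print(r_assertion_all_equal("candidate(6)", "21"))
```

Because the same wrapper is used for every comparison, the generator does not need to inspect whether the expected value is a scalar, a vector, a list, or NULL.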

PootieT commented 1 year ago

looks good!