nuprl / MultiPL-E

A multi-programming language benchmark for LLMs
https://nuprl.github.io/MultiPL-E/

Racket unit test numerical equivalence #60

Closed PootieT closed 1 year ago

PootieT commented 1 year ago

Example program: HumanEval_99_closest_integer

This is the current test:

    (check-equal? (candidate "14.5") 15)

which outputs:

    --------------------
    FAILURE
    name:       check-equal?
    location:   problem.rkt:27:4
    actual:     15.0
    expected:   15
    --------------------

Here are some alternatives we may consider:

    (check = (candidate "14.5") 15)
    (check-= (candidate "14.5") 15 0.01)
    (check-within (candidate "14.5") 15 0.01)

All three pass on this input. The first compares the values numerically with =, while the second and third check that they are equal within a small tolerance (0.01 here).
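The failure comes down to exactness: equal? treats the inexact 15.0 and the exact 15 as different values, while = compares them numerically. Below is a minimal sketch of that difference. The candidate function is a hypothetical stand-in for a generated solution (not taken from the benchmark), written so that it returns an inexact number the way the failing program did.

```racket
#lang racket/base
(require rackunit)

;; Hypothetical stand-in for a model-generated solution to
;; HumanEval_99_closest_integer. Adding 0.5 makes the result inexact,
;; so "14.5" yields 15.0 rather than 15.
(define (candidate s)
  (define n (string->number s))
  (if (>= n 0)
      (floor (+ n 0.5))
      (ceiling (- n 0.5))))

;; equal? distinguishes inexact 15.0 from exact 15, so check-equal?
;; against 15 would report a failure.
(check-false (equal? (candidate "14.5") 15))

;; = compares numerically, ignoring exactness.
(check-true (= (candidate "14.5") 15))

;; check-= and check-within take a tolerance as their last argument.
(check-= (candidate "14.5") 15 0.01)
(check-within (candidate "14.5") 15 0.01)
```

Run as a module, this prints nothing, since every check passes.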

arjunguha commented 1 year ago

Agreed. But, we may need to generalize this to work on lists of numbers as well.

PootieT commented 1 year ago

It seems check-within can also compare lists:

    (check-within (list 0 2.0 3 5 9 123) (list 0 2 3 5 9 123) 0.01) ; passes

There is one odd case, though: one program returned a set whose elements all matched the expected output, but the expected output is a list. No current checking method treats the two values as equal. Perhaps that is for the best.

    (check-match (set 0 2 3 5 9 123) (list 0 2 3 5 9 123)) ; does not pass

arjunguha commented 1 year ago

Conveniently, it seems like check-within supports heterogeneous lists too:

    Welcome to Racket v8.2 [cs].
    > (require rackunit)
    > (check-within '("hi" 2) '("hi" 2.001) 0.05)
    > (check-within '("hi" 2) '("hi" 2.1) 0.05)
    --------------------
    ; FAILURE [,bt for context]
    name:       check-within
    location:   readline-input:3:0
    actual:     '("hi" 2)
    expected:   '("hi" 2.1)
    --------------------

So we should be able to replace check-equal? with check-within.
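To make the comparison rule concrete, here is a rough sketch of the behavior observed in this thread, written as a plain predicate. approx-equal? is a hypothetical helper, not a rackunit function, and it only approximates what check-within does: real numbers match when they differ by at most the tolerance, and everything else is compared structurally.

```racket
#lang racket/base
(require racket/set rackunit)

;; approx-equal? is a hypothetical helper, not part of rackunit.
;; Real numbers match when they differ by at most eps; all other
;; values are compared structurally, recurring into compound data.
(define (approx-equal? a b eps)
  (if (and (real? a) (real? b))
      (<= (abs (- a b)) eps)
      (equal?/recur a b (lambda (x y) (approx-equal? x y eps)))))

;; Lists of numbers, as in the example above.
(check-true (approx-equal? (list 0 2.0 3 5 9 123) (list 0 2 3 5 9 123) 0.01))

;; Heterogeneous lists: strings must match exactly, numbers within eps.
(check-true  (approx-equal? '("hi" 2) '("hi" 2.001) 0.05))
(check-false (approx-equal? '("hi" 2) '("hi" 2.1) 0.05))

;; A set and a list remain different kinds of value.
(check-false (approx-equal? (set 0 2 3 5 9 123) (list 0 2 3 5 9 123) 0.01))
```

The last check mirrors the set-versus-list case mentioned earlier: a tolerance on numbers does not make differently shaped containers equal.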

arjunguha commented 1 year ago

Fixed. Racket performance on a model increases slightly from 10.62% to 11.19%. I suspect with better Racket training data, it will have more of an impact.