Evaluation data

siddhartha-gadgil commented 1 year ago

Current status

There are currently three files in this repository that together contain 120 test statements:

data/false-prompts.txt
data/silly-prompts.txt
data/thm-prompts.txt

There is also a Lean program that can be run with various configurations to see how many of these statements elaborate. After completing the setup described in README.md, an example run is:

build/bin/bulkelab silly 10 4 false 15 8

This attempts to elaborate all statements in silly-prompts.txt, using 10 example prompts chosen by sentence similarity and 4 chosen by keywords, with 15 Codex completions for each statement at temperature 0.8.
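As a reading aid, the positional arguments of the example run can be given names. The following is only a sketch in Python, not part of the repository: the class `BulkElabConfig`, the function `parse_bulkelab_args`, and all field names are hypothetical. It assumes the last argument is the temperature in tenths (8 for 0.8), and it leaves the fourth argument (`false`) uninterpreted because its meaning is not stated here.

```python
from dataclasses import dataclass


@dataclass
class BulkElabConfig:
    """Hypothetical named view of the positional arguments in the example run."""
    prompt_set: str       # "silly" selects data/silly-prompts.txt
    num_similar: int      # example prompts chosen by sentence similarity
    num_keyword: int      # example prompts chosen by keywords
    flag: str             # fourth argument; its meaning is not stated in this thread
    num_completions: int  # Codex completions requested per statement
    temperature: float    # assumption: the last argument is the temperature in tenths


def parse_bulkelab_args(args: list[str]) -> BulkElabConfig:
    """Map the six positional arguments of the example run to named fields."""
    if len(args) != 6:
        raise ValueError(f"expected 6 arguments, got {len(args)}")
    name, similar, keyword, flag, completions, temp_tenths = args
    return BulkElabConfig(
        prompt_set=name,
        num_similar=int(similar),
        num_keyword=int(keyword),
        flag=flag,
        num_completions=int(completions),
        temperature=int(temp_tenths) / 10,
    )


config = parse_bulkelab_args("silly 10 4 false 15 8".split())
print(config)
```

Under these assumptions, the example run corresponds to 10 similarity-based prompts, 4 keyword-based prompts, 15 completions, and temperature 0.8.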

siddhartha-gadgil commented 1 year ago

Testing performance

There are two ways to test performance:

1. (Currently used) Use elaboration as a proxy.
2. Make a round trip: translate the Lean statement back to natural language using the same prompts reversed. The final sentence should be close enough to the initial one. To decide what counts as close enough, we can:
   - compare with the distance to other prompts in the database, or
   - simply query Codex whether the statements have the same meaning (with a confidence level).
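The first "close enough" criterion could be sketched as follows in Python. This is an illustration only, not the method used in the repository: it swaps in a plain bag-of-words cosine similarity for whatever sentence-similarity measure the project uses, and the function names and example sentences are made up. The round trip is accepted if the round-trip sentence is more similar to the original statement than to any other prompt.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase word counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def round_trip_is_close(original: str, round_trip: str, other_prompts: list[str]) -> bool:
    """True if the round-trip sentence is nearer to the original than to any other prompt."""
    rt = bag_of_words(round_trip)
    score = cosine_similarity(rt, bag_of_words(original))
    best_other = max(
        (cosine_similarity(rt, bag_of_words(p)) for p in other_prompts),
        default=0.0,
    )
    return score > best_other


# Made-up sentences for illustration; these are not taken from the data files.
original = "Every prime number greater than two is odd"
others = [
    "Every group of prime order is cyclic",
    "The sum of two even numbers is even",
]
good = round_trip_is_close(original, "Every prime greater than two is an odd number", others)
bad = round_trip_is_close(original, "Every group of prime order is cyclic", others)
print(good, bad)
```

A relative comparison like this avoids choosing an absolute similarity threshold, at the cost of depending on which other prompts are in the database.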

siddhartha-gadgil commented 1 year ago

Guidelines for expansion

We are working with a partial binary port of mathlib, so all examples should work with this port. Ideally, there should also be related prompts in clean_prompts.json.

siddhartha-gadgil commented 1 year ago

Stale; it is not clear what this means in terms of what works now.

siddhartha-gadgil / LeanAide

Evaluation data #1
