Use docstrings from definitions as well as theorems

siddhartha-gadgil / LeanAide

Tools based on AI for helping with Lean 4

Apache License 2.0

65 stars 5 forks source link

Use docstrings from definitions as well as theorems #9

Closed siddhartha-gadgil closed 1 year ago

siddhartha-gadgil commented 2 years ago

In many cases terminology appears only in the definition which causes some of the errors of #2
To fix this, we do the following:
[x] in docprompts.sc we should not filter by "kind" but make this a field.
[x] regenerate parsed_thms.txt using datagen.
[x] Using elabdatagen update elab_prompts.txt
[x] With the new filter, regenerate safe_prompts.json.
[x] We rerun to get new embeddings: this has to be a different path, not just branch.
[x] In using statements in prompts, we can use the kind instead of "theorem"
[ ] All this should be merged on verifying there is no regression.

siddhartha-gadgil commented 2 years ago

@0art0 The keywords need to be regenerated on the branch defdocs to include "kind" from data/safe-prompts.json on the same branch.

siddhartha-gadgil commented 2 years ago

There is regression in some cases, probably because theorem types have more information than definition types. Before using this we may wish to have a penalty for kind not being "theorem"

siddhartha-gadgil commented 2 years ago

Using both pairs and triples is now part of the code: de73787aa5aad4915f So only very routine stuff is needed before merging:

[x] separate numbers for prompts from theorems-only and from full database
[x] whether to reverse as a parameter
[x] custom prompts tweaked and at the end (the user gives the statement as a pair, including "theorem", "def" etc.

siddhartha-gadgil commented 2 years ago

The routine stuff is done in 343e2b5. Only testing remains before merging

siddhartha-gadgil commented 1 year ago

A lot has changed since the issue was made with the switch to OpenAI embeddings making a lot of the above obsolete. When we next test we should also test with mixed collection of prompts.

siddhartha-gadgil commented 1 year ago

With OpenAI prompts it may be best to add to the loaded (and pickled) array a Boolean whether something is a Prop, and compute the embeddings for everything. Then the distance can give a penalty for not being a Prop, or simply exclude non-props.