lean-dojo / ReProver

Retrieval-Augmented Theorem Provers for Lean
https://leandojo.org
MIT License

Format state_before/state_after datapoints as partial updates #17

Closed · antonkov closed this 1 year ago

antonkov commented 1 year ago

Generate datapoints in the new format:

1) we suspect it will make it easier for ReProver to predict tactics;
2) it will make it easier for paperproof to provide the after state.

It differs from the simple state_before/state_after approach as follows:

1) Remove case goalName from the before and after states.
2) Keep only the first goal in the before state, and in the after state keep only the goals it forked into.
3) In the after state, show ReProver only the added and updated hypotheses. Do not mention a hypothesis at all if it was deleted.
4) From each example (like the apply And.intro example below), generate multiple datapoints, one per diff. In paperproof, the user shouldn't need to write out every change a tactic would bring: a rw might affect three hypotheses, but the user will mention only one of them and still expect ReProver to make that change, without worrying about the other hypotheses not being mentioned in the after state. It's like a deepening lens: the model sees that it has a particular before state ⊢ p ∧ q and that it wants to reach ⊢ p; it does not care about anything else.
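
To make the per-diff rule concrete, here is a rough Python sketch of the generation (not the actual ReProver code; the names and the state representation, with hypotheses as (name, type) pairs, are illustrative):

```python
def diff_datapoints(before_hyps, before_goal, after_goals):
    """Yield one (before, after) training pair per individual diff.

    before_hyps: [(name, type)] hypotheses of the first goal (rule 2)
    before_goal: the first goal's target
    after_goals: [([(name, type)], goal)] the goals it forked into
    """
    before = "\n".join([f"{n} : {t}" for n, t in before_hyps] + [f"⊢ {before_goal}"])

    if not after_goals:
        # the tactic closed everything (example 1 below)
        yield before, "goals accomplished"
        return

    old = dict(before_hyps)
    for hyps, goal in after_goals:
        if goal != before_goal:
            # a goal change is one diff on its own; "..." hides everything else
            yield before, f"...\n⊢ {goal}"
        for name, ty in hyps:
            if old.get(name) != ty:
                # one diff per added or updated hypothesis; deletions are omitted (rule 3)
                yield before, f"...\n{name} : {ty}\n⊢ {goal}"
```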

Examples:

1) Previous format for the tactic trivial

before
case inl
p : Prop
h✝ : p
⊢ True

case inr
p : Prop
h✝ : p
⊢ True

after
case inr
p : Prop
h✝ : p
⊢ True

new format

before
p : Prop
h✝ : p
⊢ True

after
goals accomplished

2) new format for the tactic intro h, generating two datapoints

before

⊢ p -> q

after
...
h : p
⊢ q

before

⊢ p -> q

after
...
⊢ q

3) new format for apply And.intro, generating two datapoints

before
q : Prop
p : Prop
hp : p
hq : q
⊢ p ∧ q

after
...
⊢ p

before
q : Prop
p : Prop
hp : p
hq : q
⊢ p ∧ q

after
...
⊢ q

4) new format for rw [Nat.add_comm a b]

before
a : Nat
b : Nat
c : Nat
⊢ c + (a + b) = c + b + a

after
...
⊢ c + (b + a) = c + b + a
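
For reference, this before/after pair corresponds to a Lean proof along these lines (a minimal sketch; the second rw merely closes the snippet and is not part of the datapoint):

```lean
example (a b c : Nat) : c + (a + b) = c + b + a := by
  rw [Nat.add_comm a b]
  -- after state of the datapoint: ⊢ c + (b + a) = c + b + a
  rw [Nat.add_assoc]  -- closes the remaining goal
```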

5) a synthetic example, first in the old format

before 
a : 1
b : 2
c : 3
⊢ goalA

after
a : 666
b : 2 
c : 3
m : 777
⊢ goalA

The new format is two separate datapoints:

before
a : 1
b : 2 
c : 3
⊢ goalA

after
...
a : 666
⊢ goalA

before
a : 1
b : 2 
c : 3
⊢ goalA

after
...
m : 777
⊢ goalA
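
Running the earlier diff_datapoints sketch on this synthetic example reproduces exactly these two datapoints:

```python
before_hyps = [("a", "1"), ("b", "2"), ("c", "3")]
after_goals = [([("a", "666"), ("b", "2"), ("c", "3"), ("m", "777")], "goalA")]

for before, after in diff_datapoints(before_hyps, "goalA", after_goals):
    print(before, "->", repr(after))
# one datapoint for a : 666 and one for m : 777; b, c, and the goal are unchanged
```
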
yangky11 commented 1 year ago

I didn't understand this datapoint generated by intro h:

before

⊢ p -> q

after
...
⊢ q

Where is h : p?

antonkov commented 1 year ago

The idea is that we want the model to respond to two types of queries:

1) Specify new desired goal

before
${current_state}
⊢ ${current_goal}

after
...
⊢ ${new_goal}

2) Specify the desired hypothesis

before
${current_state}
⊢ ${current_goal}

after
...
${new_hypothesis}
⊢ ${current_goal}

The datapoint without any new hypothesis serves query type 1: the user doesn't specify which hypotheses appear, only the goal change. The ... conveys that there may be changes to the state, but the query can't say exactly which, and doesn't care. It might feel more natural on a cases example: given a type Color = Red | Blue | Green and the goal ⊢ ∀ c : Color, P c, the user specifies the desired goal ⊢ P Blue (the tactic will create goals P Red and P Green as a byproduct, but we assume that's what the user wanted).
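
A minimal Lean sketch of that cases example (P is an arbitrary predicate; the sorry branches stand for the goals left to prove):

```lean
inductive Color
  | Red
  | Blue
  | Green

example (P : Color → Prop) : ∀ c : Color, P c := by
  intro c
  cases c with
  | Red   => sorry  -- byproduct goal ⊢ P Red
  | Blue  => sorry  -- the goal the user asked for: ⊢ P Blue
  | Green => sorry  -- byproduct goal ⊢ P Green
```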

The other datapoint, with h : p, serves the second type of query, where the user wants a hypothesis change even if it requires a goal change. However, note the discrepancy between the query and the training datapoint: the old goal has to be specified in the query (we don't know what the new goal is, but we want to preserve the old one if possible):

before

⊢ p -> q

after
...
h : p
⊢ p -> q

I'm not sure whether the model would find this close to the intro h training example; please let me know what you think and whether it needs adjustment. Please also note that the last snippet is a query example, not a training datapoint; all training datapoints are consistent and valid, with the semantics of ... possibly hiding some hypotheses. Thanks

yangky11 commented 1 year ago

That makes sense. Thanks for the explanation. I think it should be good as long as examples at test time are presented to the model in the same format as in training. I'll start training the model with the updated data ASAP; it will finish in one week.

yangky11 commented 1 year ago

@antonkov It looks like the code doesn't handle state_after_str == "no goals"? It leads to an AssertionError at assert lines[-1].startswith("⊢").

antonkov commented 1 year ago

Oh right, good catch. It should be straightforward to add a check: if state_after_str == "no goals": goals_after = []. Can you do that?
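
Something along these lines, where parse_goals is a hypothetical stand-in for the existing parsing code:

```python
def goals_after(state_after_str: str) -> list:
    if state_after_str == "no goals":
        # the tactic closed every goal, so there is nothing to parse
        return []
    return parse_goals(state_after_str)  # the existing per-goal parser
```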

yangky11 commented 1 year ago

@antonkov I fixed this error, but there are other cases the parsing code cannot handle. For example, parse_hyps cannot parse inst✝⁷ : (x : B) → AddCommMonoid (E x) because of the additional : in the type. The LeanDojo repo has code for parsing the pretty-printed goals: https://github.com/lean-dojo/LeanDojo/blob/main/src/lean_dojo/interaction/dojo.py. Maybe you can use it directly or adapt it somehow (opening a new PR based on the updated given-next-state branch)?
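
For illustration, a sketch of splitting a hypothesis line on the first top-level " : " (i.e. outside any brackets), which handles the inst✝⁷ line above; reusing LeanDojo's parser is likely the more robust option:

```python
def split_hypothesis(line: str) -> tuple[str, str]:
    """Split "names : type" on the first " : " at bracket depth 0."""
    depth = 0
    for i, ch in enumerate(line):
        if ch in "([{⟨":
            depth += 1
        elif ch in ")]}⟩":
            depth -= 1
        elif depth == 0 and line.startswith(" : ", i):
            return line[:i], line[i + 3:]
    raise ValueError(f"not a hypothesis line: {line!r}")

# split_hypothesis("inst✝⁷ : (x : B) → AddCommMonoid (E x)")
# -> ("inst✝⁷", "(x : B) → AddCommMonoid (E x)")
```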

yangky11 commented 1 year ago

An example goal that your current code cannot parse:

R : Type u_2
B : Type u_1
F : Type u_3
E : B → Type ?u.410614
inst✝⁸ : NontriviallyNormedField R
inst✝⁷ : (x : B) → AddCommMonoid (E x)
inst✝⁶ : (x : B) → Module R (E x)
inst✝⁵ : NormedAddCommGroup F
inst✝⁴ : NormedSpace R F
inst✝³ : TopologicalSpace B
inst✝² : TopologicalSpace (TotalSpace E)
inst✝¹ : (x : B) → TopologicalSpace (E x)
inst✝ : FiberBundle F E
ι : Type u_4
Z : VectorBundleCore R B F ι
b✝ : B
a : F
i j : ι
b : B
hb : b ∈ baseSet Z i
v : F
⊢ Trivialization.symm (localTriv Z i) b v = ↑(coordChange Z i (indexAt Z b) b) v

antonkov commented 1 year ago

Yes, thanks. I'll iterate to fix the errors and send a new PR.