leanprover-community / aesop

White-box automation for Lean 4

Necessity of unsafe weights #16

Closed. Kha closed this issue 2 years ago.

Kha commented 2 years ago

The use of weights for proof search makes complete sense, but it also seems basically impossible to make an informed guess at what weights to choose when adding a new rule. As long as the proof terminates in a reasonable amount of time, I assume they do not matter much. Isabelle's auto does not require weights for unsafe rules (not sure if it even has them tbh), which is more user friendly. Should there be something like a default weight, just like safe rules have a default penalty?
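For context, a quick sketch of what rule registration looks like, using the attribute syntax from the Aesop README (the exact grammar may have changed since this thread; the lemmas are just illustrations):

```lean
import Aesop

-- Safe rule: applied whenever it matches; safe rules carry a
-- default penalty that only orders them among each other.
@[aesop safe apply]
theorem my_and_intro {a b : Prop} (ha : a) (hb : b) : a ∧ b :=
  ⟨ha, hb⟩

-- Unsafe rule: the user must pick a success probability up front,
-- which best-first search uses to prioritise goals.
@[aesop unsafe 25% apply]
theorem my_or_inl {a b : Prop} (h : a) : a ∨ b :=
  Or.inl h
```

The question above is essentially whether the `25%` should be allowed to be omitted.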

/cc @astahfrom

astahfrom commented 2 years ago

Having used Aesop for a small project, I agree that default weights or at least a guide with heuristics would be very welcome!

Isabelle's auto does not have weights, as it uses a depth-first approach. I guess rule application is cheap enough there that it can get away with this, while in Aesop, best-first search can save a lot of work on futile branches. It would be nice to have this confirmed empirically, as the depth-first approach would not only remove the need for weights but also simplify the metavariable handling (right?).

If users inadvertently choose poor weights, and we cannot provide a good default, then best-first search may no longer be worth it?

Kha commented 2 years ago

Huh, I didn't know that about auto! But even if users always chose a default weight, that would make it equal to breadth-first search, which might still scale better than (depth-limited) depth-first search, no? Not in all cases of course, and the question about complexity is warranted as well.
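To spell out the comparison (a sketch, assuming a goal's priority in Aesop's best-first search is the product of the success probabilities of the rules on its path): if every rule gets the same probability p, a goal reached by d rule applications has priority

p^d

which is strictly decreasing in d for 0 < p < 1, so the queue always prefers the shallowest open goal, and the search expands goals in breadth-first order.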

JLimperg commented 2 years ago

What's so difficult about the rule weights? I envisioned something like 'last resort - 1%, unlikely - 25%, even - 50%, likely - 75%'.

Regarding alternatives:

Kha commented 2 years ago

It's not really about saving me typing, but about sparing me from having to make any decision at all at that point (which documentation stating "when unsure, 50% is a sensible default value" would also more or less achieve). It's also about documenting in the code that I didn't make a decision, which is different from explicitly choosing 50% because I actually think that's a good approximation.

astahfrom commented 2 years ago

My difficulty is the following. Say I have some nice lemma and I want to register it as an unsafe rule so it is automatically applied for me. Now I find myself imagining all possible future proof obligations where my lemma applies: will this be useful for 1% of them? 25%? 50%? This feels very hard to quantify.

alcides commented 2 years ago

As a side comment, this seems like the kind of thing reinforcement learning could be used for: not necessarily with neural networks, but as dynamic probability adjustment. The challenge would be defining a metric for measuring progress (maybe minimizing the number of open goals?).
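One hypothetical form such dynamic adjustment could take (just an illustration of the idea, not anything Aesop implements): keep an empirical success estimate per rule and update it after every application, e.g. with an exponential moving average

p(t+1) = (1 - a) * p(t) + a * r(t)

where r(t) is 1 if the application contributed to a goal that was eventually closed and 0 otherwise, and a is a learning rate.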

Our work is mostly about term mode rather than tactic mode, but we have published some preliminary results. We intend to evaluate this in the context of synthesizing Lean proofs, and Aesop may be a good shortcut for the implementation because of its explicit weights.

JLimperg commented 2 years ago

Unsafe rules now have a default success probability of 50%. I think that's all that's likely to happen in the short term, so I'll close the issue for now.
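Concretely, assuming the attribute grammar now accepts an omitted probability, the two registrations below should be equivalent (illustrative lemmas only):

```lean
import Aesop

-- Omitting the success probability falls back to the 50% default...
@[aesop unsafe apply]
theorem my_or_inl₁ {a b : Prop} (h : a) : a ∨ b := Or.inl h

-- ...so this explicit registration behaves the same.
@[aesop unsafe 50% apply]
theorem my_or_inl₂ {a b : Prop} (h : a) : a ∨ b := Or.inl h
```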

@alcides I've thought a bit about fancier prioritisation methods as well. I won't implement anything like that right now because it's unclear whether Aesop is even suitable for large searches where accurate prioritisation would make a difference. We also don't have data yet to quantify, say, the usefulness of particular rules.

On a more conceptual level, I designed Aesop as a white-box search tactic where users can understand and control what is going on. This is one of the few advantages it has over a hammer or an SMT backend, which are more powerful but also notoriously opaque. Putting any sort of automatic prioritisation in there compromises this white-boxiness, so I'm not too keen on it. However, it may turn out that users don't care about this sort of control anyway (as this thread suggests, and I've heard similar things about Isabelle's auto); in that case I'll have to rethink the design.

Of course, if you'd like to use Aesop for experiments I'd be honoured and happy to help.