vprover / vampire

The Vampire Theorem Prover
https://vprover.github.io/
Other
302 stars 52 forks source link

Subsumption and Subsumption Resolution via SAT solving #546

Closed RobCoutel closed 4 months ago

RobCoutel commented 7 months ago

In this pull request, we completely replace the code for the subsumption and subsumption resolution module by an implementation using a SAT solver to encode the problems.

The papers:

explain the details of the encodings as well as provide proofs of their soundness.

The most important parts of the PR are the following:

RobCoutel commented 7 months ago

Here find the answer of your general comments.

"1. At some point in the future I could imagine that we might upgrade Minisat to CaDiCaL, which has user propagation - this could in principle replace the custom SAT solver needed for this, although I suspect it won't be that easy. But this is not something to think about now, just a vague future idea."

Yes, I agree. This is even one of the claims of the paper if I remember correctly. I am not sure when we will be able to put this purely engineering effort together. But this could be done in the future. Realistically, I am afraid this will remain on the todo list for a long time. This would require some experimentations, but I am not sure CaDiCaL is the best solution, because it is tuned for very hard problems. In our case, we want the solver to be as fast as possible on small problems. Without further investigation, it is hard to know what would be the impact. Note that the solver is actually not MiniSAT. It is its own thing, but the AMO and substitution constraints.

"2. Some of the subsat stuff uses stuff that look like the Vampire versions, but aren't - e.g. calling assert rather than ASS. As we depend on invoking the crash routines reliably for e.g. running on remote servers, please ruthlessly hunt these down and replace them with the Vampire versions."

The subsat was entirely written by @JakobR I have to admit that I did not give it a thorough look. I think the idea was to have a Vampire independant module. I wuold like to have Jakob's opinion before doing any significant change there.

"3. In general the ground shifted a bit during the development of this work, so some things are no longer true - e.g. we have C++17 now, the allocator is no longer compulsory, etc. Where this is obvious I've pointed it out - just to give some background. Has this been tested for performance reasonably strenuously?"

In terms of practical changes, what should I update for the the allocator? I assumed I could abstract it away, but it seems from this comment that I cannot. Could you give me a few pointers there?

As for performance, we have the empirical analysis made for the paper that gave us a 36% increase in speed compared to the old implementation. However, we did not really test the portfolio mode since I was told it is highly tuned with the old subsumption and subsumption resolution code.

"5. Has this been tested for weird option configurations and weird input problems? I'd be particularly careful about polymorphic inputs. If it doesn't crash on the whole TPTP, then good."

I do not think we checked with weird options. But we ran the Benchmarks on the whole of TPTP and I don't think we fond any error there. If @JakobR could confirm that would be nice.

"PS A small request - could formatting changes be separated out in future? In general I'm happy to merge formatting changes in Vampire, some of the existing indentation is unconscionable, but I don't want to read it while looking for significant changes."

Sorry about that. I had an automatic formater that modified a lot of things without me realising it. I tried to manually undo most of it. I will pay more attention to it in the future. I hope it was not too much burden.

MichaelRawson commented 7 months ago

Note that the solver is actually not MiniSAT. It is its own thing, but the AMO and substitution constraints.

True, I should have put this more clearly. Agreed with your points about the engineering effort and CaDiCaL (potentially) being too heavyweight. For the sake of clarity: we do use Minisat internally for other purposes. We could also consider using subsat for other bits of Vampire and removing Minisat in the interest of "no more than one SAT solver at a time".

I also haven't read the subsat code in detail. I will trust @JakobR's considerable expertise and assume it works, otherwise we could be here for months.

In terms of practical changes, what should I update for the the allocator? I assumed I could abstract it away, but it seems from this comment that I cannot. Could you give me a few pointers there?

No, all is well - you can indeed abstract it away. It's just that some things are no longer necessary: STLAllocator or USE_ALLOCATOR will no longer produce a crash if they are missing, for example. They might (but probably are not unless you have measured it) be a good idea for performance.

However, we did not really test the portfolio mode

Probably fine. I don't think many people do, it's too noisy.

I do not think we checked with weird options.

OK. You may find some bugs with random testing but beware of the existing bugs I didn't fix yet.

I hope it was not too much burden.

Not at all.

JakobR commented 7 months ago

Thanks for all the comments @MichaelRawson, and thanks @RobCoutel for preparing the PR! I can take care of the subsat-related suggestions (soon). I'll answer more detailed comments then as I go through them.

Some of the subsat stuff uses stuff that look like the Vampire versions, but aren't - e.g. calling assert rather than ASS. As we depend on invoking the crash routines reliably for e.g. running on remote servers, please ruthlessly hunt these down and replace them with the Vampire versions.

Yes, good point. The reason for this is that I tested subsat during development independently from Vampire on some SAT problems. This is also why there is a separate CMake file for it. I will try to integrate it into the main file.

MichaelRawson commented 7 months ago

Thanks @JakobR! I wasn't sure if you had time for this, so it's great you can help us out. And of course there's no particular hurry - either we get this in for CASC (good news) or we don't (also good - then it gets tested for a year), so it's win/win.

quickbeam123 commented 7 months ago

Hi all, for the testing, it might be useful to merge master into this branch now, so that we know we are not diverging.

quickbeam123 commented 7 months ago

Is the only difference in Options.hpp just whitespace cleanup?

(Could you please avoid these in the future? I like to make things look nice as I move around too, but let's only do this manually and for the files you actually touch. If one hot-key press on your side, makes the reviewing (and git blaming) much harder for others in the future, I don't think it has been worth it overall.)

quickbeam123 commented 7 months ago

As @MichaelRawson suggested, would it be possible to make the old Makefile way of compiling/linking work here too?

JakobR commented 6 months ago

Hi @quickbeam123, I'm sorry but I will only get to this next week. If you want to merge it early I can send my changes in response to @MichaelRawson's comments in a separate PR.

quickbeam123 commented 6 months ago

No rush, @JakobR, I only quickly wrote down some "nice to have"s and will in the meantime go for a bit vacation :)

JakobR commented 6 months ago

Oh damn... most of my comments were supposed to be answers to @MichaelRawson's comments, but it looks like github lost the connection somehow? Edit: please click on "View reviewed changes" above to see the context (next to "JakobR reviewed ...")

MichaelRawson commented 6 months ago

Thanks for the hard work, @JakobR ! That was quick. I've replied inline where still possible, I guess GitHub doesn't like the big merges so much. Others:

I can remove MLMatcherStats, if you want.

No need, thanks for the explanation.

Using uint64_t extra_j would probably work too. Do you want me to keep the comment (or a similar one)?

Keep the comment for now, it gives at least the intuition!

JakobR commented 6 months ago

I think this is done from my side.

I'm repeating one unresolved point buried in the inline comments, but I think it's a minor issue:

quickbeam123 commented 4 months ago

Guys, my experiments suggest that something does not quite work the way it should, at least not for my favourite (pseudo-)default strategy -sa discount -awr 10. Instead of suspecting your likely-well-optimised sat-based subsumption code itself, I want to question, at least for a second, the Loop ("for forward subsumption and subsumption resolution was optimized to mutualize...") and the heuristically choices it encapsulates. I think it might be making a heuristic commitment that is detrimental for the average performance (regardless of the speed of performing the individual subsumption checks; indeed subsumption is less important for discount and less important under avatar than without it).

The comparison for -i 100000 limited -sa discount -awr 10 run over the FOF part of TPTP 9.0.0 are:

Sort by UNS
9344 ['problemsSTD_robin8250_dis10_i100000.pkl']
9347 ['problemsSTD_master7703_dis10_i100000.pkl']

which is close, but's it's quite a bit worse at the 10000 mark:

8190 ['problemsSTD_robin8250_dis10_i100000.pkl']
8219 ['problemsSTD_master7703_dis10_i100000.pkl']

(I plan to generate a full cactus plot comparison soon).

However, the main suspicious thing is the number of activations needed by the two vampires to succeed on the 9307 solved (in 100000) by both, namely: 53283652 vs 53633200.

Before I start digging deeper, could you please sum up how "the Loop" differs from the previous solution we had in Vampire? Also, what would be your favourite platform for a bit of discussion on this? (Are you, e.g., getting notifications from our zulip?)

RobCoutel commented 4 months ago

Hi Martin,

However, the main suspicious thing is the number of activations needed by the two vampires to succeed on the 9307 solved (in 100000) by both, namely: 53 283 652 vs 53 633 200.

I don't know what is meant here. Do you mean the number of subsumption resolution calls? If yes, then it makes sense. See below the difference between the old and new loops.

Before I start digging deeper, could you please sum up how "the Loop" differs from the previous solution we had in Vampire?

Previously, Vampire would first check all the subsumptions and then, if it failed, all the subsumption resolutions. However, since setting up the problems is expensive, we have changed the loop to perform both subsumption and subsumption resolution on each instance.

Old: For each candidate clause: Check subsumption For each candidate clause: Check subsumption resolution

New: For each candidate clause: Check subsumption Check subsumption resolution

The drawback is that if a clause is subsumed in the new version, we will perform useless subsumption resolution checks that the old loop does not. The advantage is that we set up both problems simultaneously, saving on pruning, matching, and index time.

This could be a problem if a certain strategy gets a lot of subsumed clauses.

I will have a look to make sure nothing fishy is happening. But what is curious is that the optimization seemed to be very beneficial on otter.

Another thing that might change the results is that there might be more than one solution for subsumption resolution. Therefore the search is affected by the method employed.

Also, what would be your favourite platform for a bit of discussion on this? (Are you, e.g., getting notifications from our zulip?)

GitHub and Zulip are both fine. But on Zulip I have 2 accounts (not super convenient, I know). I will answer faster on the one where I have a profile picture set. Zulip might be a bit more spontaneous.

Best, Robin

easychair commented 4 months ago

Hi Robin,

I am not sure I understood your message.

Are you proposing to change the main loop because you think SAT-based subsumption and subsumption resolution should be made the only way to do these two operations in Vampire?

Best, Andrei

On Tue, 16 Jul 2024 at 17:20, Robin Coutelier @.***> wrote:

Hi Martin,

However, the main suspicious thing is the number of activations needed by the two vampires to succeed on the 9307 solved (in 100000) by both, namely: 53 283 652 vs 53 633 200.

I don't know what is meant here. Do you mean the number of subsumption resolution calls? If yes, then it makes sense. See below the difference between the old and new loops.

Before I start digging deeper, could you please sum up how "the Loop" differs from the previous solution we had in Vampire?

Previously, Vampire would first check all the subsumptions and then, if it failed, all the subsumption resolutions. However, since setting up the problems is expensive, we have changed the loop to perform both subsumption and subsumption resolution on each instance.

Old: For each candidate clause: Check subsumption For each candidate clause: Check subsumption resolution

New: For each candidate clause: Check subsumption Check subsumption resolution

The drawback is that if a clause is subsumed in the new version, we will perform useless subsumption resolution checks that the old loop does not. The advantage is that we set up both problems simultaneously, saving on pruning, matching, and index time.

This could be a problem if a certain strategy gets a lot of subsumed clauses.

I will have a look to make sure nothing fishy is happening. But what is curious is that the optimization seemed to be very beneficial on otter.

Another thing that might change the results is that there might be more than one solution for subsumption resolution. Therefore the search is affected by the method employed.

Also, what would be your favourite platform for a bit of discussion on this? (Are you, e.g., getting notifications from our zulip?)

GitHub and Zulip are both fine. But on Zulip I have 2 accounts (not super convenient, I know). I will answer faster on the one where I have a profile picture set. Zulip might be a bit more spontaneous.

Best, Robin

— Reply to this email directly, view it on GitHub https://github.com/vprover/vampire/pull/546#issuecomment-2231338154, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABVY4BIRAVMOY6AGAM55RCTZMVB57AVCNFSM6AAAAABGPDHLGGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMZRGMZTQMJVGQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

quickbeam123 commented 4 months ago

I just deleted a line

static_assert(std::is_same<subsat::allocator_type<int>, STLAllocator<int>>::value, "unexpected subsat::allocator_type");

since we don't have any STLAllocator anymore. @JakobR , could that cause trouble?

RobCoutel commented 4 months ago

Hi Robin, I am not sure I understood your message. Are you proposing to change the main loop because you think SAT-based subsumption and subsumption resolution should be made the only way to do these two operations in Vampire? Best, Andrei

Hello Andrei,

I do not touch the main loop of Vampire. I only changed the Forward Simplification loop for subsumption and subsumption resolution. This is where the optimization kicks in.

I removed the old subsumption and subsumption resolution for code maintainability. Our experiments show that the SAT method is faster by a factor of 35% on average, with a much lower variance. image

However, one thing that is not taken into account here is the runtime. We ran our experiments on a 60-second timeout with -sa otter. But the portfolio will run for shorter bursts. Our method could be worse in this case because its strength is scaling. But if Vampire only runs for a few seconds on a specific strategy, it is possible that this scaling effect is not observed.

We are looking into it with Martin.

Best, Robin

JakobR commented 4 months ago

I just deleted a line

static_assert(std::is_same<subsat::allocator_type<int>, STLAllocator<int>>::value, "unexpected subsat::allocator_type");

since we don't have any STLAllocator anymore. @JakobR , could that cause trouble?

Yes, this is fine to remove! (In an earlier version, it was possible to change the SAT solver's allocator. This assertion checked that when used within Vampire, the correct allocator is set.)