potassco / clingo

🤔 A grounder and solver for logic programs.
https://potassco.org/clingo

A propagator-based approach to timeout a solve? #473

Open mbalduccini opened 9 months ago

mbalduccini commented 9 months ago

Hi,

In my investigation of the unexpected performance results I mentioned in https://github.com/potassco/clingo/issues/453, I tried an alternative way of triggering a timeout in Control::solve(), instead of using SolveHandle::cancel(). The approach, detailed below, is motivated by studying the implementation of SolveHandle::cancel() and especially of Solver::setStopConflict(). In experiments on (a small number of) challenging optimization problems, this approach almost completely eliminated the issues I had previously reported in https://github.com/potassco/clingo/issues/453, where I used SolveHandle::cancel(). I was wondering if you see any problems with the approach, and in general whether you have any feedback or thoughts about it.

The approach is:

  1. Before calling Control::solve(), I register a propagator via clingo_control_register_propagator() (it should be possible to use the C++ equivalent with the same results)
  2. When I want to trigger a timeout, I set a flag, timeout_triggered, to true and wait indefinitely with SolveHandle::wait(-1.0)
  3. While timeout_triggered is true, the propagator that I registered:
    1. Takes the first literal, call it lit, from the change list that clingo passed to the propagator
    2. Adds a clause { -lit } of type clingo_clause_type_volatile
    3. Calls clingo_propagate_control_propagate()
    4. Returns true (unless clingo_propagate_control_propagate() failed)
  4. SolveHandle::wait(-1.0) eventually returns, and the main thread sets timeout_triggered back to false
  5. Control::solve() can now be called again as needed

The intuition behind the approach is that, after timeout_triggered is set to true, the propagator causes all propagation attempts to fail and the solve eventually terminates. Then, when Control::solve() is called again, the clauses that the propagator had added are automatically removed since they were marked volatile, and thus the computation can resume without any permanent impact from the timeout.
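For concreteness, here is a minimal sketch of such a propagator using the clingo C++ API (clingo.hh). The class name TimeoutPropagator and the timeout_triggered flag are illustrative, not part of clingo, and error handling is omitted:

```cpp
#include <clingo.hh>
#include <atomic>

std::atomic<bool> timeout_triggered{false};

class TimeoutPropagator : public Clingo::Propagator {
public:
    void init(Clingo::PropagateInit &init) override {
        // Step 1: watch both polarities of every solver literal so that
        // propagate() is called whenever the assignment changes.
        for (auto &&atom : init.symbolic_atoms()) {
            Clingo::literal_t lit = init.solver_literal(atom.literal());
            init.add_watch(lit);
            init.add_watch(-lit);
        }
    }
    void propagate(Clingo::PropagateControl &ctl, Clingo::LiteralSpan changes) override {
        if (!timeout_triggered.load() || changes.empty()) { return; }
        // Steps 3.1-3.4: take the first changed literal and add { -lit } as a
        // volatile clause; add_clause()/propagate() returning false signals
        // the conflict that eventually ends the search.
        Clingo::literal_t lit = changes.front();
        if (ctl.add_clause({-lit}, Clingo::ClauseType::Volatile)) {
            ctl.propagate();
        }
    }
};
```

On the driver side, steps 2, 4 and 5 could look as follows (assuming ctl is a Clingo::Control that has already been grounded):

```cpp
TimeoutPropagator prop;
ctl.register_propagator(prop, false);
auto handle = ctl.solve(Clingo::SymbolicLiteralSpan{}, nullptr, /*asynchronous=*/true, /*yield=*/false);
// ... once the time budget is exhausted:
timeout_triggered = true;   // step 2: make every propagation attempt fail
handle.wait(-1.0);          // block until the solve returns
timeout_triggered = false;  // step 4
// Step 5: ctl.solve() can now be called again; the volatile clauses are gone.
```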

Thanks! Marcello

rkaminsk commented 9 months ago

This approach should work. You could even add an empty clause to backtrack to the top level right away. Do you watch all literals? You can also set the check mode to fixpoint and add a clause at the propagation fixpoint.
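A sketch of that variant, reusing the hypothetical timeout_triggered flag from above: with the check mode set to Fixpoint, clingo calls check() at every propagation fixpoint, so no watches are needed, and an empty volatile clause makes the step conflicting right away.

```cpp
class EmptyClausePropagator : public Clingo::Propagator {
public:
    void init(Clingo::PropagateInit &init) override {
        // No watches needed: with Fixpoint check mode, check() is invoked
        // at every propagation fixpoint instead.
        init.set_check_mode(Clingo::PropagatorCheckMode::Fixpoint);
    }
    void check(Clingo::PropagateControl &ctl) override {
        if (timeout_triggered.load()) {
            // An empty volatile clause is an immediate conflict (add_clause()
            // returns false) and is discarded before the next solve call.
            ctl.add_clause(Clingo::LiteralSpan{}, Clingo::ClauseType::Volatile);
        }
    }
};
```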

@BenKaufmann, it sounds a bit strange that it is faster to stop the search like this. Do you think one could simply improve the implementation?

mbalduccini commented 9 months ago

Thanks!!! I hadn't thought about an empty clause; I'll definitely try it and report back if I notice significant differences. I am indeed watching all literals. I am not familiar with setting the check mode to fixpoint, so I'll look into it.

To be clear, this approach does not make it faster to stop the search -- the stopping time is negligible in both cases, from what I saw. What my approach does is:

  1. It makes the second solve arrive faster at answer sets that are at least as good as the ones found by the first solve. Recall that I run solve(), trigger a timeout after a certain amount of time, and then run solve() again, supposedly in an incremental fashion. If I time out the first solve with cancel(), there are cases in which the first solve reaches a certain cost within 5 sec, at which point the timeout occurs, and the second solve is then unable to reach that same cost even in 60 sec. That is surprising: since the two solves are executed incrementally, I would have expected the second solve to start more or less from where the first one stopped.
  2. It also makes the performance of the second solve more consistent. When I use cancel() on problems where the second solve is given enough time to complete the search, the time taken by that second solve varies quite a bit from run to run on the same problem instance: the slower runs take twice as long as the faster ones. (Again, a "run" is just me rerunning the program on the very same problem instance, always with the default random seed.) With the approach described above, instead, the time difference between faster and slower second solves is negligible -- and comparable with the fastest second solves that I get with cancel().
rkaminsk commented 9 months ago

It's strange that this makes the second run more stable; I have no idea why that is the case. Could you share a minimal working example showing the difference? I cannot promise that we will take a look right away, but we'll keep it on the stack.

Regarding the check mode, have a look at:

mbalduccini commented 9 months ago

Thanks!! I'll send you the files tomorrow.

mbalduccini commented 9 months ago

Here are the files: files.zip

rkaminsk commented 8 months ago

Thanks for the files. We'll try to investigate at some point.

mbalduccini commented 8 months ago

Thanks! I have an update that will hopefully simplify your investigation. I found that the strange behavior occurs with established benchmarks as well. I am attaching a new archive that contains (a) a simpler C++ program and (b) instances from an ASPCOMP 2014 benchmark domain where the issue occurs. (The issue occurs in other benchmark domains as well, but it is less evident there.) Unlike the file I previously sent you, this new C++ program processes ASP files rather than aspif files -- I did this to rule out that the issue was due to how I was loading the aspif files. You can reproduce my results with:

./exp-timeout-asp_only --abortive 5 <instance>

For your convenience, the instances already include the problem encoding.

Sample run:

./exp-timeout-asp_only --abortive 5 BENCHMARK-SET-asp-w_seed/CrossingMinimization-O10/0082-crossing_minimization-9-0.complete

produces an output such as:

STARTING
TIMEOUT!
best costs: 197 
solve: Elapsed: 5.22247
STARTING
TIMEOUT!
best costs: 275 
solve: Elapsed: 61.9678

As with my previous program, the output includes the outcome of two consecutive solves executed incrementally. The first solve uses the timeout specified by the second command-line argument (5 sec in the example) and the second solve has a fixed 60 sec timeout. The important information is on the two "best costs:" lines: the first reports the cost of the best answer set found during the first solve, and the second reports the cost of the best answer set found during the second (incremental) solve. You can see in the example output how the second solve fails to reach a cost close to that of the first solve, despite having a substantially longer timeout and being executed incrementally.

I ran all of the included instances 3 times with different random seeds (2550, 5150 and 8480, which you can specify by appending --seed <n> to the command line). You can find my results on all sample instances in the file BENCHMARK-SET-asp-w_seed/results.csv.

files.zip

BenKaufmann commented 8 months ago

@mbalduccini @rkaminsk I investigated this issue today and AFAICS it all boils down to heuristic effects.

The only conceptual difference between interrupting the search via a call to SolveHandle::cancel() (i.e. Solver::setStopConflict()) and adding conflicting clauses (or an empty clause) from a propagator is that the latter will mark the solve step as completed with a final unsat result. Hence, for an optimization problem clasp would then (wrongly) report that the solution is optimal under the given assumptions. However, since volatile/enumeration specific knowledge is removed before the next solve attempt, this is mostly a cosmetic problem.
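In terms of the documented SolveResult accessors, the difference looks roughly like this (a hypothetical driver-side check, not code from the thread):

```cpp
auto result = handle.get();
// SolveHandle::cancel(): result.is_interrupted() is true and the truth
// value stays unknown. Propagator-forced conflict: result.is_unsatisfiable()
// is true, i.e. the step looks completed -- the (mostly cosmetic) mis-report
// described above.
```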

Regarding heuristics, the issue is twofold. First, because of the timeout, the result of the first search is highly non-deterministic. Second, the suggested way of adding conflicting clauses in the propagator influences certain heuristics. As written, the propagator adds conflicting clauses with a literal from the highest decision level. When resolving the conflict, the solver will backtrack without saving the phases of literals from the highest level. Furthermore, it will then add a new volatile clause, thereby potentially bumping the scores of the contained variables. Finally, depending on the problem and search state, the propagator might need to add multiple such clauses before the step becomes unsat. A simpler and less intrusive approach would be to just add an empty clause in step 3, as @rkaminsk already suggested. This way, the step becomes unsat immediately.

To get a better understanding of what is happening, I replaced the time-based approach with a deterministic conflict-based approach, i.e., instead of stopping the search after a given time, I stopped the search after a fixed number of conflicts (e.g. 8500/50000 for the CrossingMinimization problems). With this and the simpler empty-clause-based propagator, the two approaches produce exactly the same search spaces. For example, with seed = 2550, I got:
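The thread does not say how the conflict limit was imposed; one way to approximate such a deterministic cutoff is the solve_limit configuration entry (the equivalent of the --solve-limit option). A sketch, assuming 8500/50000 are the conflict budgets of the first and second solve:

```cpp
// ctl is a Clingo::Control that has already been grounded.
Clingo::Configuration conf = ctl.configuration();
conf["solve"]["solve_limit"] = "8500";   // first solve: stop after 8500 conflicts
ctl.solve(Clingo::SymbolicLiteralSpan{}, nullptr, false, false).get();
conf["solve"]["solve_limit"] = "50000";  // second solve: larger budget
ctl.solve(Clingo::SymbolicLiteralSpan{}, nullptr, false, false).get();
```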

| Instance | First | Prop-Def | Empty/Cancel |
| --- | --- | --- | --- |
| 0050-crossing_minimization-10-0 | 171 | 154 | 182 |
| 0069-crossing_minimization-9-0 | 199 | 249 | 226 |
| 0073-crossing_minimization-9-0 | 191 | 235 | 219 |
| 0081-crossing_minimization-9-0 | 190 | 207 | 217 |
| 0082-crossing_minimization-9-0 | 197 | 256 | 235 |

Finally, as mentioned by @rkaminsk in https://github.com/potassco/clingo/issues/453, option --forget-on-step can be useful in this scenario. For example, when setting --forget-on-step to "varScores,signs", the second search produced better results than the first most of the time.
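Since --forget-on-step is a regular clasp option, it can simply be passed when constructing the control object, for example:

```cpp
// Option string taken from the comment above.
Clingo::Control ctl{{"--forget-on-step=varScores,signs"}};
```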

In summary, I don't think there is an easy/obvious "fix" for the observed behavior. We could think about adding some kind of "cancel/resume heuristic" and/or adding special handling for the case where the logic program is not changed between two consecutive solve calls.