Slow Petri Net Unfolding (the Bumblebee Observations)

Heizmann commented 5 years ago

The Petri net unfolding is at the moment (early Oct 2019) the major bottleneck of the Petri Automizer verifier. The Petri net unfolding seems to be unnecessarily slow.

The following list of observations should help us to discuss improvements.

The Dinosaur Observations:

D01: Typically, we have dozens of conditions per place.
D02: Only a few candidates for possible extensions finally evolve into a possible extension.
D03: >90% of the CPU time is spent on checking if two conditions are in co-relation, but the implementation of this check is efficient.

The Bumblebee Observations:

B01: We instantiate conditions in candidates incrementally, approx 10% of the initial 1-condition instantiated candidates finally evolve to at least one possible extension.
B02: While evolving a candidate we try to instantiate places with conditions and iterate over all conditions (dozens, see D01) that have a certain place.
B03: Because of B02 the high costs are probably not due to the high number of the 1-condition instantiated candidates but due to the high costs for evolving a candidate.
B04: Many instantiations fail because the condition is a successor of a cut-off event.
B05: The data structures of the ConditionEventsCoRelation would allow us to get all conditions that are in co-relation to a given condition c. The number of all co-related conditions is typically high. Iterating over all of them is not efficient.
B06: These data structures would not allow us to get all conditions that are in co-relation to a given condition c and have a certain place p.
B07: The B06 problem can be addressed: Do not store a binary condition->event relation, but a ternary condition->transition->event relation. Given a condition c and a place p, when asked for {c' | c' in co-relation to c and has place p}, find all predecessor transitions of p (separate tracking necessary, Petri nets only provide forward successors), get the co-related events of these transitions, and filter their successor conditions for place p.
B08: We could save even more if we omit cut-offs from co-relation. Check with other users (backfolding, LBE) if this could become the default.
B09: Memory optimization for internal data structure: We store for each condition the co-related events. Idea: Store the information only for the parent event (is a subset) and only the remaining diff for the condition. Saves probably also runtime because we have to copy fewer relations. Makes implementation more complex. (Probably wait until this becomes a bottleneck)
B10: We could use co-relation information to reduce the number of 1-condition instantiated candidates in the first place. Orthogonal to B07. Requires additional callbacks for the on-demand construction. Makes implementation significantly more complex. Do not implement until this becomes the dominant bottleneck of the algorithm.
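
The lookup that B06 asks for can be sketched as follows. This is a minimal illustration and not the ConditionEventsCoRelation of Ultimate: it indexes co-related conditions directly by place instead of going through the ternary condition->transition->event relation proposed in B07. The class name PlaceIndexedCoRelation and all of its methods are hypothetical, and conditions and places are plain strings.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: co-related conditions of a condition, grouped by place. */
final class PlaceIndexedCoRelation {
    // condition -> place -> co-related conditions that carry this place
    private final Map<String, Map<String, Set<String>>> index = new HashMap<>();
    private final Map<String, String> placeOf = new HashMap<>();

    /** Registers a condition together with the place it is labeled with. */
    void addCondition(String condition, String place) {
        placeOf.put(condition, place);
        index.putIfAbsent(condition, new HashMap<>());
    }

    /** Records that c1 and c2 are in co-relation (the relation is symmetric). */
    void addCoRelated(String c1, String c2) {
        link(c1, c2);
        link(c2, c1);
    }

    // Both conditions must have been registered via addCondition before.
    private void link(String from, String to) {
        index.get(from)
             .computeIfAbsent(placeOf.get(to), p -> new HashSet<>())
             .add(to);
    }

    /** Returns {c' | c' is co-related to c and has place p} without scanning all co-related conditions. */
    Set<String> coRelatedWithPlace(String c, String p) {
        Map<String, Set<String>> byPlace = index.get(c);
        if (byPlace == null) {
            return Collections.emptySet();
        }
        return byPlace.getOrDefault(p, Collections.emptySet());
    }
}
```

With such an index, coRelatedWithPlace(c, p) costs two hash lookups, but every co-related pair is stored twice. Given the high number of co-related conditions mentioned in B05, the memory cost may well outweigh the benefit.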

Heizmann commented 5 years ago

B11: The method BranchingProcess#isOneSafe is also very costly. If we trust the input to be 1-safe, we could skip this check. Otherwise we could check 1-safety more efficiently by checking that each place occurs at most once in each local configuration.
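
The cheaper check from B11 could look roughly like this. The sketch assumes that the caller passes the place of every condition in the cut of one local configuration; the class OneSafetyCheck and its method are hypothetical and not part of BranchingProcess.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the per-configuration 1-safety check from B11. */
final class OneSafetyCheck {
    /**
     * Returns true if no place labels two conditions of the given cut.
     * The list holds the place of every condition in the cut of one local configuration.
     */
    static boolean placesOccurAtMostOnce(List<String> placesOfCut) {
        Set<String> seen = new HashSet<>();
        for (String place : placesOfCut) {
            if (!seen.add(place)) {
                // A second condition with the same place means two tokens on one place.
                return false;
            }
        }
        return true;
    }
}
```

The check is linear in the size of the cut, so running it once per added event avoids any pairwise comparison of conditions.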

Heizmann commented 5 years ago

For benchmarks on the set above, CONDITION_PER_PLACE_MAX is between 90 and 25000, and CO_RELATION_MAX_DEGREE is between 3000 and 70000. CsvAutomataOperationStatistics107285100_2019-10-13_14-33-48-842.csv.txt java.hprof.txt The total runtime of the subTwoHourMarathon subset is 763s.

Heizmann commented 5 years ago

For the benchmarks in the set above: https://github.com/ultimate-pa/ultimate/commit/ae0a586ed02fa02de0cdf0cc5c54bfa94b436380 sped up the unfolding for larger benchmarks by factor 3 and reduced co-relation queries by up to 40%. (@naouarmehdi unfortunately, a side-effect is that benchmarks done before and after this commit are not comparable wrt. number of co-relation queries.) CsvAutomataOperationStatistics107285100_2019-10-13_14-57-02-088.csv.txt java.hprof.txt The total runtime of the subTwoHourMarathon subset is 310s.

Heizmann commented 5 years ago

B12: If we run the verification on larger examples, timeouts occur almost always in the FinitePrefix operations that are applied to a DifferencePairwiseOnDemand. This operation is done in order to enhance the Floyd-Hoare automaton on-demand (constructing the Floyd-Hoare automaton eagerly is no alternative for large programs).

Heizmann commented 5 years ago

B13: The subTwoHourMarathon subset is more relevant for program verification than the other four benchmarks in the set. It seems that we have different bottlenecks for both sets. The hprof output is dominated by the four more time-consuming but less important benchmarks.

Heizmann commented 5 years ago

The optimization mentioned in B07 reduced the runtime on the subTwoHourMarathon set to 250s but increased the runtime on some of the other four benchmarks. CsvAutomataOperationStatistics589166341_2019-10-20_03-48-32-850.csv.txt java.hprof.txt

Heizmann commented 5 years ago

B14: The optimization mentioned in B08 did not have any significant effect on the runtime for our benchmark set (hence the optimization is disabled by default). I am totally surprised. Maybe the other bottleneck (costs for comparing events wrt. ERV order) dominates everything.
B15: On the subTwoHourMarathon benchmark set, 50%-70% of all events are cut-off events. I was surprised about that from the start and I still do not have an intuition what this tells us about the structures of the unfolding or the net.
B16: Observation B15 hints that we should try (resp. fix) the fastpath optimization in the PossibleExtensions: Whenever we add an element to the queue, we do an additional cut-off check. If positive, the event is moved to the fastpath queue.
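
A fastpath queue in the sense of B16 might be structured as below. This is only a sketch under the assumption that a cut-off event never produces further extensions and can therefore be handed out without consulting the order. The class FastpathQueue and the isCutOff predicate are hypothetical and do not mirror the actual PossibleExtensions code.

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.Deque;
import java.util.PriorityQueue;
import java.util.function.Predicate;

/** Hypothetical sketch of a possible-extensions queue with a fastpath for cut-off events. */
final class FastpathQueue<E> {
    private final PriorityQueue<E> ordered;                // events that need the (expensive) order
    private final Deque<E> fastpath = new ArrayDeque<>();  // events already known to be cut-offs
    private final Predicate<E> isCutOff;

    FastpathQueue(Comparator<E> order, Predicate<E> isCutOff) {
        this.ordered = new PriorityQueue<>(order);
        this.isCutOff = isCutOff;
    }

    /** Does the additional cut-off check and routes the event to the matching queue. */
    void add(E event) {
        if (isCutOff.test(event)) {
            fastpath.addLast(event);
        } else {
            ordered.add(event);
        }
    }

    /** Hands out known cut-off events first; returns null if both queues are empty. */
    E remove() {
        return fastpath.isEmpty() ? ordered.poll() : fastpath.pollFirst();
    }

    boolean isEmpty() {
        return fastpath.isEmpty() && ordered.isEmpty();
    }
}
```

Events on the fastpath never enter the priority queue, so they cause no event comparisons. With 50%-70% cut-off events (B15) this could remove a large share of the comparisons, provided the check at insertion time is cheap.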

Heizmann commented 5 years ago

B17: Observation of @naouarmehdi in today's discussion. If we use a total order (e.g., the ERV order) and check if an already added event e1 is a companion of a not yet added event e2, we can omit the check e1<e2 because the result is always true.

Heizmann commented 5 years ago

The optimization mentioned in B17 reduced the runtime on the subTwoHourMarathon set to 175s. (!) CsvAutomataOperationStatistics589166341_2019-10-26_03-48-44-805.csv.txt java.hprof.txt

The maximal size of the possible extensions ranges from 147 to 2969 on the benchmark set.

naouarmehdi commented 5 years ago

B18: Let m be the marking of an event e. We don't need to add the pair (m,e) to mMarkingEventRelation in the BranchingProcess class if e is a cut-off event, since for some e' (the companion of e) the pair (m,e') is already in the relation and e'<e.
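
B17 and B18 together suggest a very small cut-off check. The following sketch assumes that events are added in increasing order of a total order and that a marking can be represented as a set of places (which presumes a 1-safe net). The class MarkingEventRelation here is hypothetical and only named after the field mentioned in B18.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the cut-off check from B17 with the bookkeeping from B18. */
final class MarkingEventRelation {
    // marking -> first (and hence smallest) event that was added with this marking
    private final Map<Set<String>, String> firstEventOfMarking = new HashMap<>();

    /**
     * Must be called for events in increasing order of a total order.
     * Returns the companion if the event is a cut-off event, otherwise null.
     */
    String addEvent(String event, Set<String> marking) {
        String companion = firstEventOfMarking.get(marking);
        if (companion != null) {
            // B17: the companion was added earlier, so companion < event holds without a comparison.
            // B18: (marking, event) is not stored because (marking, companion) is already present.
            return companion;
        }
        firstEventOfMarking.put(marking, event);
        return null;
    }

    /** Number of stored (marking, event) pairs. */
    int size() {
        return firstEventOfMarking.size();
    }
}
```

Because cut-off events are never stored, the map holds at most one event per marking and the check is a single hash lookup.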

Heizmann commented 5 years ago

B19: The optimization https://github.com/ultimate-pa/ultimate/commit/02c021460ced74ba067e1478aefa841f59aa68f0 reduced the runtime on the subTwoHourMarathon set to 148s. CsvAutomataOperationStatistics1957175487_2019-11-16_02-04-35-341.csv.txt java.hprof.txt Hence I will remove the boolean flag and the old code.

Heizmann commented 5 years ago

The optimization mentioned in B18 reduced the runtime on the subTwoHourMarathon set to 143s. CsvAutomataOperationStatistics1957175487_2019-11-16_02-32-34-977.csv.txt java.hprof.txt

Heizmann commented 5 years ago

An optimization not discussed here, related to commit https://github.com/ultimate-pa/ultimate/commit/192202ccb26caed483646b4455aa7b50d1a2efd7 reduced the runtime on the subTwoHourMarathon set to 135s. CsvAutomataOperationStatistics1957175487_2019-11-18_04-38-23-835.csv.txt java.hprof.txt BeforeAfterInterleaved.csv.txt

Heizmann commented 5 years ago

B20: CuttOffCheckingPossibleExtensions#firstbornCutoffCheck is now the most time-consuming method. So at least for this benchmark set, we can only improve if we find a way to compare two events more efficiently or if we find a different order.

Heizmann commented 5 years ago

B21: The optimization b9380eb reduced the runtime on the subTwoHourMarathon set to 105s. (!) CsvAutomataOperationStatistics1957175487_2019-12-01_00-40-36-580.csv.txt java.hprof.txt This optimization reduced the number of Event comparisons to around 40%-70%.

Heizmann commented 4 years ago

I accidentally ran the last benchmark without the firstborn cutoff check. If this check is also enabled, the runtime is only 96s. CsvAutomataOperationStatistics1513124396_2019-12-18_01-31-41-453.csv.txt According to the hprof output, removeMin is now probably the most time-consuming method.

Heizmann commented 4 years ago

B22: One reason for what is mentioned in D02 above: If we added a new condition c and consider candidates for possible extensions, we consider all outgoing transitions of c's place. However, some of the transitions might have a preset of places where not every place has (yet) a corresponding condition in the unfolding. This problem is especially important in program verification because there we construct a Petri net for a language difference on-demand and the operand has many dead transitions. For one difference operation this problem has been partly addressed in https://github.com/ultimate-pa/ultimate/commit/5bd9aded9d7e08764b880278f30eb242bc97ef54.

An optimization that I call the candidate preselection optimization: Let c be a condition and p its corresponding place. Instead of asking p for all successor transitions, we first compute the set P of all places p' for which there is a condition c' that is in co-relation with c and whose place is p'. Now we do not ask for all successor transitions of p but only for the successor transitions of p whose preset is a subset of P.
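
The candidate preselection can be sketched as a simple filter. The sketch makes two assumptions that are not stated above: the place p itself counts as available (it is instantiated by c), and the presets of all transitions are given as a map. In the real on-demand construction one would ask the net for successor transitions instead of scanning a map. The class CandidatePreselection is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the candidate preselection optimization. */
final class CandidatePreselection {
    /**
     * @param place           place p of the new condition c
     * @param coRelatedPlaces the set P: places of all conditions that are in co-relation with c
     * @param presets         transition -> preset (its set of predecessor places)
     * @return successor transitions of p whose preset is covered by P (plus p itself)
     */
    static List<String> preselect(String place, Set<String> coRelatedPlaces,
            Map<String, Set<String>> presets) {
        Set<String> available = new HashSet<>(coRelatedPlaces);
        available.add(place); // assumption: c itself instantiates p
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : presets.entrySet()) {
            Set<String> preset = entry.getValue();
            // Keep only successor transitions of p whose whole preset can be instantiated.
            if (preset.contains(place) && available.containsAll(preset)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

The filter is only a necessary condition: a surviving transition still needs conditions that are pairwise in co-relation, so the subsequent evolving of the candidate remains necessary.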

naouarmehdi commented 4 years ago

B23: We can improve the runtime of the computation of the minima of configurations by storing the distance of an event to the dummy event. The minimum of a local configuration is the set of events having distance 1 to the dummy event. If we call the removeMin method, then the new minimum is given by the set of events having distance 2 to the dummy event, and so on. PS: The distance of an event to the dummy event is not the number of ancestors. We define the distance of an event e to the dummy event, denoted by d(e), as follows:
- d(dummyEvent) = 0
- d(e) = max(d(e_1), ..., d(e_n)) +1 for e_1, ..., e_n the predecessors events of e
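
The definition of d(e) and the resulting layers of a local configuration can be sketched as follows. Events are plain strings, the predecessor map is an assumed input, and an event without an entry in that map is treated as the dummy event. The class EventDepth is hypothetical; an actual implementation would more likely store the depth once when the event is created.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of the distance d(e) from B23. */
final class EventDepth {
    /** d(dummyEvent) = 0 and d(e) = max(d(e_1), ..., d(e_n)) + 1 over the predecessor events of e. */
    static int depth(String event, Map<String, List<String>> predecessors,
            Map<String, Integer> cache) {
        Integer known = cache.get(event);
        if (known != null) {
            return known;
        }
        int max = -1; // no predecessors: the event is treated as the dummy event with depth 0
        for (String pred : predecessors.getOrDefault(event, Collections.emptyList())) {
            max = Math.max(max, depth(pred, predecessors, cache));
        }
        cache.put(event, max + 1);
        return max + 1;
    }

    /** Groups the events of a local configuration by depth; the layer with key 1 is its minimum. */
    static TreeMap<Integer, Set<String>> layers(Collection<String> localConfiguration,
            Map<String, List<String>> predecessors) {
        Map<String, Integer> cache = new HashMap<>();
        TreeMap<Integer, Set<String>> result = new TreeMap<>();
        for (String event : localConfiguration) {
            int d = depth(event, predecessors, cache);
            result.computeIfAbsent(d, k -> new TreeSet<>()).add(event);
        }
        return result;
    }
}
```

Iterating over the returned map in key order yields the successive minima that repeated removeMin calls would otherwise have to compute.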

Heizmann commented 4 years ago

This definitely makes sense. Maybe we can define a new total order as an alternative to the EsparzaRömerVogler order ≺_E from Section 5 of the paper. We could define χ (or ψ) as an alternative to φ. We define χ as a map from natural numbers to the sets of events whose depth is that number. Maybe let us call this order the depth-based order. Using the map χ, there are probably several possibilities to define the depth-based order such that it is total and adequate (see Definition 4.5). We can probably find a definition where we do not have to compute minima of suffixes of local configurations explicitly, but where we can just iterate over χ. We can implement χ in Ultimate as a TreeRelation.

Heizmann commented 4 years ago

The optimization 4b7a06995808403b91d8be427d953a6b1a496e48 reduced the runtime on the subTwoHourMarathon set to 52s. (Great! I had not expected that a reduction of the runtime by factor 2 is still possible.) CsvAutomataOperationStatistics1513124396_2019-12-19_23-08-43-238.csv.txt java.hprof.txt

Heizmann commented 4 years ago

Quick evaluation of bfbf37b after the bugfix def68b0. Runtime was reduced to 45s. CsvAutomataOperationStatistics1513124396_2019-12-22_00-03-22-204.csv.txt java.hprof.txt I compared the last two CSVs more carefully. The candidate preselection reduced the number of EXTENSION_CANDIDATES_TOTAL by around factor 10 and the number of EXTENSION_CANDIDATES_USELESS is close to 0. Most other numbers are unchanged but the NUMBER_EVENT_COMPARISONS is slightly different (difference is less than 1%). WithoutAndWithCandidatePreselectionInterleaved.csv.txt I do not have an explanation for this. Maybe events are added in a slightly different order.

naouarmehdi commented 4 years ago

B24: Based on the idea mentioned in B23:
- I implemented an improved version of EsparzaRoemerVoglerOrder: In this version we don't have to compute the suffixes of a configuration to find the minima of the suffixes: to compute the minimum of the i-th suffix, we just have to iterate over the events of the local configuration and pick the events satisfying the condition depth == i.
- I implemented a new (simpler) total and adequate order: DepthBasedOrder. This order brought a bigger speedup than the improved version of EsparzaRoemerVoglerOrder.

Heizmann commented 4 years ago

Some observations of Mehdi.

B25: After an event was added (and until the next event was added) all successor conditions of this event definitely have the same co-related conditions.
B26: When an event is added, we compute the co-relation information for its successors twice: once for getting the SuccessorTransitionProviders and a second time in the evolveCandidate method (if the BUMBLEBEE_B07_OPTIMIZATION is used).
B27: A good compromise between using the BUMBLEBEE_B07_OPTIMIZATION and not using the BUMBLEBEE_B07_OPTIMIZATION is probably to compute all co-related conditions for the direct successors of the event but to check the remaining conditions (that can be used as instances for places) pairwise. The old pairwise check is inefficient anyway because we use the isCoSet method and, since we instantiate places incrementally, we do co-relation checks for which the result is already known.

Heizmann commented 4 years ago

B28: I added a set of benchmarks that were produced while verifying some SV-COMP programs. A file was written if the unfolding took more than 5s. I do not have plans for fundamental modifications of the verification algorithm so these are probably typical important unfolding problems for a long time. A quick analysis based on the hprof output and our statistics: java.hprof.txt CsvAutomataOperationStatistics1343838143_2019-12-25_03-46-02-549.csv.txt
- CQA1: Because we use Java streams the stack depth is very high. I cannot see the original caller in this hprof output and I have to guess the original caller and my guess might be wrong.
- CQA2: Iterating over events seems to be the major bottleneck. B26 and B27 (above) may bring a significant speedup.
- CQA3: The number of useless extension candidates varies from benchmark to benchmark. But overall, the number is much smaller than in the past.
- CQA4: CO_RELATION_MAX_DEGREE is nearly as high as NUMBER_CONDITIONS. Hence there is always a condition that is co-related to almost all other conditions.
- CQA5: The number of conditions per event varies approximately between 3 and 30.

Heizmann commented 4 years ago

B29: The Java VM is probably rather slow in dealing with recursive method calls. We should probably replace the recursive method by an iterative method.
B30: We could determine the order in which transitions are instantiated by heuristics. E.g., instantiate places with one remaining condition immediately. Store the number of co-related conditions. If number is high then check co-relation pairwise and compute all co-related conditions only otherwise.
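
The replacement of recursion by iteration proposed in B29 can be illustrated on the instantiation of places with conditions. The sketch enumerates all ways to pick one condition per place such that the picked conditions are pairwise in co-relation, using an explicit index array instead of recursive calls. The class IterativeInstantiation and the inCoRelation predicate are hypothetical and do not reproduce the actual evolveCandidate method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

/** Hypothetical sketch: iterative backtracking over the conditions of each place. */
final class IterativeInstantiation {
    static List<List<String>> instantiate(List<List<String>> conditionsPerPlace,
            BiPredicate<String, String> inCoRelation) {
        List<List<String>> results = new ArrayList<>();
        int n = conditionsPerPlace.size();
        if (n == 0) {
            return results;
        }
        int[] next = new int[n];         // next[i]: index of the next condition to try for place i
        String[] chosen = new String[n]; // chosen[i]: condition currently picked for place i
        int level = 0;
        while (level >= 0) {
            List<String> options = conditionsPerPlace.get(level);
            if (next[level] >= options.size()) {
                next[level] = 0; // all options tried: reset and backtrack to the previous place
                level--;
                continue;
            }
            String candidate = options.get(next[level]++);
            boolean compatible = true;
            // Only compare against earlier picks; pairs among them were checked before.
            for (int i = 0; i < level && compatible; i++) {
                compatible = inCoRelation.test(chosen[i], candidate);
            }
            if (!compatible) {
                continue;
            }
            chosen[level] = candidate;
            if (level == n - 1) {
                results.add(new ArrayList<>(Arrays.asList(chosen)));
            } else {
                level++;
            }
        }
        return results;
    }
}
```

Ordering conditionsPerPlace so that places with few remaining conditions come first would be one simple way to experiment with the heuristic from B30, since the order of the list determines the order of instantiation.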

Heizmann commented 4 years ago

Evaluation of the new (final) Christmas benchmark set for the code of ab37154. Runtime 397s. CsvAutomataOperationStatistics781527109_2019-12-26_22-29-42-686.csv.txt java.hprof.txt

Heizmann commented 4 years ago

4336eb1 reduced the runtime on the Christmas benchmark set to 256s. CsvAutomataOperationStatistics781527109_2019-12-26_22-47-48-939.csv.txt java.hprof.txt Streaming co-related conditions seems to be the most time-consuming operation.

Heizmann commented 4 years ago

The ERV2 order would reduce the runtime on the Christmas benchmark set to 227s. CsvAutomataOperationStatistics781527109_2019-12-26_22-58-08-197.csv.txt java.hprof.txt

Heizmann commented 4 years ago

The DBO order would reduce the runtime on the Christmas benchmark set to 242s. CsvAutomataOperationStatistics781527109_2019-12-26_23-42-22-868.csv.txt java.hprof.txt

Heizmann commented 4 years ago

If we would not add cut-off events to the BranchingProcess this would reduce the runtime on the Christmas benchmark set to 131s. (!) CsvAutomataOperationStatistics781527109_2019-12-27_00-11-24-371.csv.txt java.hprof.txt However, before we enable this by default we have to discuss with @maul-esel if the large block encoding requires cut-off events (if this is the case we have to add a parameter to the PetriNetUnfolder, otherwise we will just change the value of the boolean flag).

naouarmehdi commented 4 years ago

B32: A side effect of B07 is that the update method of the correlation class has very high costs (because of the multiple streamCoRelatedEvents calls). The main motivation for B07 was to efficiently compute the sets of correlated conditions that correspond to a specific place, for use in the "evolve candidate" method. Since 4336eb1 we check correlated conditions in the "evolve candidate" method only pairwise and don't need B07 anymore.
B33: As mentioned above, we don't have to consider correlated events and conditions in the construction of the complete finite prefix. It would make sense to store correlated events and correlated non-cut-off events separately so that we can compute correlated non-cut-off conditions efficiently.

Heizmann commented 4 years ago

The optimization mentioned in B32 reduced the runtime on the Christmas benchmark set from 256s to 65s. (I double checked that the runtime without the optimization is really 256s and the speedup is not caused by something else.) CsvAutomataOperationStatistics781527109_2020-01-04_02-32-39-141.csv.txt java.hprof.txt

This is (absolutely great but) a surprise to me. Maybe there is a problem with our implementation of the HashRelation3.

Heizmann commented 4 years ago

I added a testsuite 1a33ac41417d8a76c3308360c02d5568c1c3d653 for evaluating the impact of the improvements discussed here on our program verification. I am unsure how we should organize this information (separate issue, here, something else). For now, I will just call related observations "Lyx Observations" and post them here.

L01: The major bottleneck for the program verification is related to Petri net unfoldings. When we look at the IncrementalLogWithBenchmarkResults, we can conclude this from the fact that most time is spent on AutomataDifference but not on Hoare triple checks and predicate comparisons (both are done on demand while an unfolding drives forth the construction of the differences).
L02: The optimization mentioned in B32 reduced the runtime for the svcomp-Reach-32bit-Automizer_Default-noMmResRef-PN-NoLbe.epf settings only from 700s to 564s which hints that a huge (unnecessary?) bottleneck is hidden in our on-demand construction.
L03: Usually, the hprof output is spoiled by the threads that do the communication with the SMT solvers and only useful if we only use SMTInterpol (doable but has some disadvantages). Here, this is not the case and we can see expensive Petri net operations.
L04: Using the svcomp-Reach-32bit-Automizer_Default-noMmResRef-PN-NoLbe.epf settings, the hprof output shows that most time is spent at IPetriNet.getSuccessorTransitionProviders(IPetriNet.java:95). This is related to B33, but I have to investigate further before I can draw conclusions. java.hprof.txt

Heizmann commented 4 years ago

L05: The Christmas benchmark set contains Petri nets that are the result of the on-demand construction (PairwiseDifference) and Petri nets that are the result of RemoveUnreachable after the construction of the inverted difference. In 595fd35 I added a new benchmark set that contains Petri nets that are the result of an inverted difference but before RemoveUnreachable. An analysis of these Petri nets showed that only around 10% of the transitions are reachable.

Heizmann commented 4 years ago

I added new benchmarks for the evaluation of the unfolding. These benchmarks mimic what is needed during program verification. The unfolding is not run by the operation FinitePrefix but by the operation DifferencePairwiseOnDemand, hence we have to use the DifferencePetriNwaBenchmark testsuite. Statistics about the unfolding are added to the statistics of the DifferencePairwiseOnDemand operation.

Heizmann commented 4 years ago

B34: There is another application for which we need that cut-off events are added to the co-relation, namely the computation of vital transitions. We call a transition vital if it occurs in some accepting firing sequence of the Petri net and the computation of vital transitions is e.g., needed by the RemoveDead operation.
L06: The plan for the program verification is to do only one unfolding per iteration (implementation nearly finished). We also want to remove dead (non-vital) transitions in each iteration, hence we always need cut-off events in the co-relation.

Heizmann commented 4 years ago

* B32: A side effect of B07 is that the update method of the correlation class has very high costs (because of the multiple streamCoRelatedEvents calls). The main motivation for B07 was to efficiently compute the sets of correlated conditions that correspond to a specific place, for use in the "evolve candidate" method. Since [4336eb1](https://github.com/ultimate-pa/ultimate/commit/4336eb154623c8d73602d699b6c2e7f319a1a0ba) we check correlated conditions in the "evolve candidate" method only pairwise and don't need B07 anymore.

I agree. But I want to add that we do not need B07 because we do an on-demand construction of successor transitions. If we would know all predecessor places of an outgoing transition in advance, it might be useful to compute only co-related conditions that have a certain place.

Heizmann commented 4 years ago

B35: For the materialistic benchmark set, memory consumption matters. If I run DifferencePetriNwaBenchmark I get one OOM and several cases of high CPU load on several CPU cores because of the garbage collection of the JVM. Hence, I will use the JVM argument -Xmx16g in the following experiments.

Heizmann commented 4 years ago

If we only use the optimizations that are yet enabled by default, the runtime of DifferencePetriNwaBenchmark on the materialistic benchmark is 5595s. CsvAutomataOperationStatistics1461347900_2020-02-18_01-58-39-527.csv.txt java.hprof.txt

Heizmann commented 4 years ago

The optimization mentioned in B32 reduced the runtime on the materialistic benchmark set from 5595s to 512s. CsvAutomataOperationStatistics1461347900_2020-02-18_01-58-39-527.csv.txt java.hprof.txt

Heizmann commented 4 years ago

If we additionally use the optimization of 81c6a12 the runtime is reduced to 451s. CsvAutomataOperationStatistics1461347900_2020-02-18_04-01-33-606.csv.txt java.hprof.txt

Heizmann commented 4 years ago

In 3b27a0c I added a new testsuite for evaluating our optimizations directly within the program verification. Ideas:

- Use only benchmarks that are solved by most settings.
- Use only benchmarks that are not solved in less than 5s.
- Evaluate the speed, not the number of solved benchmarks (because hopefully all of these benchmarks are soon solved by all settings).

Rules:

- We will not change the benchmarks.
- We will use the setting svcomp-Reach-32bit-Automizer_Default-noMmResRef-PN-NoLbe.epf.
- We will use a timeout of 120s.
- We will use the runtime shown by Eclipse.

Warnings:

- The overall verification performance is affected by many algorithms in Ultimate.

Heizmann commented 4 years ago

At the moment (3b27a0c) the runtime is 496s. java.hprof.txt IncrementalLogWithBenchmarkResults_2020-02-25_02-41-33-839-incremental.log

Heizmann commented 4 years ago

B36: Neither the number of places nor the number of transitions is very suitable to measure the size of our Petri nets because in the program verification the number of incoming and outgoing arcs of transitions is increasing in each iteration. As an alternative, I propose that we use the size of the flow relation.
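
The size measure proposed in B36 is easy to state in code. The sketch assumes that the net is given as two maps from transitions to their preset and postset; the class FlowRelationSize is hypothetical and does not use the IPetriNet interface.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the size measure from B36. */
final class FlowRelationSize {
    /** Number of arcs: for every transition, its incoming plus its outgoing arcs. */
    static int size(Map<String, Set<String>> presets, Map<String, Set<String>> postsets) {
        int arcs = 0;
        for (Set<String> preset : presets.values()) {
            arcs += preset.size();   // arcs from places to the transition
        }
        for (Set<String> postset : postsets.values()) {
            arcs += postset.size();  // arcs from the transition to places
        }
        return arcs;
    }
}
```

Unlike the number of places or transitions, this value grows when an iteration adds arcs to existing transitions, which is the effect described in B36.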

Heizmann commented 4 years ago

Same experiment as two comments before but this time with the new definition of size. Runtime was 497s. java.hprof.txt IncrementalLogWithBenchmarkResults_2020-02-25_03-50-47-316-incremental.log

Heizmann commented 4 years ago

Same experiment after 90f2a0cc5930582937fffd6e33bfb32ad7a64a42 and with mRemoveRedundantFlow set to true. Runtime was 558s. java.hprof.txt IncrementalLogWithBenchmarkResults_2020-02-26_02-31-25-428-incremental.log

Heizmann commented 4 years ago

Same experiment with backfolding (and a bunch of immature local changes that I needed for the integration). Runtime was 535s. java.hprof.txt IncrementalLogWithBenchmarkResults_2020-02-26_04-53-50-235-incremental.log Interleaved Log: 1. default, 2. with RemoveRedundantFlow, 3. with Backfolding and RemoveRedundantFlow. We see that the combination of Backfolding and RemoveRedundantFlow reduces the time that is needed for the difference operation. Next, I will improve the integration (e.g., reuse existing unfoldings) in order to improve the runtime of the additional operations.

Heizmann commented 4 years ago

b8661f94447577f9571590a76d767fcbe40cdfae reduced the runtime on the materialistic benchmark set to 369s (from 512s) java.hprof.txt CsvAutomataOperationStatistics1461347900_2020-02-29_16-23-21-647.csv.txt

Heizmann commented 4 years ago

b8661f9 reduced the runtime of the Svcomp20AutomizerConcurrentSpeedBenchmarks from 497s to 470s (given the fact that 3*120s=360s are spent on timeouts, this is a significant improvement). java.hprof.txt IncrementalLogWithBenchmarkResults_2020-03-01_00-36-53-290-incremental.log
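
The remark about the timeouts can be made concrete with a short calculation. It assumes that the same three benchmarks ran into the 120s timeout both before and after b8661f9; the comment only states the 360s, so this is an assumption.

```latex
\begin{align*}
t_{\text{before}} &= 497\,\text{s} - 3 \cdot 120\,\text{s} = 137\,\text{s} \\
t_{\text{after}}  &= 470\,\text{s} - 3 \cdot 120\,\text{s} = 110\,\text{s} \\
\frac{t_{\text{before}} - t_{\text{after}}}{t_{\text{before}}} &= \frac{27\,\text{s}}{137\,\text{s}} \approx 19.7\%
\end{align*}
```

Under this assumption the time spent on the benchmarks that do not time out drops by roughly a fifth, although the total runtime changes by only about 5%.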

Heizmann commented 4 years ago

Some commit(s) of the last two weeks (probably the refactoring of conditions) reduced the runtime on the materialistic benchmark set to 284s. java.hprof.txt CsvAutomataOperationStatistics1772471998_2020-03-14_04-40-04-036.csv.txt

ultimate-pa / ultimate

Slow Petri Net Unfolding (the Bumblebee Observations) #448