oracle-samples / clara-rules

Forward-chaining rules in Clojure(Script)
http://www.clara-rules.org
Apache License 2.0
1.2k stars 115 forks source link

Accumulators perform unnecessary retractions when right-(activate/retract) doesn't change the final result #182

Closed WilliamParker closed 8 years ago

WilliamParker commented 8 years ago

Currently, when there is a right-activation or right-retraction on an AccumulateNode or an AccumulateWithJoinFilterNode, for every token that matches the join bindings with the new element, the previous accumulator result is retracted and the new result is then re-propagated. [1, 2, 3, 4] This introduces inefficiencies in that downstream activity in the Rete network is triggered regardless of whether the accumulator result actually changed due to that right activation/retraction. At least in our case, the majority of accumulators choose one fact out a collection. Unfortunately, the pattern I'm seeing is in production data that can't be shared here, but from tracing it looks like we have a large number of right activations that aren't actually changing the result from accumulators.

Ideally, batch propagation of facts through the alpha nodes [5, 6] would minimize the number of times both accumulator nodes are right-activated, since they can be activated with multiple elements at a time. However, depending on the rule order scenario, it is possible to degrade to considering elements that are accumulated over one at a time, which is what appears to happen in this case. This could also come up if data was externally inserted (i.e. via [7]) with multiple insert calls; we had some test code that used one-at-a-time insertion of facts and had dramatically degraded performance as a result (we have since fixed this to insert all external input facts at once).

;;; That is the inefficient
(def facts [a b c])
(-> session (insert a) (insert b) (insert c))
;;; instead of the more efficient
(insert session a b c)

Some systems might not have all data upfront, marking the first approach in the example above impossible to implement.

Accumulators can of course have a wide variety of forms. However, in practice, at least in our case, the large majority are basically "take a collection of facts and choose one" accumulators. My suspicion is that such accumulators are a common use case for others as well (anyone with other use-cases and thoughts, please feel free to weigh in here). I propose changing both AccumulateNode and AccumulateWithJoinFilterNode to not send anything downstream if the change in elements did not change any final results.

Currently the flow is something like:

My suggestion is to make the flow something like:

Note several features of this:

I have a (very rough) working model of this change at https://github.com/WilliamParker/clara-rules/commit/7dfd2d357ac9a67963b5efc843a024be5fe9b82b

Some notes on this implementation:

(doseq [t tokens]
    (retract t))
(doseq [t tokens]
        (send t))

to one of

(doseq [t tokens]
     (retract t)
     (send t))

It ends up being easier this way due to the need to filter elements against tokens with an arbitrary to join function, but the previous order could be done too if it matters.

Against three cases that exhibited the behavior described earlier of accumulator retractions and propagations that didn't actually change the final result I got the following performance numbers:

Baseline (Clara 0.11) With not= check With not identical check
Case 1 39.2 seconds 7.5 seconds 7.8 seconds
Case 2 95.1 seconds 6.6 seconds 9.5 seconds
Case 3 55.7 seconds 10.5 seconds 26.6 seconds

What are your thoughts on the approach, desirability of this, etc.? It is somewhat specific to particular use cases, but ones that I know are very common for us and I would guess probably for others as well. Also, would you want something in clara-benchmark to demonstrate the problem? I don't see anything in there that would expose this currently.

  1. https://github.com/rbrush/clara-rules/blob/0.11.0/src/main/clojure/clara/rules/engine.cljc#L786
  2. https://github.com/rbrush/clara-rules/blob/0.11.0/src/main/clojure/clara/rules/engine.cljc#L831
  3. https://github.com/rbrush/clara-rules/blob/0.11.0/src/main/clojure/clara/rules/engine.cljc#L949
  4. https://github.com/rbrush/clara-rules/blob/0.11.0/src/main/clojure/clara/rules/engine.cljc#L1010
  5. https://github.com/rbrush/clara-rules/blob/0.11.0/src/main/clojure/clara/rules/compiler.clj#L1430
  6. https://github.com/rbrush/clara-rules/blob/0.11.0/src/main/clojure/clara/rules/engine.cljc#L199
  7. https://github.com/rbrush/clara-rules/blob/0.11.0/src/main/clojure/clara/rules/engine.cljc#L1153
  8. https://github.com/rbrush/clara-rules/blob/0.11.0/src/main/clojure/clara/rules/memory.cljc#L271
  9. https://github.com/clojure/clojure/blob/clojure-1.8.0/src/jvm/clojure/lang/APersistentMap.java#L52
mrrodriguez commented 8 years ago

Side note on this line:

Some systems might not have all data upfront, marking the first approach in the example above impossible to implement.

I think we actually could change this to make all external insert/retracts lazy and not be done until fire-rules is caused. However, that is a bigger change and not related to this issue. Just pointing it out for the sake of mentioning that we could be a handle scenarios like that in way that didn't mean consumers needing to change how they are inserting things.

WilliamParker commented 8 years ago

@mrrodriguez , we could do that, but I think I was a bit unclear. That would solve some problems, but I was also thinking of something like

(def app-state (atom empty-session))
(defn listener-to-something [fact]
     (swap! app-state (fn [s] (-> s (insert fact) (fire-rules)))))
;; Stuff that depends on app-state (say something like React)

Or, more broadly, of any case where the session state after each individual insertion call actually matters, as opposed to just being a question of what functions are called. In that case caching/laziness on insertions wouldn't help since fire-rules would be called after those insertions.

rbrush commented 8 years ago

This approach sounds good. The fact that the not= check will only fire in instances where we'd only propagate changes through the network anyway makes it seem safe, since the not= check will probably be dominated by cost of the Rete operations.

I think we could also do something lazy along @mrrodriguez's idea (and do so safely by making sure the laziness isn't visible to callers), but the changed proposed here seems safer and generally beneficially.

Adding a benchmark in clara-benchmark is a good idea. Ideally that library would grow into a good way to measure performance (and prevent perf regressions) over time.

WilliamParker commented 8 years ago

I've submitted a pull request for a benchmark at https://github.com/rbrush/clara-benchmark/pull/1

The current master Clara ran it in 2.3 seconds on my machine; with my working model for these changes it got down to 34 milliseconds. Drools took about 2.5 milliseconds.

mrrodriguez commented 8 years ago

@WilliamParker Just FYI. I've went through that pull request and have some comments and questions.

WilliamParker commented 8 years ago

@mrrodriguez I commented on the diff. Once we're on the same page regarding all the comment threads I'll make changes.

WilliamParker commented 8 years ago

I've looked at the comment that @mrrodriguez made on the benchmark pull request about wanting to build the ArrayList that wraps each fact to be inserted (so 100 ArrayList objects in total) outside the benchmarked code. There is something of a problem in that with the way the IBenchmark interface is written the Java type system is going to have a problem with the facts in

public void run(T session, Iterable<Object> facts) throws Exception;

each being an iterator that is inserted into [1], since they are just Objects. I see several possible resolutions to this:

A. Most obviously, I could just cast as necessary to make the new benchmark work. B. I could make the type of each thing in "facts" generic. The IBenchmark interface would be changed to something like [2]. C. I could add equivalent calls to the equivalent Drools benchmark to put them on equal footing.

I don't have a strong preference; since these benchmark objects are only called from Clojure anyway (and thus type guarantees don't add a lot of value here IMO) I slightly lean toward A. I don't mind implementing B though, I'd just like to run it by @rbrush first since it is a bit more invasive. C is similar to A in that it is a very quick solution, but of the two quick solutions I tend to lean toward A since C would still skew the ratio of the times if not their absolute difference.

  1. https://github.com/rbrush/clara-rules/blob/0.11.0/src/main/java/clara/rules/WorkingMemory.java#L23
  2. https://gist.github.com/WilliamParker/c93eef9cf4522b29f430c9445f7855da
WilliamParker commented 8 years ago

One more thought - the main downside I see with A (just casting Object to Iterable) is if we were trying to maintain type safety for something that isn't Clojure to interact with these IBenchmark instances.

rbrush commented 8 years ago

I think A is fine, at least for now. I'd be fine with B as well, but keeping things small and simple for this is appealing, it doesn't add any long-term complexity, and we can always revisit this in a subsequent change.

WilliamParker commented 8 years ago

I've updated the pull request with Option A (using a cast) implemented along with changes for the other comments. cc @mrrodriguez

mrrodriguez commented 8 years ago

+1 Looks fine to me now.

WilliamParker commented 8 years ago

I've been working on a pull request for Clara for this and a question of what the behavior around accumulated values of nil has come up.

The current behavior is as follows:

  • If an initial value is nil, do not propagate that initial value. [1]
  • If there are elements that reduce to an accumulated value of nil, propagate anyway. [2, 3]
  • However, if we later retract something that causes retraction (but not conversion) to nil, retract the previous result (possibly also nil) and don't propagate anything else. [4]

I created a test for this behavior at https://gist.github.com/WilliamParker/3858c7965138c808533cc2ec6b4e9b85

This also appears to be the case for the AccumulateWithJoinFilterNode implementation [5, 6] Note that if there are filtered facts, then there is a result of do-accumulate, even if a nil one.

This behavior is inconsistent in that the end result for any given set of elements and tokens depends on previous ordering of propagations and retractions. It is currently not easy to resolve this inconsistency for AccumulateNode since the node has no visibility to whether a retraction is logically a retraction to the initial value or a retraction to a state that reflects the presence of some remaining elements. It also seems incorrect to me to dispatch logic based on the accumulated or retracted value rather than the converted value, which is what downstream nodes will use.

My initial approach was to shift this to use the converted-return value but otherwise preserve the existing behavior. However, as @mrrodriguez pointed out on the commit and offline conversations, this complicates the code and is of dubious correctness anyway. Another option is to simply never propagate a value of nil (this could be done on either the accumulated/retracted value or the converted value, but I would favor the converted value). At this point I personally leaning toward not propagating nils. There is some downside though in that there could be use-cases where nil was a meaningful return value.

@rbrush , I was wondering what your thoughts on this are, as well as the originally intended semantics behind nil accumulated values. Is this something that is being used and that we need to support, or is the ability to propagate nil values in some cases just an unintended accident of the current engine implementation?

  1. https://github.com/rbrush/clara-rules/blob/0.11.1/src/main/clojure/clara/rules/engine.cljc#L719
  2. https://github.com/rbrush/clara-rules/blob/0.11.1/src/main/clojure/clara/rules/engine.cljc#L782
  3. https://github.com/rbrush/clara-rules/blob/0.11.1/src/main/clojure/clara/rules/engine.cljc#L683
  4. https://github.com/rbrush/clara-rules/blob/0.11.1/src/main/clojure/clara/rules/engine.cljc#L821
  5. https://github.com/rbrush/clara-rules/blob/0.11.1/src/main/clojure/clara/rules/engine.cljc#L1004
  6. https://github.com/rbrush/clara-rules/blob/0.11.1/src/main/clojure/clara/rules/engine.cljc#L833
rbrush commented 8 years ago

The current semantics around propagating nil are more of an accident than a design. Agreed we should address this, either always or never propagating nil.

Are there any valid use cases we can think of for propagating nil? In the (rare?) situation where this might be desired, I'd think that a user could just use some sort of :no-value keyword to explicitly indicate their intention here, which probably would make rules more readable in general.

WilliamParker commented 8 years ago

The only cases I can think of right now that wouldn't be rule-writing antipatterns anyway are

  • Something like getting the maximum value of a field (for example an ID field) and then binding that value to the result-binding, not the fact itself.

Say

(defrecord Order [id])
(defrule max-id-rule
     [?max-id <- (max-id) :from Order]
=> ...
)

Nil might be considered the "least id" and still be used. The accumulator could just handle that in the convert-return-fn easily enough with a :no-value keyword as you describe though.

rbrush commented 8 years ago

Fair points.

Thinking about this a bit more, and applying the Principle of Least Surprise, I'd imagine I'd be surprised as a new accumulator developer if I returned nil and it didn't propagate...and that surprise could lead to bugs. So perhaps I'm shifting to the "always propagate nil" camp just to avoid surprising behavior from the perspective of the user. Would you have a similar intuition here? If so, I think that strengthens the case to propagate nil.

We'd have to change how our initial-value is checked for consistency. This could be done by using a (contains? :initial-value accumulator-def) rather than a nil check. (We would probably have to switch the Accumulator definition to using a simple map rather than a record, since a record with the initial-value field would always have something there.)

mrrodriguez commented 8 years ago

I'm going to say I'm opposed to propagating nil entirely. It is confusing and counterintuitive (to me) to the idea that nil is "falsey" in Clojure. I've personally had situations like a NPE resulting from a nil being propagated when not expected before. It took me a while to understand what was going on at that time. In cases where you want to propagate something, but not "based" on the accumulated items, it seems fairly straightforward to have :convert-return-fn return a :no-value sort of marker.

The Clojure seq analogy is somewhat interesting, but I think it is just as easy in that rare case to make your :initial-value actually be an object, not nil. I consider this situation to be analogous to Clojure truthy/falsey sementics than to the semantics of what is "seqable" or not.

I've also had feedback from others around nil propagating and it never has seemed to be what they expect. There are also further implications with the current implementation of AccumulateNode where it doesn't actually know if the currently accumulated value + :convert-return-fn is the :intiial-value or not. So there will be ambiguity here in what it means.

Basically, propagating nil to me is just putting the rule propagation into the ambiguous case where nil is used to mean either nothing matched or things matched, but we want to still return nothing for that match. This is the common issue with nil/null and this is why we have patterns like the "null object pattern".

I'm worried about propagating more nils through. I think that may be a breaking change in many situations. Although, in the engine's current state, I think there is inconsistency in some cases on when nil may or may not propagate along. One way to fix this is to make the propagation decision fully dependent on the :convert-return-fns result, which we didn't have originally.

We can't really draw analogy here to Drools. Accumulators are structured differently. If you "convert return fn" of Drools returns nil it actually does nothing at all. Which can lead to weird truth maintenance issues. The solution to that, is to propagate a "no value" value instead of nil when you intend to propagate something that is "like" nil, but not nil.

stephenbrady commented 8 years ago

Weighing in here, I ran into this nil propagation issue as well. It's vey surprising, and frankly my take away on accumulators has been that they are fundamentally broken because of this behavior. In general, this gets into a wider confusion I've had about whether Clara is "edge triggered" or "level triggered". My take on this subject is that accumulators "coerce" edges into a level and therefore if there is no change in the accumulator's state (level), it should "swallow the edge" so to speak and trigger nothing.

rbrush commented 8 years ago

So I think there are two issues being discussed here:

  1. The original issue reported, that accumulators should only fire when the returned value changed. (I think this is what @stephenbrady is referring to.) I feel like there is consensus that accumulators should only propagate new values when the result of the accumulator changes. So the sum accumulator may change inputs, but if it adds up to the same number, we don't need re-send/propagate the identical result. Is this fair?
  2. We need to nail down the behavior of accumulators that could return nil. So what would this group expect to happen in this case? Propagate it, swallow it (and retract previous results?), or throw an exception? (I realize we might need to clean up these checks around the convert-return-fn logic...but for now I think we just need to figure out what the right behavior is.) Clara should aim to be completely unsurprising to users, so I'm looking for input on the approach that creates the fewest surprises.
WilliamParker commented 8 years ago
  1. Unless I'm missing something, the changes for this part (the original issue I logged) are completely invisible to the user. Our current behavior is just retracting and then propagating the exact same thing, which shouldn't have any impact on the final state of the session. It would impact listeners, in that some Rete operations wouldn't happen anymore and therefore wouldn't be logged, but making application logic depend on the Rete internals in that way is not something I see any reason to support. This part should be purely a performance optimization.

2a. This issue with nil came up because the logic to replicate the existing behavior adds some complexity to the prototype I have for (1). Setting aside whether suppressing nil is desirable, it would have the effect of simplifying these changes, which was my and I believe @mrrodriguez 's reason for bringing it up as part of this issue.

2b. Regarding what to do, I'm not completely certain either way and would want to think on it some more before committing to major changes. I can see some argument for both. I'm concerned though that up to now using a nil initial-value has been a common idiom for suppressing output, so making nil propagate by default would be a significant breaking change. (It would also be more work in Clara, but to me the external API is the more important part here).

2c. While it does add a modest amount of complication, simply preserving existing behavior in changes for (1) is a viable option. I'm leaning toward this option and limiting the scope creep here; even if I end up being the one making further accumulator changes for the issues in (2) there would be significant benefit IMO from getting our existing performance enhancements into the mainline in the near future so that further performance enhancements can be evaluated in the context of the entirety of other improvements rather than having various isolated feature branches. (From our perf numbers, I suspect that we will be spending some more time looking at performance enhancements after this issue and https://github.com/rbrush/clara-rules/issues/184 )

rbrush commented 8 years ago

I think the original issue here will have changes visible to the user if a rule right-hand-side has side effects. By eliminating propagation of unchanged values, we eliminate activations on the RHS that could take arbitrary action. I consider this to be an improvement (since the inputs did not change from the view of the RHS), but it is a behavior change.

I'm also torn here on the nil propagate. I had thought of propagating nils as following with Clojure conventions, but hearing @mrrodriguez mention the real bugs caused by this and the breaking implications are a big deal. So I'd support squashing nil propagation outright if you guys (who use this library in anger much more than I do) feel like that's best.

WilliamParker commented 8 years ago

@rbrush , I agree now that if the RHS had side effects there would be behavior changes. I wasn't thinking of that case, but if rules were controlled with salience and/or written in ways that avoided dependence on truth maintenance it would work. Frankly, I think running queries and detecting differences between successive iterations of firing the rules would probably be a less error-prone pattern, since any changes to truth maintenance or just fluctuations from run to run would potentially alter behavior unless rules were carefully constrained, but it is a behavior change.

For my part, I'd like to mull over the nil accumulator issue a bit.

mrrodriguez commented 8 years ago

I think relying on RHS side-effects to behave in a predictable way is going to be completely up the consumer to use salience to do so. We have these cases. With or without Will's change here, I don't think there is any reason to assume your side-effect is not going to be done any fixed count of times. The batch propagation in the engine has many possible ways that TMS could cause the batches to come at different times and the RHS could fire an unknown number of times in the absence of any user-defined priority/salience. It is a good note to have with this change though that the RHS could fire less etc. That is a good point.

On the nil propagation issue, and my past history dealing with mysterious nil propagations, I'm for the idea of never propagating nil after calling the :convert-return-fn. If you propagated some actual value at some interim point, and later the accumulator's convert retrun value is nil, then you do retract the previous work done, since the accumulator state has changed, but there is nothing to propagate as a replacement. So the rule/query LHS effectively goes from being fully-satsified and matched to no longer satisfied and matched.

Just given the feedback I've had from others and my own personal thoughts on it, this is what I'd most expect to happen. I agree I don't think we can make Clara completely unsurprising to everyone in every case though. (Unfortunately :) This sounds like a good enhancement to our docs and examples for Clara likely to help clarify these edge cases.

The other point I have here is speaking to Will's concern of scope creeping in on his optimizations. I'd be fine with making this nil propagation a separate issue. The only hesitation I had there was in worrying that these optimizations are trying to implement an inconsistently defined current behavior. So I've been trying to be confident they haven't made the situation worse etc. However, I haven't came across anything specifically wrong that wasn't "wrong" before.

WilliamParker commented 8 years ago

I gave this some thought overnight. I support not propagating nil and retracting the previous result if any when we retract to a nil converted result. My reasoning is as follows:

  • Thinking over conversations with people on our team, my experience is that people expect nil to not propagate and I think they tend to find propagation of nil very surprising.
  • My intuition is that while nil has some meaning of an empty collection under other circumstances (e.g. some Clojure core functions that operate on collections), its deeper meaning of the absence of anything is more consistent with saying that we have nothing to propagate.
  • Propagating nil initial values would be a major breaking change to the accumulator API, which currently reliably suppresses nil initial values.
  • The current behavior of sometimes propagating nil when items are present is not reliable (since retractions break it), so we wouldn't be breaking code that is currently reliable by suppressing such propagation.
  • If we do want accumulators to propagate nil, having the convert-return-fn return something, probably a keyword, denoting "no value but propagate" is a reasonable fix for a small number of use cases. If we have enough future use cases for this, it would be easy enough to add support for an optional key ":propagate-nil" to accumulator definitions to change this behavior for individual accumulators. I think we can hold off on something like this until we have concrete use cases to justify it though.

Thoughts?

mrrodriguez commented 8 years ago

+1 From me on that last comment.

rbrush commented 8 years ago

Alright, +1 for @WilliamParker's proposal as well. Thanks for considering this so thoroughly.

WilliamParker commented 8 years ago

I'll plan on making that change (to not propagate nil accumulator results) with this issue.

WilliamParker commented 8 years ago

I have submitted a pull request for this at https://github.com/rbrush/clara-rules/pull/191

Notes on the implementation are on the pull request.

WilliamParker commented 8 years ago

Second pull request with changes (after I accidentally closed the first one) at https://github.com/rbrush/clara-rules/pull/193

WilliamParker commented 8 years ago

I am closing this since the pull request has been merged and I don't think there is anything else to do here.