vmware / differential-datalog

DDlog is a programming language for incremental computation. It is well suited for writing programs that continuously update their output in response to input changes. A DDlog programmer does not write incremental algorithms; instead they specify the desired input-output mapping in a declarative manner.
MIT License

Don't aggregate collections twice #923

Kixiron opened this issue 3 years ago (status: Open)

For relations that are both distinct and output, the data gets aggregated twice: first by the distinct/threshold_total and then again by the final consolidation. Skipping the redundant consolidation saves a non-trivial amount of time (over 2 seconds on the citations benchmark) and removes an entire arrangement. Arguably the consolidation could be removed entirely, but we'd probably need a better output update scheme before that becomes viable without thrashing maps.

This input program exhibits the behavior:

input relation Citation(c1: istring, c2: istring)
output relation CoCitation(c1: istring, c2: istring)

CoCitation(c1, c2) :- Citation(p, c1), Citation(p, c2), c1 < c2.

CoCitation will be aggregated twice, nearly doubling its runtime.
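
At the differential-dataflow level, the plan for such a program ends in exactly this distinct-then-consolidate shape. Below is a minimal hand-written sketch of that shape in plain differential-dataflow (not DDlog's generated code; the sample inserts, the u64 timestamps, and the ~0.12 crate version, where consolidate is an inherent Collection method, are assumptions):

use differential_dataflow::input::InputSession;
use differential_dataflow::operators::{Join, Threshold};

fn main() {
    timely::execute_directly(move |worker| {
        let mut citations = InputSession::<u64, (String, String), isize>::new();

        worker.dataflow(|scope| {
            let citation = citations.to_collection(scope);

            // CoCitation(c1, c2) :- Citation(p, c1), Citation(p, c2), c1 < c2.
            citation
                .join(&citation)
                .filter(|(_p, (c1, c2))| c1 < c2)
                .map(|(_p, pair)| pair)
                // First aggregation: set semantics via distinct (threshold_total inside).
                .distinct()
                // Second aggregation: the consolidation this issue proposes to skip,
                // since distinct's output already comes out of an arrangement.
                .consolidate()
                .inspect(|x| println!("CoCitation: {:?}", x));
        });

        citations.advance_to(0);
        citations.insert(("p".to_string(), "a".to_string()));
        citations.insert(("p".to_string(), "b".to_string()));
        citations.advance_to(1);
        citations.flush();
    });
}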

ryzhyk commented 3 years ago

Are you positive that threshold_total subsumes consolidate? Semantically, they do different things. threshold_total makes sure that all weights in a relation are unit weights, whereas consolidate makes sure that each value appears exactly once in the set of changes for a given timestamp. Are you saying that threshold_total is guaranteed to produce consolidated output for each timestamp? It makes sense that the implementation works that way, but it does not follow from the mathematical definition of the operator.
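
(To make the semantic difference concrete: in the single-worker sketch below, with made-up data, the input sees two +1 updates for "x" at time 0. consolidate merges them into one update of weight 2, while distinct, i.e. threshold_total, forces the weight down to 1.)

use differential_dataflow::input::InputSession;
use differential_dataflow::operators::Threshold;

fn main() {
    timely::execute_directly(move |worker| {
        let mut input = InputSession::<u64, String, isize>::new();

        worker.dataflow(|scope| {
            let data = input.to_collection(scope);
            // Merges duplicate updates per timestamp, keeping multiplicities.
            data.consolidate()
                .inspect(|x| println!("consolidate: {:?}", x));
            // Forces every record that is present down to weight 1.
            data.distinct()
                .inspect(|x| println!("distinct:    {:?}", x));
        });

        input.advance_to(0);
        input.update("x".to_string(), 1);
        input.update("x".to_string(), 1);
        input.advance_to(1);
        input.flush();
    });
}

// Prints, in some order:
//   consolidate: ("x", 0, 2)
//   distinct:    ("x", 0, 1)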

mihaibudiu commented 3 years ago

That's exactly the kind of proof we should be able to make once we have a calculus.

namibj commented 3 years ago

> Are you positive that threshold_total subsumes consolidate?

I am, and I extend that claim to all uses of reduce_core. This behavior comes courtesy of its output being an arrangement.

> Semantically, they do different things. threshold_total makes sure that all weights in a relation are unit weights, whereas consolidate makes sure that each value appears exactly once in the set of changes for a given timestamp.

Yes, semantically they do. But both threshold_semigroup (which requires a TotalOrder timestamp) and reduce_core (which copes with partially ordered timestamps) operate on batches. Batches, at least currently, span a closed interval of timestamps (this gets a bit fuzzy for partially ordered timestamps, but it carries over to a natural extension based on antichain frontiers). Also, when batches are built, their updates are consolidated, as this greatly decreases downstream cost (see the sketch after this comment).

> Are you saying that threshold_total is guaranteed to produce consolidated output for each timestamp? It makes sense that the implementation works that way, but it does not follow from the mathematical definition of the operator.

Yes, I am. And, as said, I extend that specific claim to all uses of threshold_semigroup, and furthermore to all uses of reduce_core.
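
(The batch-time consolidation described above can be reproduced in isolation via differential-dataflow's consolidation module. A small sketch, with made-up update triples:)

use differential_dataflow::consolidation::consolidate_updates;

fn main() {
    // (data, time, diff) triples in arrival order, as a batcher might receive them.
    let mut updates = vec![
        ("x", 0u64, 1isize),
        ("y", 0, 1),
        ("x", 0, 1),
        ("x", 1, -1),
    ];
    // Sorts by (data, time), sums the diffs of equal pairs, and drops zero-weight
    // entries: the same canonical form updates are put into when a batch is built.
    consolidate_updates(&mut updates);
    assert_eq!(updates, vec![("x", 0, 2), ("x", 1, -1), ("y", 0, 1)]);
}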

ryzhyk commented 3 years ago

That makes sense, thank you! This should be a trivial optimization to implement, I'll create a PR soon.