vmware-archive / database-stream-processor

Streaming and Incremental Computation Framework
Other
222 stars 19 forks source link

Non-equi joins #97

Open mihaibudiu opened 2 years ago

mihaibudiu commented 2 years ago

DBSP should be to handle incremental joins that do not use just field equality efficiently - they are still bilinear operators. The current API based on indexes does not allow expressing non-equi joins. As a side note, that something that DDlog probably can't implement efficiently.

ryzhyk commented 2 years ago

Do you have a design for this? Even if it's bilinear, unless we have a way to index inputs for fast lookup of matching records, we will need to iterate over the entire trace of the input collections every time.

mihaibudiu commented 2 years ago

No, I don't have a design. I don't think that in general (for arbitrary predicates) you can do better than O(|A| \times |\Delta B| + ...). But that's still better than O(|A| \times |B| + ...), which is what the brute-force solution would do. If I think about it, this is what the incremental Cartesian product would do too, so perhaps this is not such a useful observation.

ryzhyk commented 2 years ago

Predicates that rely on ordering of keys should be doable. Other types of join may require alternative indexing schemes.

Kixiron commented 2 years ago

Antijoins and semijoins would be really nice

ryzhyk commented 2 years ago

Thought about non-equi joins some more. Indeed, range joins (match key k in the left collection with a range of keys f_lower(k) < k' < f_upper(k) in the right collection) can be implemented efficiently even using current data structures. This seems to be the most common kind of non-equi join people care about. So this is good news. The bad news is that this operator is harder to parallelize. I don't see a way around having a complete copy of one of the two collections available to all workers. So either the left or the right collection can be sharded, but then the other collection must be either replicated across all workers or be made available through a shared reference.

ryzhyk commented 2 years ago

Antijoins and semijoins would be really nice

Yep. Semijoins can be constructed out of regular joins and negations, but a dedicated implementation can probably be better.

Kixiron commented 2 years ago

We may could shard by value instead of hash for that particular join, values 0..N go to shard 1, N..N2 to shard 2, etc.

ryzhyk commented 2 years ago

I think even regular hash based sharding would work just fine for one of the collections. Problem is, you cannot shard both at the same time. Sharding by range may not be a great design, e.g., if the values represent time.

mihaibudiu commented 1 year ago

I will close this issue, I think Cartesian products are the way to go. We can certainly build optimized implementations of other joins for specific predicates, but we should open more narrow issues for these.

mihaibudiu commented 1 year ago

Related to #217

ryzhyk commented 1 year ago

We can definitely do better than Cartesian products. We already have the beginning of this in the range_join operator, the main issue is that it's not yet parallelized.

To clarify, this is not about antijoins/semijoins, which are still forms of equi-joins. This issue is about joins where the join condition is, e.g., an inequality (>, <).

mihaibudiu commented 1 year ago

We don't have a way to express this currently as far as I can tell.

ryzhyk commented 1 year ago

We don't have a way to express this currently as far as I can tell.

We do have limited range joins, and we can certainly do more. This is what this issue is about.