Open mihaibudiu opened 2 years ago
Do you have a design for this? Even if it's bilinear, unless we have a way to index inputs for fast lookup of matching records, we will need to iterate over the entire trace of the input collections every time.
No, I don't have a design. I don't think that in general (for arbitrary predicates) you can do better than O(|A| \times |\Delta B| + ...). But that's still better than O(|A| \times |B| + ...), which is what the brute-force solution would do. If I think about it, this is what the incremental Cartesian product would do too, so perhaps this is not such a useful observation.
Predicates that rely on ordering of keys should be doable. Other types of join may require alternative indexing schemes.
Antijoins and semijoins would be really nice
Thought about non-equi joins some more. Indeed, range joins (match key k
in the left collection with a range of keys f_lower(k) < k' < f_upper(k)
in the right collection) can be implemented efficiently even using current data structures. This seems to be the most common kind of non-equi join people care about. So this is good news. The bad news is that this operator is harder to parallelize. I don't see a way around having a complete copy of one of the two collections available to all workers. So either the left or the right collection can be sharded, but then the other collection must be either replicated across all workers or be made available through a shared reference.
Antijoins and semijoins would be really nice
Yep. Semijoins can be constructed out of regular joins and negations, but a dedicated implementation can probably be better.
We may could shard by value instead of hash for that particular join, values 0..N
go to shard 1, N..N2
to shard 2, etc.
I think even regular hash based sharding would work just fine for one of the collections. Problem is, you cannot shard both at the same time. Sharding by range may not be a great design, e.g., if the values represent time.
I will close this issue, I think Cartesian products are the way to go. We can certainly build optimized implementations of other joins for specific predicates, but we should open more narrow issues for these.
Related to #217
We can definitely do better than Cartesian products. We already have the beginning of this in the range_join
operator, the main issue is that it's not yet parallelized.
To clarify, this is not about antijoins/semijoins, which are still forms of equi-joins. This issue is about joins where the join condition is, e.g., an inequality (>, <).
We don't have a way to express this currently as far as I can tell.
We don't have a way to express this currently as far as I can tell.
We do have limited range joins, and we can certainly do more. This is what this issue is about.
DBSP should be to handle incremental joins that do not use just field equality efficiently - they are still bilinear operators. The current API based on indexes does not allow expressing non-equi joins. As a side note, that something that DDlog probably can't implement efficiently.