hydro-project / hydroflow

Hydro's low-level dataflow runtime
https://hydro.run/docs/hydroflow/
Apache License 2.0
473 stars 33 forks source link

Decomposing Join #648

Closed jhellerstein closed 1 month ago

jhellerstein commented 1 year ago
flowchart TD
    optimize["auto-optimize (rewrite) <tt>persist()/deltae()</tt> #347"]
    joinstate["split up <tt>join()</tt> state into <tt>persist()</tt>, etc. #347"]
    joinlattice["lattice types in <tt>join()</tt> (~#271)"]
    joinopt["auto-optimize <tt>join()</tt> state"]

    optimize --> joinopt
    joinstate --> joinlattice --> joinopt

    %% david["David dedalus optimizations?"]
    %% rewrite_api --> david
    %% optimize --> david
zzlk commented 1 year ago

I think this issue hits at the overall question of like what are our edges? Currently they are essentially streams, they don't do deduping, they don't do re-ordering. So that means what does group_by/reduce look like? Currently because edges are just streams then it works with almost any closure you put in, but if we want to treat edges like sets or like lattices or like some other structure then that would greatly affect the implementation of join/groupby/reduce/many other operators.

shadaj commented 1 year ago

Here's a brief summary of one take that definitely requires more discussion: edge types should always be streams by default and lattice types should fall out of applying operators.

For example, putting a stream through a shuffle() operator would produce a new stream, but now effectively with bag semantics. If you have a downstream reduce function that is known to have an associative + commutative function, then we know a safe rewrite can transform that to shuffle() -> fold_ac(...), and the shuffle can then be pushed through other operators and used to perform the aggregation with partitions, etc.

On the types side, we can focus on checking for determinism. For example, it would be a compile error if a shuffle()d stream goes into a fold (it must go into a fold_c).

jhellerstein commented 10 months ago

See #929 -- same issue

MingweiSamuel commented 6 months ago

1050 #1058

MingweiSamuel commented 1 month ago

Closing this as we have lattice_bimorphism() operator - can make separate issues for improving that