hydro-project / hydroflow

Hydro's low-level dataflow runtime
https://hydro.run/docs/hydroflow/
Apache License 2.0
445 stars 33 forks source link

implement semi-naive rewrite #975

Open jhellerstein opened 6 months ago

jhellerstein commented 6 months ago

Here's one version for R \join S on streaming inputs.

This is essentially an implementation of the following Dedalus code:

R@t+1 :- R@t;
R@t+1 :- dR;
S@t+1 :- S@t;
S@t+1 :- dS;
output :- dR, dS;
output :- dR, S;
output :- R, dS;
---
title: Semi-Naive Rewrite of R join S.  By convention, R side is input 0, S side is input 1 in all joins.
---
%%{init:{'theme':'base','themeVariables':{'clusterBkg':'#ddd','clusterBorder':'#888'}}}%%
flowchart TD
dR[source_stream dR]
dS[source_stream dS]
R[Persist R]
S[Persist S]
J2[Join]
J3[Join]
J4[Join]
deferR[defer_tick]
deferS[defer_tick]
dST[Tee]
dRT[Tee]
dRT --> J4
dST -->J4
J4 --> U
dR --> dRT
dS --> dST
R --> J3
dST --> J3
dRT --> J2
S --> J2
J2 --> U
J3 --> U
dST --> deferS --> S
dRT --> deferR --> R
jhellerstein commented 5 months ago

Our existing pipeline hashjoin implements Pipeline Semi-Naive (PSN, Loo, '06) ... there is no wasted work in recursive (relational) joins. It would be worth double-checking that PSN works for multisets and general morphisms on lattices. If so it probably would also handle differential deletes.