hydro-project / hydroflow

Hydro's low-level dataflow runtime
https://hydro.run/docs/hydroflow/
Apache License 2.0

Ops for scheduling ticks and for deferring dataflow across ticks #623

Closed MingweiSamuel closed 9 months ago

MingweiSamuel commented 1 year ago

Right now, a cycle through ticks (next_tick()) causes the scheduler to spin as fast as possible while data cycles. We should try starting the next tick only when an actual external event happens, and see how much that helps or hurts.

jhellerstein commented 1 year ago

Suggestion: next_tick(EAGER) vs. next_tick(LAZY), which forces the user to choose whether to schedule the next tick eagerly or wait for an external event.
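
To make the trade-off concrete, here is a toy model in plain Rust (standard library only, not Hydroflow code). Everything in it is an assumption made for illustration: the `TickMode` enum, the `run_ticks` function, and the "decrement and defer until zero" dataflow are hypothetical stand-ins for a cycle through next_tick().

```rust
use std::collections::VecDeque;

/// Hypothetical scheduling modes, named after the EAGER/LAZY suggestion.
#[derive(Clone, Copy, Debug, PartialEq)]
enum TickMode {
    Eager,
    Lazy,
}

/// Runs a toy tick loop over batches of external events.
/// Returns (ticks executed, items still deferred when the loop stops).
fn run_ticks(mode: TickMode, external: Vec<Vec<u32>>) -> (usize, usize) {
    let mut external: VecDeque<Vec<u32>> = external.into();
    let mut deferred: Vec<u32> = Vec::new();
    let mut ticks = 0;
    loop {
        let should_tick = match mode {
            // Eager: deferred data alone is enough to start another tick.
            TickMode::Eager => !deferred.is_empty() || !external.is_empty(),
            // Lazy: only an external event starts another tick.
            TickMode::Lazy => !external.is_empty(),
        };
        if !should_tick {
            break;
        }
        ticks += 1;
        // This tick sees last tick's deferred items plus one external batch.
        let mut input = std::mem::take(&mut deferred);
        if let Some(batch) = external.pop_front() {
            input.extend(batch);
        }
        // Toy cycle: each positive item is decremented and deferred again.
        deferred = input
            .into_iter()
            .filter(|&n| n > 0)
            .map(|n| n - 1)
            .collect();
    }
    (ticks, deferred.len())
}

fn main() {
    let events = vec![vec![3]];
    println!("eager: {:?}", run_ticks(TickMode::Eager, events.clone()));
    println!("lazy:  {:?}", run_ticks(TickMode::Lazy, events));
}
```

In this model, a single external event carrying `3` takes four ticks in eager mode and drains the cycle, while lazy mode runs one tick and leaves one item parked until the next external event. That is the "helps and hurts" trade-off: less spinning, but deferred data can sit indefinitely.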

jhellerstein commented 1 year ago

Cleaner: one operator is about scheduling (tick()), another is about deferring dataflow (defer()).

On scheduling, we have tick() (internally data-driven) and two sources: spin() (runtime-completion-driven) and source_interval() (wall-clock-time-driven). We should think holistically about this category of ops. Regular sources (source_stream(), etc.) are externally data-driven.

tick() is kind of a sink/source combo ("boomerang" data to yourself across a tick boundary). E.g.:

source_stream() -> tick() -> map()

could be

source_stream(bar) -> dest_local(foo)
source_stream(foo) -> map()

which arguably makes it easier to see the ticking and its results in the middle of a big chain.
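
The "boomerang" reading can be sketched in plain Rust with a standard-library channel. This is not Hydroflow code: `run_boomerang`, the `foo` channel, and the `x * 10` map are hypothetical stand-ins, and dest_local() here is only the name proposed above, modeled as a channel send.

```rust
use std::sync::mpsc::channel;

/// Simulates `source_stream(bar) -> dest_local(foo)` and
/// `source_stream(foo) -> map()` over a sequence of ticks.
/// Returns (tick number, mapped value) for every item the map stage sees.
fn run_boomerang(bar_batches: Vec<Vec<i32>>) -> Vec<(usize, i32)> {
    // `foo` is the local channel the data is boomeranged through.
    let (foo_send, foo_recv) = channel::<i32>();
    let mut out = Vec::new();
    let num_batches = bar_batches.len();
    let mut bar = bar_batches.into_iter();

    // One extra tick so the last batch can come back around.
    for tick in 0..=num_batches {
        // source_stream(foo) -> map(): take only what earlier ticks sent.
        // Collecting first keeps this tick's sends out of this tick's input.
        let arrived: Vec<i32> = foo_recv.try_iter().collect();
        for x in arrived {
            out.push((tick, x * 10));
        }
        // source_stream(bar) -> dest_local(foo): send to ourselves.
        if let Some(batch) = bar.next() {
            for x in batch {
                foo_send.send(x).expect("receiver is still alive");
            }
        }
    }
    out
}

fn main() {
    for (tick, value) in run_boomerang(vec![vec![1, 2], vec![3]]) {
        println!("tick {tick}: {value}");
    }
}
```

Items sent during tick 0 reach the map stage in tick 1, and so on, which is the one-tick deferral that tick() would otherwise hide in the middle of a chain.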