tspurway / hustle

A column oriented, embarrassingly distributed relational event database.

Wrap Hustle into the Disco concurrent pipeline processing model #45

Open ncloudioj opened 10 years ago

ncloudioj commented 10 years ago

The latest version of Disco supports processing multiple stages concurrently. Hustle could use this new feature to speed up query execution.

The tricky part is that in Hustle every stage needs its input to be sorted, which means a stage cannot run until all the output from the previous stage is available and has been sorted. In some cases sorting the input is unnecessary and wasteful. It also prevents Disco from running the following stage concurrently, because that stage has to wait for the whole input to be available.

The following is a typical query that could benefit a lot from the Disco concurrent model:

select(h_sum(foo.cost), where=foo.date=="2014-06-02")
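
To illustrate why a query like this does not need sorted input, here is a minimal sketch in plain Python. It does not use the Hustle or Disco APIs; `scan`, `sorted_sum`, `streaming_sum`, and the sample rows are hypothetical names and data made up for the example. The sum has no grouping key, so the aggregating stage can fold in each value as soon as the filtering stage emits it, whereas a sort in between forces it to wait for the whole input.

```python
def scan(rows, date):
    """Filtering stage: yield the cost of each row matching the where clause."""
    for row in rows:
        if row["date"] == date:
            yield row["cost"]


def sorted_sum(costs):
    """Barrier version: sorted() consumes the entire input before summing starts."""
    return sum(sorted(costs))


def streaming_sum(costs):
    """Streaming version: add each value to the total as soon as it arrives."""
    total = 0
    for cost in costs:
        total += cost
    return total


# Hypothetical sample data standing in for the foo table.
rows = [
    {"date": "2014-06-02", "cost": 5},
    {"date": "2014-06-01", "cost": 7},
    {"date": "2014-06-02", "cost": 3},
]

total_sorted = sorted_sum(scan(rows, "2014-06-02"))
total_streaming = streaming_sum(scan(rows, "2014-06-02"))
print(total_sorted, total_streaming)
```

Both versions return the same total, since addition does not depend on input order. Only the streaming version can overlap with the stage that produces its input, which is the property the concurrent pipeline would exploit.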