mit-pdos / noria

Fast web applications through dynamic, partially-stateful dataflow
Apache License 2.0

Assertion failure in Join when ancestors in other domains #146

Open JustusAdam opened 4 years ago

JustusAdam commented 4 years ago

Setup

I am trying to run a query that computes an average. The graph and the operators are generated from a different language by a compiler, but in SQL it would look something like this:

SELECT sum(x) / count(*)
FROM Tab

Error

The query itself runs fine, but I wanted to see how it would perform if count(*) and sum(x) were computed on different domains. So I hacked the domain assignment to force these operators onto their own domains.

When I do that, however, the join after the two calculations tries to access a nonexistent index in its right ancestor. I expanded the error message (see below), which says that the right ancestor with id 4 was short: the generate_row function tries to access index 2 in the right ancestor's slice, which only has two elements.

This is the error message for the two-domain case. In the four-domain case it's the same, but the id is different (because more ingress/egress operators are generated).

'right (4) was short', noria-server/dataflow/src/ops/join.rs:181:21
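To make the failure mode concrete, here is a minimal sketch of how a join that builds its output row by indexing into its inputs can fail when one input row is shorter than expected. This is not Noria's actual join code: the `Source` enum, this `generate_row` signature, and the use of `i64` columns are assumptions made purely for illustration.

```rust
/// Where one output column of the join comes from.
/// (Hypothetical type for illustration; not Noria's definition.)
#[derive(Clone, Copy, Debug)]
enum Source {
    Left(usize),
    Right(usize),
}

/// Build one output row by picking columns from the left and right rows.
/// Instead of panicking, report which side was too short for the index asked of it.
fn generate_row(left: &[i64], right: &[i64], emit: &[Source]) -> Result<Vec<i64>, String> {
    emit.iter()
        .map(|src| match *src {
            Source::Left(i) => left
                .get(i)
                .copied()
                .ok_or_else(|| format!("left was short: index {} of {} columns", i, left.len())),
            Source::Right(i) => right
                .get(i)
                .copied()
                .ok_or_else(|| format!("right was short: index {} of {} columns", i, right.len())),
        })
        .collect()
}
```

With an emit list that asks for `Source::Right(2)`, a three-column right row succeeds, while a two-column right row yields a "right was short" error, which mirrors the situation described above.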

Questions

Is there something I am missing about domains? Can I not just make any operator into its own domain? Are there any invariants around what can and can't go on a domain?

Runtime graphs

Here are the dot graphs for two domains and four domains and, for good measure, the original (working) single domain.

The relevant operators here are ohua.generated/op_s_acc_0_0 (count(*)) and ohua.generated/op_s_acc_1_0 (sum(x)) and the join afterwards. (The rest is just generated code that does some column renaming.)

How to reproduce

I uploaded a branch (join-after-domain-error-reproduction) to my fork that should contain the complete state of the system necessary (including generated operators) to reproduce the error.

In the udf-benchmarks directory run cargo run --bin features avg-split-domain/two-domainsf.toml

This will run the two-domain scenario. For one or four domains, use the one-domain.toml and four-domains.toml configs, respectively.

jonhoo commented 4 years ago

I'm confused.. There's no join in the query you gave? The query looks like it'd hit the same issue as #137, no?

jonhoo commented 4 years ago

As to your question about domain assignment, you can move most operators into arbitrary domains, as long as you do so before you call migrate (because it adds a bunch of necessary internal operators at domain boundaries). Joins are "special" in that they always require that their inputs are materialized within the same domain as themselves, so moving them may not achieve the effect that you want.

JustusAdam commented 4 years ago

You are right. So basically I give it a different description of this query and it generates one similar to #137 but without the extra views for the two different aggregations.

Interesting. What do you think, would it work if I inserted just an Identity in between the join and its ancestors?

jonhoo commented 4 years ago

I think you'll have to specifically write the query such that the aggregations are done separately and then join them together, as in #137. I'm not sure what purpose the Identity would serve?
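As a rough illustration of that query shape (not the actual query from #137, which is not reproduced here), the sketch below computes the two aggregations independently, joins them on a shared group key, and divides. The function names and the `(key, x)` row layout are hypothetical; a query without GROUP BY corresponds to all rows sharing one constant key.

```rust
use std::collections::HashMap;

/// Sum of `x` per group key (stand-in for the sum(x) aggregation).
fn sum_by_key(rows: &[(u32, i64)]) -> HashMap<u32, i64> {
    let mut out = HashMap::new();
    for &(k, x) in rows {
        *out.entry(k).or_insert(0) += x;
    }
    out
}

/// Number of rows per group key (stand-in for the count(*) aggregation).
fn count_by_key(rows: &[(u32, i64)]) -> HashMap<u32, i64> {
    let mut out = HashMap::new();
    for &(k, _) in rows {
        *out.entry(k).or_insert(0) += 1;
    }
    out
}

/// Inner join of the two aggregates on the group key, then divide.
fn join_avg(sums: &HashMap<u32, i64>, counts: &HashMap<u32, i64>) -> HashMap<u32, f64> {
    sums.iter()
        .filter_map(|(k, s)| counts.get(k).map(|c| (*k, *s as f64 / *c as f64)))
        .collect()
}
```

Because the two aggregations share no state until the join, they are the natural candidates for running in separate domains.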

ms705 commented 4 years ago

The dot graphs for your multi-domain assignments look correct, and I would expect them to work. The error you get seems to indicate that you receive a record of incorrect length; are you sure that the Ohua-generated operators always produce the right output records?

@jonhoo My understanding (from looking at the graphs) is that @JustusAdam wrote the join-based version of the query (as per #137), and that he wants the aggregations to be in different domains for parallel processing. The join input materialization will use extra space (and some compute), but that's fine for his purpose.

@JustusAdam There's no need for an identity node, and it won't change anything -- the join merely forces the automatically-generated "ingress" node to be (partially) materialized, as indicated by the 3/4 symbol in the top right corner. If you added an identity node, that would get materialized instead.

JustusAdam commented 4 years ago

Ah, good to know.

I am fairly confident that it produces the right output, because the one-domain version works just fine. But I will run a trace over it anyway to figure out if it produces bogus output at any point.
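One simple thing such a trace could check is that every record an operator emits has the expected number of columns. The helper below is a hypothetical sketch of that check over plain vectors; it does not use Noria's record types, and `first_bad_width` is a name made up for this example.

```rust
/// Check that every record has exactly `expected` columns.
/// Returns the position and actual width of the first offending record, if any.
fn first_bad_width<T>(records: &[Vec<T>], expected: usize) -> Option<(usize, usize)> {
    records
        .iter()
        .enumerate()
        .find(|(_, r)| r.len() != expected)
        .map(|(i, r)| (i, r.len()))
}
```

Running a check like this on each operator's output batch would narrow down which operator, if any, first produces a record that is too short for the join.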

JustusAdam commented 4 years ago

Also, I am sorry for oversimplifying the query. Yes, @ms705 is correct: I am generating the join-based query from #137.