twitter / summingbird

Streaming MapReduce with Scalding and Storm
https://twitter.com/summingbird
Apache License 2.0
2.14k stars 267 forks source link

Ability to add named nodes #712

Open pankajroark opened 7 years ago

pankajroark commented 7 years ago

Automatically generating good names is very hard. It would be great if summingbird allowed users to have some control over how Heron tasks are named. One way would be to allow using named functions i.e. accept functions with names in the dsl.

source .flatMap("name") { x => ... }

Then while generating task names we could use "name" instead of "FlatMap" that we use now. Similarly for other components.

johnynek commented 7 years ago

We already have .name. Why can't we use that mechanism?

The challenge in general is that the logical graph does not correspond to the physical graph in very obvious ways. On Fri, Jan 27, 2017 at 15:32 Pankaj Gupta notifications@github.com wrote:

Automatically generating good names is very hard. It would be great if summingbird allowed users to have some control over how Heron tasks are named. One way would be to allow using named functions i.e. accept functions with names in the dsl.

source .flatMap("name") { x => ... }

Then while generating task names we could use "name" instead of "FlatMap" that we use now. Similarly for other components.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/twitter/summingbird/issues/712, or mute the thread https://github.com/notifications/unsubscribe-auth/AAEJdiAbsinMy1MT8-96lMXq4JAtn7Nhks5rWpqagaJpZM4LwYQY .

pankajroark commented 7 years ago

You're right if we were to simply replace "FlatMap" with "name" in above example for the entire topology then .name is the best choice. That was actually not what I had in mind I expressed incorrectly, let me elaborate.

What users want is to be able to identify nodes with what's executing on them with a name that makes sense to them. .name applies to the entire subgraph. I'll be happy with any good way of doing this but here's one:

We could keep the current naming scheme but wherever we find named functions we use them instead, overriding name of only the node where they are executed.

e.g. let's say a node was called Tail-FlatMap-Summer and was created using .sumByKey("aggregator")

then the node can be called aggregator. If we merge multiple methods together we can use a concatenated name. If it applies to multiple nodes like in the summer above then we do it based on the component, e.g. map side reduce component of sumByKey above can be called "mapsideAggregator".

Having two ways of naming i.e. .name and this scheme could be confusing and implementing above could be pretty complicated. But having bad names is a big issue for our users. Sometimes names get insane like: "Tail-FlatMap-Summer-FlatMap-Summer-FlatMap-Summer-FlatMap-Source" It's impossible for most users to keep such complicated names in their mind. Searching for viz graphs becomes hard. Unlike batch where tasks are short lived, for realtime each node is like a service and operators have to interact with it for debugging, frequently.

To be honest applying settings by .name is unintuitive to users. What they usually want is for the setting to be applied to just one node and not the entire subtree. We could instead use the above names for applying settings and keep the scope localized. To avoid confusion with .name we can require that only one of the naming mechanisms is used and not both.

johnynek commented 7 years ago

I think we should be able to use .name as an API. We just need improve the algorithm for mapping the name nodes into names for the physical nodes.

Alternatively, we could make a macro that captures line number and file for function literals and use the toString method of the functions to identify position.

We did something similar with scalding to add line numbers to the physical nodes. On Sat, Jan 28, 2017 at 01:58 Pankaj Gupta notifications@github.com wrote:

You're right if we were to simply replace "FlatMap" with "name" in above example for the entire topology then .name is the best choice. That was actually not what I had in mind I expressed incorrectly, let me elaborate.

What users want is to be able to identify nodes with what's executing on them with a name that makes sense to them. .name applies to the entire subgraph. I'll be happy with any good way of doing this but here's one:

We could keep the current naming scheme but wherever we find named functions we use them instead, overriding name of only the node where they are executed.

e.g. let's say a node was called Tail-FlatMap-Summer and was created using .sumByKey("aggregator")

then the node can be called aggregator. If we merge multiple methods together we can use a concatenated name. If it applies to multiple nodes like in the summer above then we do it based on the component, e.g. map side reduce component of sumByKey above can be called "mapsideAggregator".

Having two ways of naming i.e. .name and this scheme could be confusing and implementing above could be pretty complicated. But having bad names is a big issue for our users. Sometimes names get insane like: "Tail-FlatMap-Summer-FlatMap-Summer-FlatMap-Summer-FlatMap-Source" It's impossible for most users to keep such complicated names in their mind. Searching for viz graphs becomes hard. Unlike batch where tasks are short lived, for realtime each node is like a service and operators have to interact with it for debugging, frequently.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/twitter/summingbird/issues/712#issuecomment-275844152, or mute the thread https://github.com/notifications/unsubscribe-auth/AAEJdoTHudvUtnSQBFt1MA1YCm3YgYmbks5rWy1igaJpZM4LwYQY .