twitter / summingbird

Streaming MapReduce with Scalding and Storm
https://twitter.com/summingbird
Apache License 2.0
2.14k stars 267 forks source link

Add a test for naming, and name stripping #635

Closed johnynek closed 8 years ago

johnynek commented 8 years ago

This is to address #633. It is not closing that issue, but it is a start.

Also, this exposes something that I think is a problem: the naming strategy seems pretty broken to me. I think names/options are pretty ill-specified. For instance, looking up options is done in the platforms. Scalding takes the first name that matches:

https://github.com/twitter/summingbird/blob/develop/summingbird-scalding/src/main/scala/com/twitter/summingbird/scalding/ScaldingPlatform.scala#L274

which, does not really make sense if two names match due to a fan-out situation.

The solution I would propose is this:

0) move the option resolving code to a common location and don't repeat that code on the platforms. 1) we only return the first name node we can reach on each branch. So, you can only have more than one name if there is a fan-out and you are named on each branch. 2) Each option has a rule for how to combine if a node has more than one name (fanout) and each of those names has an option set. 3) default only comes into play if there is no name set.

johnynek commented 8 years ago

here is the storm code for getting options:

https://github.com/twitter/summingbird/blob/develop/summingbird-storm/src/main/scala/com/twitter/summingbird/storm/StormPlatform.scala#L129

johnynek commented 8 years ago

added #636

johnynek commented 8 years ago

If it is too dangerous to change how options work for fear that we will break jobs, we should write very clear rules and a lot of tests for those rules, and link that documentation to def name in Producer.scala

johnynek commented 8 years ago

I'm getting a ton of diffs from scalariform, for some reason. I would have assumed we already hit all these files, but maybe not.

ianoc commented 8 years ago

Scalding takes the first name which specifies the option right? not the first name? What occurs in a branch scenario is a bit ill defined, but other than that i think the naming at the moment is reasonably well defined isn't it? -- maybe not the best naming strategy but it is at least defined pretty clearly.

Changing the options so only one name could effect a given node would be a large change, also its not clear to me how that would always work. Since users can name a given node more than once, but moreso, if i go .map.name.map.name how does that get resolved? since we will collapse both map's into one physical block of code. But the user doesn't control how we collapse them

johnynek commented 8 years ago

It seems the logic is the same in the code for both storm and scalding, isn't it?

Yes, I agree that changing the options is probably a non-starter. And I guess the naming is understandable, though fan-out is ill-defined. You can always side-step that by naming a node before a fanout.

The strange thing is that adding a option below will override the defaults for all nodes above it. That could be a little weird, but I guess people aren't too upset with it.

So, I would backstep: let's refactor to share more code (as this does) and add more tests (which this does) to be clearer about how functionality is working.

ianoc commented 8 years ago

Yes i think the logic is the same, or in at least its goal is the same on both platforms. Agree totally about consolidating the options handling and adding the extra tests as you've done here.

Changing the options handling seems like a bit of a non-starter I agree, though fundamentally I wonder if this winds up being very limiting in how one could do graph re-writing. For future generations of this sort of tech probably need to consider how options should be done instead.

ianoc commented 8 years ago

Regarding the scala formatting, are you getting it up from sbt or somewhere else? If i manually modify the scalding platform file and re-run it it puts it back to how it is on develop now. (looking at the whitespace changes on the getCommutativity method)

johnynek commented 8 years ago

Yeah, I'll check. Maybe I have some global plugin or something.

johnynek commented 8 years ago

@ianoc somehow it seems having one of:

addSbtPlugin("org.ensime" % "ensime-sbt" % "0.1.7")

addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.7.5")

in my ~/.sbt/0.13/plugins/plugins.sbt file was resetting the scalariform options.

ianoc commented 8 years ago

Interesting, i have the dependency graph one in mine too, so i'd guess its the ensime one.

Looks like it depends on scalaireform @ https://github.com/ensime/ensime-sbt/blob/master/build.sbt#L25 (looks separate to that project just using it, they have it in their project/plugins too for that purpose). odd

ianoc commented 8 years ago

had a minor comment about documenting a return. Once thats fixed looks good to me, merge when green

johnynek commented 8 years ago

man our storm tests are flakey.

johnynek commented 8 years ago

I wonder if we are scooping up OOM errors into trys or futures somehow, and this is making some tests look flakey that are actually OOM errors on travis (at least some of the failures here were OOM, but the tests reported failed not due to exceptions).