twitter / summingbird

Streaming MapReduce with Scalding and Storm
https://twitter.com/summingbird
Apache License 2.0
2.14k stars 267 forks source link

Pass the semigroup to create Mergeable online #687

Closed johnynek closed 7 years ago

johnynek commented 7 years ago

closes #677

The issue is that it is easy to have the semigroup get out of sync between the implicit semigroup in the Producer graph, and the captured value for storm. This is not totally fixed, since when using a single store as a service and a store, we still need to capture the semigroup, but leftJoin does not have access to that. We could fix that walking the graph and finding the semigroup at the location of the sumByKey. We can do that in a later PR.

jnievelt commented 7 years ago

Looks good. I suppose we'll need a version bump for the MergeableStoreFactory signature change? At least it should be a trivial modification to make sources compatible.

johnynek commented 7 years ago

@jnievelt yes, this will be a minor version break.

pankajroark commented 7 years ago

Looks good overall, one minor comment.

pankajroark commented 7 years ago

@johnynek Could you cut a release before merging this? This will for sure require changes in our internal repository since there are custom implementations of MergeableStoreFactory that will need to change.

johnynek commented 7 years ago

@pankajroark I want to push back on cutting a release. The reason is this:

  1. I don't want to see us raise the bar to "any binary change should have a published release before merging"
  2. this change is: a. unlikely to hit many users I assume (who should have used the constructor functions in most cases) b. totally mechanical to port (just ignore the semigroup for the old behavior, or actually use it to get a safer operation).
  3. The work would be throw-away, since we would need to publish again (since we publish RCs from develop), and any work on a merge would need to be done against the new jar.
  4. We can always roll-back if there is some reason why this change is too onerous (but fixing code to be safer, seems to me like generally a win, not a cost). In fact, lots of the reason for custom MergeableStoreFactory is because you need to know exactly what semigroup is. With this change, that is no longer needed, so I imagine you can have maybe just one or two implementations throughout Twitter.

To turn it around: should Twitter employees publish new jars on each incompatible change for other users to see what the change will be like before a merge? I think that is too much work.

jnievelt commented 7 years ago

Actually I think there's just one place where we call MergeableStoreFactory#mergeableStore and it's in a test. The constructions (using from and apply) should be source compatible at least. I don't think it's a problem to merge before releasing.

pankajroark commented 7 years ago

Yeah, this should be ok. Overriding Mergeable is common, MergeableStoreFactory seems to be created mostly with the help of factory function. By bad in confusing the two.

On cutting a release on breaking changes, it was just an effort on my part to make my life easier. I'm planning to do a summingbird release soon and was hoping to do least amount of code changes in our repo, the last release turned out to be very complicated due to the MemoryPlatform changes. But it's not a big deal. In general I agree I wouldn't expect a release of summingbird on every breaking change.

johnynek commented 7 years ago

I have a +1 on this now, @pankajroark @jnievelt ?