twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.5k stars 706 forks source link

Add a pure counters API to fix issues with impure Stat based counters #1771

Closed johnynek closed 6 years ago

johnynek commented 6 years ago

This revives an idea we thought about 4 years ago:

https://github.com/twitter/scalding/pull/661

which was to have an explicit pure API for writing counters. The included API here has been used at Stripe and prove reliable and popular.

By contrast, the current API in scalding is impure and relies on a side-effect inside map/flatMap operations. We have found this to be unreliable since we have to have a hack to find the FlowProcess to report the counters to.

Since scalding 0.18 removes the TypedPipeFactory which we were using to wire this in, it has to become an intrinsic TypedPipe subclass for us to use it.

In our experience incrementCounters is the main method people use.

cc @avibryant @abuss-stripe @fwbrasil

avibryant commented 6 years ago

Very happy to see this. I'd say countBy is also quite convenient and well-used.

johnynek commented 6 years ago

@avibryant what do you think about renaming "count" to "increment" in these APIs to avoid confusion with KeyedListLike.count?

avibryant commented 6 years ago

I don't like incrementBy as a replacement for countBy because it makes you think the "by" refers to how much to increment, not which counter. I'm not sure what the right alternative would be; finding some naming scheme that is not confused with count would certainly be good. Maybe stats and statsBy etc?

non commented 6 years ago

I like "tally" as a term of art here.

johnynek commented 6 years ago

I'd love to merge this ASAP @fwbrasil since it minimizes our internal diff as we try to use develop (and cascading3 branch which as you know is blocking us).

non commented 6 years ago

👍

gkossakowski commented 6 years ago

✨ I've been looking for tallyBy a number of times in the api docs. Great to see it implemented!