twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0

Use a null flow process when one cannot be found #1970

Closed by navinvishy 2 years ago

navinvishy commented 2 years ago

The Beam runner for Scalding does not work with Hadoop counters (Stat). When Scalding jobs that use the Stat API are run with the Beam runner, they fail with the following error:

"Error in job deployment, the FlowProcess for unique id %s isn't available".format(uniqueId)

It looks like a runner currently cannot provide its own implementation of a stat, because the implementation depends on a Cascading FlowProcess. This change returns a NullFlowProcess when a flow process cannot be found in the flow mapping store, instead of throwing an error. That turns the stat call into a no-op, since the NullFlowProcess does nothing when asked to increment counters.
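To make the behavior change concrete, here is a minimal, self-contained sketch of the lookup pattern. It does not use the real Scalding or Cascading classes: `CounterSink`, `NullCounterSink`, `RecordingCounterSink`, and `FlowProcessLookupSketch` are hypothetical stand-ins for FlowProcess, NullFlowProcess, and the flow mapping store, and the exception type thrown by the strict lookup is an assumption.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical stand-in for a Cascading FlowProcess; only counter increments are modelled.
trait CounterSink {
  def increment(group: String, counter: String, amount: Long): Unit
}

// Stand-in for NullFlowProcess: every increment is silently dropped.
object NullCounterSink extends CounterSink {
  def increment(group: String, counter: String, amount: Long): Unit = ()
}

// A sink that actually records counts, standing in for a real flow process.
final class RecordingCounterSink extends CounterSink {
  private val counts = TrieMap.empty[(String, String), Long]

  def increment(group: String, counter: String, amount: Long): Unit = synchronized {
    val key = (group, counter)
    counts.update(key, counts.getOrElse(key, 0L) + amount)
  }

  def value(group: String, counter: String): Long =
    counts.getOrElse((group, counter), 0L)
}

// Hypothetical stand-in for the flow mapping store, keyed by the job's unique id.
object FlowProcessLookupSketch {
  private val store = TrieMap.empty[String, CounterSink]

  def register(uniqueId: String, sink: CounterSink): Unit =
    store.update(uniqueId, sink)

  // Behavior before the change: a missing entry is an error.
  // The exception type here is an assumption made for this sketch.
  def strictLookup(uniqueId: String): CounterSink =
    store.getOrElse(
      uniqueId,
      throw new IllegalStateException(
        "Error in job deployment, the FlowProcess for unique id %s isn't available".format(uniqueId)
      )
    )

  // Behavior proposed here: a missing entry falls back to the no-op sink.
  def lenientLookup(uniqueId: String): CounterSink =
    store.getOrElse(uniqueId, NullCounterSink)
}
```

With the lenient lookup, a job whose unique id was never registered still runs, but its counter increments are discarded without any signal. That silent loss is the kind of unintended consequence worth weighing.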

Ideally, we would be able to plug in a Beam counter for Stat. The change I have here may not be ideal, but the goal is to discuss what could be done here, and to understand if returning a NullFlowProcess could have other unintended consequences.

CLAassistant commented 2 years ago

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Navin Viswanath does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

johnynek commented 2 years ago

looks like we had test coverage for this, so you need to fix a test.