uber-go / cadence-client

Framework for authoring workflows and activities running on top of the Cadence orchestration engine.
https://cadenceworkflow.io
MIT License
345 stars 131 forks source link

Add a random UUID to worker identities, to prevent collisions #1135

Closed Groxx closed 2 years ago

Groxx commented 2 years ago

The fallback worker identity of {pid}@{host}@{tasklist} is insufficient in practice. E.g. within Uber, our docker setup can somewhat regularly lead to multiple instances of a worker ending up on the same physical host (which is not changed by docker) and with the same PID (as all the docker containers start up the same way -> all service processes get the same PIDs). Other causes are possible too, e.g. if a worker crashes and is restarted it could share the same host+pid, even though its caches (and anything else we actually care about for the identity) are lost.

At the least-problematic-but-confusing level, collisions can lead to a misleadingly-short list of workers on tasklists in the web UI, as the list does not show duplicates.

At the most-problematic level, this can lead to uneven load and sub-par request routing, as only the first/last/??? request to poll the server will receive data intended for a specific worker.

Ideally that wouldn't be an issue, e.g. the server could keep better track of polls and route more precisely, but 1) it's difficult to reliably identify polls since the server only has the poller-provided data (which mis-identifies itself), and 2) even if it can be fixed on the server, improving identity uniqueness helps both fix this now and reduces or eliminates the impact of any future identity confusions.


Alternatively, we could add a process-global random UUID that auto-inits, which would give us IDs that can be correlated across tasklists. While I think this is acceptable as well, I'm loathe to add another global variable, and the benefit of cross-task identity correlating is pretty minor or nonexistent.

If someone actually cares about that, they can pass an explicit ID that they generate however they like.

coveralls commented 2 years ago

Pull Request Test Coverage Report for Build 02177b60-2aba-44a2-aaae-0141bf4c15bb


Files with Coverage Reduction New Missed Lines %
internal/internal_task_pollers.go 4 80.45%
<!-- Total: 4 -->
Totals Coverage Status
Change from base Build 03b9c258-3f56-4ec6-8208-23871bd8eaca: 0.05%
Covered Lines: 12180
Relevant Lines: 17010

💛 - Coveralls