Orchestrators may elect to run several concurrent instances of a node for disaster recovery, redundancy, load balancing, geo routing, etc. Such multi-node setups can introduce complex operational overhead, and currently pose a substantial slashing risk for incorrectly configured orchestrators. A few changes to the protocol can facilitate the use of multi-node orchestrator setups. Primarily, this involves incorporating an orchestrator-selected token into the broadcaster's segment signature, along with the claims/verification changes to support it.
Problem Statement: Multi-Node Setups
There is a high potential for misconfigured orchestrators (transcoders) to be slashed in multi-node setups. Multi-node setups are a useful feature for disaster recovery, redundancy, load balancing, or geo routing. Unfortunately, actually running such a setup is fraught right now, as illustrated by the following approach:
A ServiceURI set to a DNS hostname, which resolves to multiple IPs. There needs to be coordination among orchestrator nodes at the point of the TranscoderInfo call. Otherwise, the following becomes possible:
Broadcaster queries the ServiceURI, receives the hostname, and resolves it to multiple IPs
Broadcaster sends TranscoderInfo requests to all nodes, and receives separate auth tokens from each.
Broadcaster submits the same segments to each node. Nodes do duplicate work and get slashed.
That coordination point essentially becomes a SPOF for the orchestrator. Ensuring maximum resiliency here can get deep and expensive. In other words, it would be good to avoid a requirement to run a zookeeper/etcd cluster or multi-master DB setup just to run a reliable orchestrator.
Solutions
There are a few things we can do on the smart-contract side to help ease the orchestrator's coordination burden, and to minimize the number of "moving parts" an orchestrator is expected to maintain for a robust service.
Implement approved address lists
Approved address lists are not strictly required: an orchestrator could instead copy Eth keys across nodes, but copying keys should be a discouraged practice. Approved addresses would let orchestrators submit transactions from separate nodes without copying keys or routing transactions through a SPOF.
Add orchestrator-chosen token to the segment signature
Currently, the signature scheme looks like this: sign(H(streamId, segNo, dataHash)).
The signature could be modified to this: sign(H(H(token), segNo, dataHash)) where token is a value selected by the orchestrator. The value should be distinct per orchestrator node and per job; nodes belonging to the same orchestrator should not produce the same token for the same job. This value is produced and sent back to the broadcaster when processing the TranscoderInfo RPC call. The token is opaque to the broadcaster and the protocol, which do not need to understand its value; as far as they are concerned, the token is just a byte-string to use within the signature. There are a few benefits to using a token:
In conjunction with other changes, this would allow broadcasters to be charged appropriately for submitting duplicate segments to separate orchestrator nodes, given that each node produces a distinct token for the job.
Mitigates the replay attack resulting from reused StreamIDs for broadcaster-designated signers, as long as the token is tied to the job. A transcoder should be verifying tokens and signatures corresponding to the jobs it knows about.
Implicitly, allows broadcasters to reuse sequence numbers for a given job. This would allow portability of broadcaster/signer keys without also transferring the associated job state. Not that transporting keys is a good idea, but a deeper level of job reuse might come in handy.
Increases protocol robustness against 'accidents' leading to repeated segment numbers (without those necessarily being duplicate segments).
Several possible constructions for the token:
Random nonce
Some concatenation of an orchestrator's internal "ID", the job ID, the broadcaster/signer address, etc.
Hash of the RPC authToken
A signed byte-string of the above. Not strictly necessary, since any 'forgery' would imply manipulating the orchestrator's internal state such that it would still validate the token.
Note that the TranscoderInfo RPC call already contains a secret authToken that is meant to authenticate the broadcaster to the transcoder in a stateless fashion. This new 'signing token' would not be secret (it'd be revealed during claims), but it should be distinct enough for the orchestrator to associate a segment with an orchestrator instance and job. One possible construction of the signing token would be a hash of the authToken, which is itself signed.
Note that the token can replace the StreamId in the segment sig, since the StreamId is there mostly for the benefit of the orchestrator's own checking. If produced appropriately, the token should be associated with the job ID / stream ID. However, continuing to use the StreamId in the hash won't hurt, either.
Claims and Verification
Add the token to the on-chain claim. Change the current claimWork to
function claimWork(uint256 _jobId, uint256[2] _segmentRange, uint256 _tokenHash, bytes32 _claimRoot)
and store the tokenHash in the Claim struct as appropriate. (Current implementation: https://github.com/livepeer/protocol/blob/37da421d38a13313809c63b945953314eaaca455/contracts/jobs/JobsManager.sol#L278)
From there, the protocol allows us to have overlapping claim ranges for the same job, since each claim is indexed by a sequential claimId. Using the tokenHash stored in the claim, we can validate the broadcaster signature during verification. (Current verification: https://github.com/livepeer/protocol/blob/37da421d38a13313809c63b945953314eaaca455/contracts/jobs/JobsManager.sol#L366-L372)
Monitoring for Double Claims
Monitoring for double claims would also require incorporating the token in the check, and only triggering a slash if there is overlap in both the segment ranges and the token. The token can be found within the claims stored on-chain, so this should not be a big change.
Other Attacks
These changes won't prevent all types of attacks on multi-node setups. Here are a few:
Double Dipping
There is still an issue of 'double dipping' whereby a broadcaster overdrafts its deposited Eth by submitting segments simultaneously to multiple orchestrator nodes, but that is less severe than an orchestrator being slashed. The double-dipping issue also exists across multiple jobs (with distinct orchestrators).
Minimum Segment Thresholds
Broadcasters could take advantage of an orchestrator's multiple nodes to abuse its policy for minimum segment thresholds. For example, if a transcoder had a policy to only submit claims containing a certain number of segments and to reject streams past a certain failure threshold, a broadcaster could submit a specially crafted sequence designed to maximize transcoded output, repeating the exercise across the orchestrator's uncoordinated nodes.
For those reasons, a certain amount of coordination will probably be desirable for orchestrators in the long term. However, a lack of coordination shouldn't handicap orchestrators such that an operational problem results in them getting slashed, or forces downtime along with a complicated recovery. Not only does that adversely affect the Livepeer network's reliability, it also makes the threshold for running a reliable transcoder that much higher.