livepeer / LIPs

Livepeer Improvement Proposals

Multi-Node Support #14

Open j0sh opened 5 years ago


Summary

Orchestrators may elect to run several concurrent instances of a node for disaster recovery, redundancy, load balancing, geo routing, etc. Such multi-node setups can introduce complex operational overhead and currently pose a substantial slashing risk for incorrectly configured orchestrators. A few changes to the protocol can facilitate the use of multi-node orchestrator setups. Primarily, this involves incorporating an orchestrator-selected token into the broadcaster's segment signature, along with the claims/verification changes needed to support it.

Problem Statement: Multi-Node Setups

There is a high potential for misconfigured orchestrators (transcoders) to be slashed in multi-node setups. Multi-node setups are a useful feature for disaster recovery, redundancy, load balancing, or geo routing. Unfortunately, actually running such a multi-node setup is fraught right now.

Solutions

There are a few things that we can do on the smart-contract side to help ease the orchestrator's coordination burden, and to minimize the number of "moving parts" an orchestrator is expected to maintain for a robust service.

Implement approved address lists

Approved address lists are not mandatory. While an orchestrator could copy its Eth keys across nodes, copying keys should be a discouraged practice. Approved addresses would let orchestrators submit transactions from separate nodes without copying keys, or routing transactions through a single point of failure (SPOF).
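To make the mechanism concrete, here is a minimal Go sketch of the approved-address idea — an orchestrator registers the addresses of its node instances, and submissions are accepted from any of them. This is an illustration of the concept, not the proposed contract code; all names (`ApprovedList`, `MaySubmit`, the string addresses) are hypothetical.

```go
package main

import "fmt"

// ApprovedList sketches the proposed approved-address mechanism: an
// orchestrator registers addresses of its node instances so they may submit
// transactions on its behalf without sharing the orchestrator's main key.
type ApprovedList struct {
	orchestrator string
	approved     map[string]bool
}

func NewApprovedList(orchestrator string) *ApprovedList {
	return &ApprovedList{orchestrator: orchestrator, approved: map[string]bool{}}
}

// Approve adds a node address; Revoke removes one (e.g. a decommissioned node).
func (l *ApprovedList) Approve(addr string) { l.approved[addr] = true }
func (l *ApprovedList) Revoke(addr string)  { delete(l.approved, addr) }

// MaySubmit reports whether addr may submit transactions for the orchestrator.
func (l *ApprovedList) MaySubmit(addr string) bool {
	return addr == l.orchestrator || l.approved[addr]
}

func main() {
	l := NewApprovedList("0xOrch")
	l.Approve("0xNodeA")
	fmt.Println(l.MaySubmit("0xNodeA"), l.MaySubmit("0xNodeB")) // true false
}
```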

Add orchestrator-chosen token to the segment signature

Currently, the signature scheme looks like this: `sign(H(streamId, segNo, dataHash))`.

The signature could be modified to this: `sign(H(H(token), segNo, dataHash))` where token is a value selected by the orchestrator. The value should be distinct per orchestrator node and per job; nodes belonging to the same orchestrator should not produce the same token for the same job. This value is produced and sent back to the broadcaster when processing the TranscoderInfo RPC call. The token is opaque to the broadcaster and the protocol, neither of which needs to understand its value; as far as they are concerned, the token is just a byte-string to use within the signature. There are a few benefits to using a token:

Several possible constructions for the token:

Note that the TranscoderInfo RPC call already contains a secret authToken that is meant to authenticate the broadcaster to the transcoder in a stateless fashion. This new 'signing token' would not be secret (it'd be revealed during claims), but it should be distinct enough for the orchestrator to associate a segment with an orchestrator instance and job. One possible construction of the signing token would be a hash of the authToken, which is itself signed.
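A minimal sketch of that construction, with `sha256` again standing in for the protocol hash. Mixing in a node identifier is an assumption added here (not part of the text above) to keep tokens distinct across instances of the same orchestrator; the signing step over the authToken is elided.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// signingToken derives the non-secret signing token from the per-job
// authToken. The nodeID suffix is a hypothetical addition so that nodes of
// the same orchestrator produce distinct tokens for the same job.
func signingToken(authToken []byte, nodeID string) []byte {
	buf := append(append([]byte{}, authToken...), []byte(nodeID)...)
	h := sha256.Sum256(buf)
	return h[:]
}

func main() {
	auth := []byte("secret-auth-token-for-job-7")
	t1 := signingToken(auth, "node-1")
	t2 := signingToken(auth, "node-2")
	fmt.Println(len(t1), string(t1) != string(t2)) // 32 true
}
```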

Note that the token can replace the StreamId in the segment sig, since the StreamId is there mostly for the benefit of the orchestrator's own checking. If produced appropriately, the token should be associated with the job ID / stream ID. However, continuing to use the StreamId in the hash won't hurt, either.

Claims and Verification

Add the token to the on-chain claim. Change the current claimWork to

```solidity
function claimWork(uint256 _jobId, uint256[2] _segmentRange, uint256 _tokenHash, bytes32 _claimRoot)
```

and store the tokenHash in the Claim struct as appropriate.

https://github.com/livepeer/protocol/blob/37da421d38a13313809c63b945953314eaaca455/contracts/jobs/JobsManager.sol#L278

From there, the protocol allows us to have overlapping claim ranges for the same job since each claim is indexed by a sequential claimId. Using the tokenHash that is stored in the claim, we can validate the broadcaster signature during verification.

https://github.com/livepeer/protocol/blob/37da421d38a13313809c63b945953314eaaca455/contracts/jobs/JobsManager.sol#L366-L372

Monitoring for Double Claims

Monitoring for double claims would also require incorporating the token into the check, and only triggering a slash if there was overlap in both the segment ranges and the token. The token can be found within the claims stored on-chain, so this should not be a big change.

Other Attacks

These changes won't prevent all types of attacks on multi-node setups. Here are a few:

Double Dipping

There is still an issue of 'double dipping', whereby a broadcaster overdrafts its deposited Eth by submitting segments simultaneously to multiple orchestrator nodes, but that is less severe than an orchestrator being slashed. The double-dipping issue also exists across multiple jobs (with distinct orchestrators).

Minimum Segment Thresholds

Broadcasters could take advantage of an orchestrator's multiple nodes to abuse its policy for minimum segment thresholds. For example, suppose a transcoder had a policy to only submit claims containing a certain number of segments, and to reject streams past a certain failure threshold. A broadcaster could then submit a specially crafted sequence designed to maximize transcoded output, repeating the exercise across the orchestrator's uncoordinated nodes.

For those reasons, a certain amount of coordination will probably be desirable for orchestrators in the long term. However, a lack of coordination shouldn't handicap orchestrators such that an operational problem results in them getting slashed, or forces downtime along with a complicated recovery. Not only does that adversely affect the Livepeer network's reliability, it makes the threshold for running a reliable transcoder that much higher.