livepeer / go-livepeer

Official Go implementation of the Livepeer protocol
http://livepeer.org

TicketParams expiration time #1343

Closed · kyriediculous closed this issue 4 years ago

kyriediculous commented 4 years ago

TicketParams Expiration


name: TicketParams expiration
about: Add an expiration block number to TicketParams
author: @kyriediculous (based on an original idea from @yondonfu)


Abstract

OrchestratorInfo is the data structure that is communicated over the wire from an Orchestrator to a Broadcaster. It contains two fields important for payments to work: pricing information and ticket parameters for the probabilistic micropayments protocol.

// The orchestrator sends this in response to `GetOrchestrator`, containing
// miscellaneous data related to the job.
message OrchestratorInfo {

  // URI of the transcoder to use for submitting segments.
  string transcoder  = 1;

  // Parameters for probabilistic micropayment tickets
  TicketParams ticket_params = 2;

  // Price Info containing the price per pixel to transcode
  PriceInfo price_info = 3;

  // Orchestrator returns info about own input object storage, if it wants it to be used.
  repeated OSInfo storage = 32;
}

[1] go-livepeer/net/lp_rpc.proto

Due to the client-side implementation of the payment protocol, this information is subject to regular change: to avoid dealing with gas price fluctuations on the Ethereum network when redeeming winning tickets, a transaction cost overhead is charged.

Other reasons for a change of TicketParams would be an on-chain winning ticket redemption or a manual change in the node configuration.

When OrchestratorInfo changes mid-stream, a Broadcaster is unaware that the old parameters will no longer be honoured by an Orchestrator, resulting in a payment error and an interruption of transcoding. The Orchestrator currently offers a grace period, but that could prove to be a game-able mechanic, allowing a Broadcaster to send any TicketParams while in the grace period.

We can remove the current grace period implementation and instead introduce the concept of an expiration time on TicketParams.

Motivation

The feature should increase transcoding reliability: payment errors should no longer occur during a happy flow, and it removes the issue of the number of concurrent streams from a Broadcaster exceeding the maximum acceptable error count. The expiration scheme is an improvement over the current grace period implementation, which introduces several issues of its own:

  1. The current grace period is set at 3 acceptable payment errors. When a Broadcaster sends in more concurrent streams than this, it will inevitably exceed the error count when OrchestratorInfo changes, even when neither party deviates from the protocol [2].

  2. The Broadcaster is unaware whether a particular set of OrchestratorInfo will be honoured by the Orchestrator it is working with, while still sending a payment along with its transcode request. This can be exploited by Orchestrators by setting large negative balances for any Broadcaster and a high ticket winning probability, potentially allowing them to redeem winning tickets without doing work.

  3. The grace period creates a new attack vector on the Orchestrator whereby a Broadcaster can get free transcoding:

    • The grace period is tracked by ManifestID, which is picked by the Broadcaster. The Broadcaster could change this value every three segments and never run into the maximum error count.
    • The Broadcaster can wait for a change in gas price, which resets the error count, to send new transcode requests with invalid payments.

Given this set of known issues there are two options:

  1. Look for a better solution
  2. Find ways to mitigate the risks so they are no longer statistically relevant

Rather than trying to limit the attack surface of the grace period mechanism, which would involve adding more complexity to an already complex mechanism, we can spend slightly more effort building a more robust solution.

The expiration time mechanism is unaffected by the number of concurrent streams a Broadcaster operates, allowing it to immediately scale without additional implementation considerations.

A simpler solution will also make the pm package in go-livepeer more maintainable and accessible to new engineers.

The benefits mentioned above outweigh the downsides: the extra engineering effort of stripping out the grace period logic completely, and the few additional bytes added to transcoding responses sent over the wire with each segment by having an Orchestrator attach its current TicketParams to each response.

[2] https://github.com/livepeer/go-livepeer/issues/1075

Proposed Solution

From a high-level the proposed solution would look as follows:

A new field, ExpirationTime, is added to the OrchestratorInfo message; it will contain a timestamp until which the current parameters are valid.

The Broadcaster can check whether the OrchestratorInfo contains a valid ExpirationTime during payment generation when creating its transcode request; the current implementation of go-livepeer already validates the price and ticket parameters at this point in the process. If OrchestratorInfo is past expiration when generating a payment, the Broadcaster will request the latest OrchestratorInfo from the Orchestrator. If no valid OrchestratorInfo is provided upon this refresh, the Orchestrator will be removed from the working set.

The Orchestrator can keep track of previously advertised sets of OrchestratorInfo that have not expired and cross-check received tickets and their expiration time to see whether it should accept or reject an inbound transcoding request. It can do this by binding ticket parameters to the recipientRand (HMAC) generation.

To allow the Broadcaster to always have access to a non-expired set of OrchestratorInfo, the data should be sent back by the Orchestrator in each response to a transcode request.

Implementation Tasks and Considerations

Global clock

We'll need a logical clock to give participants in the system a global synchronised view of time. We can use the block numbers of blocks in the Ethereum network to indicate expiration time.

If we use the block number instead of the block timestamp, we don't need to alter the current eth/blockwatch package (it only saves the block number) or create an additional service that queries a remote Ethereum node for the block timestamp when a new block is found.

Since block timestamps closely track block numbers (timestamps are part of block validation to ensure they are within certain bounds), there should be no practical difference.

Extending OrchestratorInfo

We can simply extend the current OrchestratorInfo protobuf message with a new field, ExpirationBlock, using the next ascending field index. Broadcaster nodes that don't upgrade their node software will ignore the field.

message OrchestratorInfo {
  ...
  bytes expiration_block = 4;
}

Analogously we can adjust pm.TicketParams to include both ExpirationBlock *big.Int and PricePerPixel *big.Rat so we have access to these values inside the sender and recipient.
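
As a rough illustration, the extended struct could look something like the sketch below; the existing fields reflect pm.TicketParams at the time of writing and the two new fields are placeholders rather than the final implementation.

package pm

import (
    "math/big"

    ethcommon "github.com/ethereum/go-ethereum/common"
)

// TicketParams sketch: the parameters an Orchestrator currently advertises
// plus the two additions proposed in this spec.
type TicketParams struct {
    Recipient         ethcommon.Address
    FaceValue         *big.Int
    WinProb           *big.Int
    RecipientRandHash ethcommon.Hash
    Seed              *big.Int

    // Proposed additions
    ExpirationBlock *big.Int // block height from which these params are no longer valid
    PricePerPixel   *big.Rat // price advertised alongside (and bound to) these params
}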

Default Expiration Time

A set of TicketParams T would have to remain valid until the next advertised set of TicketParams T+1 can be received by the Broadcaster, so that the Broadcaster is given ample time to update without disrupting transcoding due to minor latency discrepancies.

Since we'll be using block numbers, the expiration time will be at least the block time of the Ethereum network (currently ~12 seconds).

Given that an Orchestrator needs to respond within 8 seconds anyway that would put the maximum possible time required for a Broadcaster to receive the updated TicketParams at blockTime + 8 seconds.

Using block numbers avoids having to account for fluctuations in block time, and the expiration time would have to be 2 blocks. As a result, TicketParams generated at block height N will be valid up until (but not including) block height N+2.
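
A minimal sketch of that rule, with lastSeenBlock standing in for whatever the node currently considers the latest block:

package pm

import "math/big"

// expirationBlock returns the block height at which params created at block
// height N expire: they are valid at heights N and N+1 and expired from N+2.
func expirationBlock(creationBlock *big.Int) *big.Int {
    return new(big.Int).Add(creationBlock, big.NewInt(2))
}

// paramsExpired reports whether advertised params are expired as of the
// node's last seen block.
func paramsExpired(expiration, lastSeenBlock *big.Int) bool {
    return lastSeenBlock.Cmp(expiration) >= 0
}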

TicketParams validation

An Orchestrator needs to make sure that when it receives a payment, the payment uses parameters that it has previously advertised.

To do this we can incorporate the ticket parameters when generating the recipientRand (as an HMAC).

recipientRand = HMAC_SHA256(secret, seed || sender || faceValue || winProb || expirationBlock || price)

By binding the ticket params to the HMAC we can check that recipientRandHash from a received ticket is the same as one previously advertised: the Orchestrator reconstructs the recipientRand from the parameters provided with the ticket and checks it against the recipientRandHash on the ticket. Since the HMAC is generated with a secret only the Orchestrator knows, it is impossible for a Broadcaster to forge ticket parameters.
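
A sketch of how the Orchestrator could generate and verify this binding; the exact byte encoding of the inputs and the hashing of recipientRand into recipientRandHash are assumptions for illustration, not the final implementation.

package pm

import (
    "crypto/hmac"
    "crypto/sha256"
    "math/big"

    ethcommon "github.com/ethereum/go-ethereum/common"
    "github.com/ethereum/go-ethereum/crypto"
)

// genRecipientRand binds the advertised parameters to recipientRand via an
// HMAC keyed with the Orchestrator's secret.
func genRecipientRand(secret [32]byte, seed *big.Int, sender ethcommon.Address, faceValue, winProb, expirationBlock *big.Int, price *big.Rat) *big.Int {
    h := hmac.New(sha256.New, secret[:])
    h.Write(seed.Bytes())
    h.Write(sender.Bytes())
    h.Write(faceValue.Bytes())
    h.Write(winProb.Bytes())
    h.Write(expirationBlock.Bytes())
    h.Write([]byte(price.RatString()))
    return new(big.Int).SetBytes(h.Sum(nil))
}

// validateParams reconstructs recipientRand from the params included with a
// received ticket and checks it against the ticket's recipientRandHash. Since
// only the Orchestrator knows the secret, a Broadcaster cannot forge params.
func validateParams(secret [32]byte, recipientRandHash ethcommon.Hash, seed *big.Int, sender ethcommon.Address, faceValue, winProb, expirationBlock *big.Int, price *big.Rat) bool {
    rand := genRecipientRand(secret, seed, sender, faceValue, winProb, expirationBlock, price)
    return crypto.Keccak256Hash(ethcommon.LeftPadBytes(rand.Bytes(), 32)) == recipientRandHash
}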

Reworking RoundsWatcher into a more generic TimeWatcher

The last seen block number needs to be read in three different places:

The easiest path to achieve this is to extend the functionality of the roundsWatcher, which currently watches for the latest round number and the block hash of the block it was initialised in, to also keep track of the last seen block number. Since the roundsWatcher already subscribes to the blockwatcher, this can be as simple as keeping track of a single additional state variable.

Alternatively, we could query this information from the DB since it is already stored there, but this would require additional dependency injection as well as an alternative polling mechanism for the ticketQueue rather than relying on a channel (see Delay Ticket Redemption Until TicketParams Expiration).

Delay Ticket Redemption Until TicketParams Expiration

Currently winning tickets are redeemed on-chain instantly when possible upon receiving them, revealing the current recipientRand value in the commit-reveal scheme.

This opens up a new attack vector in the case of the TicketParams not yet being expired. The Orchestrator would still honour the revealed recipientRand, but it would be possible for a Broadcaster to grind ticket.senderNonce values that produce signatures over the ticket resulting only in non-winning tickets.

KECCAK256(sig, recipientRand) < winProb

Within the current version of the payment protocol, which uses ECDSA signatures and a commit-reveal scheme, the easiest solution to implement right now would be to delay winning ticket redemption until after expiration of the TicketParams that constitute the ticket.

We could leverage the existing ticketQueue for this. Caveat here is that tickets would be stored and retried on a per-sender basis despite ticket expiration being global across senders. This should be ok however.

ExpirationBlock can be added to pm.TicketParams as well as pm.Ticket (when using pm.NewTicket() and batch.Ticket()), giving access to a ticket's expiration block in the ticketQueue.

Instead of trying to redeem tickets in the queue when the maxFloat for a sender changes, we can run the loop upon discovery of a new block. To achieve this we can create a subscription on TimeWatcher that allows us to pass in a sink channel (chan<- *big.Int) on which to receive updates.
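
A rough sketch of that subscription surface, assuming a plain slice of sink channels rather than whatever subscription primitive the final TimeWatcher ends up using:

package watchers

import (
    "math/big"
    "sync"
)

type TimeWatcher struct {
    mu            sync.Mutex
    lastSeenBlock *big.Int
    blockSinks    []chan<- *big.Int
}

// SubscribeBlocks registers a sink channel that receives the latest block
// number whenever a new block is seen.
func (tw *TimeWatcher) SubscribeBlocks(sink chan<- *big.Int) {
    tw.mu.Lock()
    defer tw.mu.Unlock()
    tw.blockSinks = append(tw.blockSinks, sink)
}

// LastSeenBlock returns the most recently observed block number.
func (tw *TimeWatcher) LastSeenBlock() *big.Int {
    tw.mu.Lock()
    defer tw.mu.Unlock()
    return tw.lastSeenBlock
}

// setLastSeenBlock would be called from the existing blockwatch event handler.
func (tw *TimeWatcher) setLastSeenBlock(blk *big.Int) {
    tw.mu.Lock()
    defer tw.mu.Unlock()
    tw.lastSeenBlock = blk
    for _, sink := range tw.blockSinks {
        select {
        case sink <- blk:
        default: // don't block on slow subscribers
        }
    }
}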

Having tickets not being redeemed immediately could lead to issues at the end of rounds where the Orchestrator will be removed from the active set in the coming round. Given, however, that active set churn is currently very low and there are other end-of-round improvements that can be made for the payment scheme, it seems reasonable to set that concern aside for the time being.

The longer-term solution would be to remove interactivity from the winning ticket selection protocol by moving to a non-interactive scheme such as Verifiable Random Functions. The unknown is whether the cost of a VRF implementation executed on the EVM is warranted.

Refresh TicketParams when failing over

It is possible, whenever the Broadcaster needs to fail over and select a new Orchestrator, that the OrchestratorInfo for the new Orchestrator has expired if the session hasn't been refreshed recently.

Consider the following scenario:

  1. B has orchestrators O1, O2 in its active set
  2. O1 is performing well, so B continues using it
  3. O2 ticket expires
  4. O1 has a quick, transient issue that causes transcoding to take longer than realtime for one particular segment.
  5. B tries to fall back to O2. Ticket has expired, so B drops O2 from the working set. No other orchestrators available, so B also drops the segment.

The quick solution here would be to retry submitting the segment once, directly querying the selected Orchestrator for updated OrchestratorInfo. The latency added by this request should be at most a couple hundred milliseconds; otherwise the Orchestrator would likely be slow to respond with transcoded results as well.
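
A sketch of that retry-once flow; the types and the refresh/send hooks are illustrative stand-ins for the broadcaster's actual session handling, not the real go-livepeer types.

package server

import (
    "context"
    "errors"
    "math/big"
)

// Minimal stand-ins for the broadcaster's session state.
type orchInfo struct {
    Transcoder      string   // orchestrator URI
    ExpirationBlock *big.Int // block at which the advertised params expire
}

type session struct {
    info orchInfo
}

var errExpiredParams = errors.New("expired ticket params")

// submitWithRefresh retries once with refreshed OrchestratorInfo: if the
// cached params are expired, fetch fresh info before submitting the segment,
// and give up (so the caller can drop the session) if the refresh also
// returns expired params.
func submitWithRefresh(
    ctx context.Context,
    sess *session,
    lastSeenBlock *big.Int,
    refresh func(ctx context.Context, uri string) (orchInfo, error),
    send func(ctx context.Context, sess *session) error,
) error {
    expired := func(info orchInfo) bool {
        return lastSeenBlock.Cmp(info.ExpirationBlock) >= 0
    }

    if expired(sess.info) {
        info, err := refresh(ctx, sess.info.Transcoder)
        if err != nil || expired(info) {
            return errExpiredParams // caller removes this O from the working set
        }
        sess.info = info
    }

    return send(ctx, sess)
}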

If the latency is found to be excessive in the first iteration of testing, we can move to a background service that updates OrchestratorInfo for each orchestrator in the working set.

Testing Tasks and Considerations

Known Unknowns

Alternatives

Additional Context

Original discussion

yondonfu commented 4 years ago

We can simply extend the current TicketParams protobuf message with a new field

If the expiration block also applies to the price then it might make sense to add the field to OrchestratorInfo i.e.

message OrchestratorInfo {
    ...
    int64 payment_params_expiration_block = 4;
}

A set of TicketParams T would have to remain valid until the next advertised set of TicketParams T+1 is able to be received by the Broadcaster so that the Broadcaster is given ample time to update without disrupting transcoding due to minor latency discrepancies.

As we discussed on our call, another factor to consider in setting the expiration time is the lack of synchronization of the view of the latest block for B & O - the two parties may not always have the same view of the latest block. An expiration time of >= 2 blocks should help with this.

To keep track of previously advertised ticket parameters we can create a mapping on the recipient to replace the current recipient.invalidRands map.

As mentioned on our call, we don't need to do any additional bookkeeping as long as we include the expirationBlock as an input to the HMAC along with the ticket faceValue and winProb. The recipientRand generated by O will not be the pre-image for the recipientRandHash included in the ticket sent by B if O did not previously advertise the faceValue and winProb of the ticket with the provided expirationBlock.

Not including PriceInfo in this validation scheme could be fine as accounting should make sure an Orchestrator is paid fairly.

As we discussed on our call, we do need to validate the price or else the following scenario is possible:

  1. B receives price A from O
  2. O debits transcoding fee from B's balance with price B
  3. B debits transcoding fee from its balance with price A

Including the price as an input in the HMAC allows O to validate that it previously advertised the price and that it is not yet expired. B would expect O to accept the price included in a payment as long as the price is not yet expired.

Within the current version of the payment protocol which uses deterministic signatures

Slight correction: The current PM protocol actually does not use deterministic signatures since the ECDSA scheme used by the node is a randomized signature scheme. An example of a deterministic signature scheme would be RSA.

Instead of trying to redeem tickets in the queue when the maxFloat for a sender changes, we can run the loop upon discovery of a new block. A check can be added where we check for sufficient max float from the sender to redeem the ticket, to also check its expiration.

So this would turn the queue from a "retry queue" where we only add tickets if they need to be retried into a general "redemption queue" where we add all tickets that should be redeemed. I think this should work. The queue will need a way to access the max float (which is currently stored in the SenderMonitor) every time a new block is received. At the moment, the queue only receives the max float value when the max float changes as signaled by the SenderMonitor.

Tracking accounting balances per Broadcaster address instead of Manifest ID is still an open issue and area of discussion. To ensure sufficient safeguarding by the accounting mechanism it could be included in the scope of this project.

I think this can be considered separately from this project.

kyriediculous commented 4 years ago

If the expiration block also applies to the price then it might make sense to add the field to OrchestratorInfo i.e.

Correct, this was a consideration made due to the section about excluding price from the equation.

the two parties may not always have the same view of the latest block. An expiration time of >= 2 blocks should help with this.

A blockPolling interval smaller than 1/2 the block time could also help with synchronization issues: the faster we retry when missing a block, the faster the views are in sync. The default is currently 5 seconds, so I think this should be fine overall.

Increasing the expiration time any further would impact the transaction cost overhead assigned to the ticket at creation time and the actual transaction cost at redemption time. We have to consider a good balance here, and I think lastSeenBlock <= creationBlock + 1 constituting non-expired would be a good fit as discussed.

The queue will need a way to access the max float (which is currently stored in the SenderMonitor) every time a new block is received. At the moment, the queue only receives the max float value when the max float changes as signaled by the SenderMonitor.

Would it though? I see two options here.

  1. Remove the maxFloat check in the ticketQueue consume loop when trying to send a ticket into the redeem channel; we do this check in recipient.redeemWinningTicket anyway, and the ticket is re-added to the queue if the float isn't sufficient.

  2. Call maxFloat and signal it to the queue before adding a ticket to the queue.

I like option 1 better but we do change the purpose of the queue as you mentioned. It also removes the need to do any additional state tracking on the ticketQueue.

Example:

I think this can be considered separately from this project.

Gotcha

The following is a copy/paste from Discord, but it's a consideration I'm currently having as well:

Currently whenever we encounter a recipient.ReceiveTicket() error in orch.ProcessPayment() we will:

For bullet point 2, this case no longer exists: after removing acceptable/unacceptable errors, if recipient.ReceiveTicket() returns an error the ticket will never be winning either. Previously we would check in recipient.acceptTicket() whether the error was acceptable.

For bullet point 1, however, there is the question of whether we should still continue with recipient.ReceiveTicket() for each ticket in the payment if the first one returns an error. The only case in which this error wouldn't be caused by faulty TicketParams (I think) is signature verification. In all other cases we might as well stop trying to receive tickets. But then again, I wonder in what case we would have faulty signatures for only a couple of tickets in the batch, assuming nodes are following the protocol.

yondonfu commented 4 years ago

remove the maxFloat check in the ticketQueue consume loop when trying to send a ticket into the redeem channel , we do this check in recipient.redeemWinningTicket anyway and the ticket is re-added to the queue if the float isn't sufficient.

Good point. This sounds good to me.

For bullet point 2, this case no longer exists

We can still receive a winning ticket even if an error is returned if we check the param expiration after running the basic ticket validity checks (via Validator.ValidateTicket()). As long as the basic ticket validity checks pass, the ticket can still be redeemable on-chain if it wins. So, it could make sense to do the param expiration check in ReceiveTicket() after the basic ticket validity checks to make sure that we always redeem any winning tickets received. If we do this, then the caller of ReceiveTicket() would still do something like:

_, won, err := recipient.ReceiveTicket(...)
if won {
    recipient.RedeemWinningTicket(...)
}
if err != nil {
    // Handle error
}

For bullet point 1 however there is the question then if we should still continue with recipient.ReceiveTicket() for each ticket in the payment if the first one returns an error. The only case for which this error wouldn't be caused by faulty TicketParams (I think) is the signature verification.

While I agree that it seems unlikely that in a batch of tickets one signature could be invalid while the others are valid, since it's not impossible it seems reasonable to continue validating the other signatures.

The main scenario for continuing through the rest of the tickets in a batch when hitting an error is when the error is due to params expiration and the reason for continuing is to check if any of the tickets won.

The one scenario where it might make sense not to continue with the rest of the tickets in a batch is if the basic ticket validity checks in Validator.ValidateTicket(), other than the signature check, fail (for the reason mentioned above: the non-signature checks are based on param values shared across all tickets, so if a single ticket fails one of these checks the rest would as well). Handling this scenario would require either checking for a specific error returned by Validator.ValidateTicket() to determine whether we should stop receiving tickets early, or refactoring Recipient and Validator to check the shared param values of tickets first and then loop through the unique param values (i.e. sender nonce + signature) of each ticket afterwards. I think the second option would be preferred, but I don't think implementing that option to handle this scenario is urgent right now.
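
For illustration, a sketch of that second option with hypothetical types: validate the fields shared across the batch once up front, then loop only over the per-ticket sender nonce and signature.

package pm

import "math/big"

// sharedTicketFields are the values common to every ticket in a batch.
type sharedTicketFields struct {
    FaceValue, WinProb, ExpirationBlock *big.Int
}

// perTicketFields are the values unique to each ticket in a batch.
type perTicketFields struct {
    SenderNonce uint32
    Sig         []byte
}

// validateBatch fails early if the shared fields are invalid (every ticket in
// the batch would be unredeemable), but keeps iterating through individual
// signatures since one bad signature says nothing about the others.
func validateBatch(
    shared sharedTicketFields,
    tickets []perTicketFields,
    validateShared func(sharedTicketFields) error,
    validateSig func(perTicketFields) error,
) []error {
    if err := validateShared(shared); err != nil {
        return []error{err}
    }

    var errs []error
    for _, t := range tickets {
        if err := validateSig(t); err != nil {
            errs = append(errs, err)
        }
    }
    return errs
}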

j0sh commented 4 years ago

One comment: extremely short ticket validity periods (eg, 2 blocks or ~30 seconds) are likely to make reliability worse, absent further engineering effort. Consider the following scenario:

  1. B has orchestrators O1, O2 in its active set
  2. O1 is performing well, so B continues using it
  3. O2 ticket expires
  4. O1 has a quick, transient issue that causes transcoding to take longer than realtime for one particular segment.
  5. B tries to fall back to O2. Ticket has expired, so B drops O2 from the working set. No other orchestrators available, so B also drops the segment.

One way to mitigate this would be for B to force a refresh on every expiration. But that introduces a level of chattiness into the networking protocol that I'm a bit uncomfortable with. Another option is to "rotate" among Os in the active set, but then B loses the ability to minimize latency.

kyriediculous commented 4 years ago

@j0sh very good point. Whenever we switch orchestrators it's actually very likely that we have expired TicketParams, since we only refresh when we don't have enough sessions.

We could force a refresh for the selected orchestrator, only once, if the TicketParams are expired. I'm not sure if that's a reasonable tradeoff to make here. The time for the request could be considered additional "warm-up time" for future selection algorithm updates, since after that B should just get updated params with each segment.

kyriediculous commented 4 years ago

I think the second option would be preferred, but I don't think implementing that option to handle this scenario is urgent right now.

I would actually prefer the first option since we already import pm into core/orchestrator.go anyway, so we could return a pm.ErrInvalidTicketSignature (I don't think we need an error type to assert on further up the chain here?). I think it might make sense to just include it given ticket batch sizes on mainnet.

So, it could make sense to do the param expiration check in ReceiveTicket() after the basic ticket validity checks to make sure that we always redeem any winning tickets received.

Gotcha 👍

From discord:

Nico: We would still want to create something like a blockNumberWatcher in eth/watchers; we could then inject that as an abstraction into the pm package.

Yondon: Do we need to create an additional type? Seems like we already have all the functionality that we need since the block watcher can be used to subscribe to block updates and the DB can be used to fetch the current block number.

I think using a blockwatch subscription would require the pm package to import the blockwatch package, as the type returned from the subscription channel is []*blockwatch.Events and the block number is nested in blockwatch.MiniHeader.

The pm package is currently structured to be low-level and tries not to import other parts of the go-livepeer codebase; given this, I'd prefer not to introduce blockwatch into pm.

The current prototype reads all values from the DB. For simply reading the state to check the latest block number, it queries the DB. For sending updates to the ticketQueue, we poll the DB and send updates into a channel.

Implementing a BlockNumWatcher would allow sender, recipient and ticketQueue to read from memory.
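
For example, pm could define a narrow interface of its own and have the concrete watcher injected from eth/watchers; the names below are hypothetical.

package pm

import "math/big"

// BlockNumGetter is the narrow view of the block watcher that pm needs; a
// concrete implementation (e.g. a TimeWatcher/BlockNumWatcher) would live in
// eth/watchers and be injected into the sender, recipient and ticketQueue.
type BlockNumGetter interface {
    // LastSeenBlock returns the latest block number observed on-chain.
    LastSeenBlock() *big.Int
    // SubscribeBlocks delivers the latest block number whenever a new block is seen.
    SubscribeBlocks(sink chan<- *big.Int)
}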

yondonfu commented 4 years ago

Handling expired payment params

extremely short ticket validity periods (eg, 2 blocks or ~30 seconds) are likely to make reliability worse, absent further engineering effort.

Yeah, good point @j0sh - thanks for bringing this up. Since longer validity periods don't solve the problem (if the validity period is N blocks and B sticks with the same O for N blocks, then B still has to deal with expiring params for its other sessions), it looks like we'll need a way for B to refresh the payment params for Os in its working set. This would be an additional refresh mechanism on top of the existing mechanisms for refreshing O's payment params (polling every hour to update cached payment params in the DB, and updating a session's payment params with those returned by O in a response).

The candidate mechanisms proposed thus far are:

  1. Check if the payment params are expired before submitting a segment using a session. If they are expired, first call GetOrchestrator() and update the session with the returned payment params.

    • Pros
      • Simple to implement
    • Cons
      • Extra network round trip prior to the first segment submitted whenever B uses a session with expired payment params, which adds to the latency of the first segment for the new O
  2. At a regular interval, loop through the sessions in the working set and for each session with expired payment params, call GetOrchestrator() to update the session's payment params

    • Pros
      • No additional latency when failing over
    • Cons
      • More complex implementation due to concurrent updates to the session list. The current session refresh mechanism only adds new sessions. This implementation would need to modify existing sessions.
      • More network traffic even when it is not needed. Since sessions are likely to have payment params that have similar expiration times, B will likely send N - 1 requests simultaneously when payment params expire (minus 1 because one session is likely being used for the current segment) where N is the size of the working set.
  3. Rotate through Os in the working set so that B receives up-to-date payment params with each response

    • Pros
      • Use an existing payment param refresh mechanism (update with params included in response)
    • Cons
      • B loses the ability to minimize latency because it has to use different Os
      • Couples the requirements of payment param refreshes to the selection strategy which feels sub-optimal - ideally the selection strategy would not have to be designed around the need to refresh payment params

I lean towards trying out option 1 right now because it is the simplest and if the extra network round trip for the first segment submitted when failing over to a session with expired payment params turns out to be a significant problem then we can consider option 2.

If we implement option 1 we should handle expiration checks in a different manner depending on when they are done:

The last two points are needed or else B could end up in the following infinite loop:

  1. B calls GetOrchestrator() because the expiration check failed
  2. O returns payment params that are expired
  3. B calls GetOrchestrator() because the expiration check failed.
  4. Repeat.

Failing early when validating a batch of tickets

I would actually prefer the first option since we already import pm into core/orchestrator.go anyway, so we could return a pm.ErrInvalidTicketSignature

We would probably want to keep iterating through the tickets if we encounter an invalid signature, but stop iterating through the tickets if we encounter any of the other errors in Validator.ValidateTicket() (since they would result in tickets that are not redeemable, and those params are shared with the rest of the tickets in the batch). I think this discussion is outside the scope of this project though, so I suggest we move it to a separate issue.

Delayed ticket redemption

The current structure of the pm package is to be low-level and tries to not import other parts of the go-livepeer codebase, given this effort I'd prefer not introducing blockwatch into pm.

Instead of tightly coupling the delayed ticket redemption with the block time by relying on blocks received by the block watcher, what about just using a fixed time interval for pulling tickets off the queue? In practice, the fixed time interval can be an approximation of the block time, but it doesn't need to be. A time interval slightly longer than the block time might actually reduce the number of required operations in certain circumstances - there is no point in pulling a ticket off the queue with each block if there are pending redemption txs that cause the sender's max float to be insufficient to cover the face value of the next ticket, because the ticket is just going to be re-added to the queue. Pulling a ticket off the queue after a period of time that exceeds the block time shouldn't be a big deal either, except for the scenario when we're close to the end of a round, but as we've already mentioned in this discussion we'll eventually need special-case logic for that anyway.
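
A minimal sketch of such a fixed-interval loop; queuedTicket and the redeem hook are placeholders for the existing ticketQueue machinery.

package pm

import (
    "math/big"
    "time"
)

type queuedTicket struct {
    ExpirationBlock *big.Int
}

// runRedemptionLoop pulls one ticket off the queue per tick instead of per
// block; the interval only needs to approximate the block time. Tickets whose
// params have not yet expired are put back for a later tick.
func runRedemptionLoop(
    interval time.Duration,
    queue chan *queuedTicket, // assumed buffered so re-adding doesn't block
    lastSeenBlock func() *big.Int,
    redeem func(*queuedTicket),
    quit <-chan struct{},
) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()

    for {
        select {
        case <-ticker.C:
            select {
            case t := <-queue:
                if lastSeenBlock().Cmp(t.ExpirationBlock) >= 0 {
                    // Params have expired: safe to redeem the winning ticket.
                    redeem(t)
                } else {
                    // Not expired yet: re-add for a later tick.
                    queue <- t
                }
            default:
                // Nothing queued this tick.
            }
        case <-quit:
            return
        }
    }
}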

One way to approach this would be:

yondonfu commented 4 years ago

@kyriediculous Given that the original motivation for this project was to make sure that payment errors do not degrade reliability, I think we should document a plan for testing that the feature described by this spec eliminates payment errors when B & O follow the payment param protocol described in the spec.

We should test the following reliability focused scenarios:

  1. B sends > 3 concurrent streams into a single O and the gas price returned by eth_gasPrice changes, causing O to create new payment params. B's reserve should be high enough such that B's max float always exceeds the face value set by O. Otherwise, the face value would not change even when the gas price changes because it would just be set to B's max float.

  2. B's working set contains O1 & O2. B sends > 3 concurrent streams into O1 for a period of time that exceeds the validity period for payment params. Then, O1 crashes leading B to failover to O2. This is the scenario described by Josh earlier in the thread that led to the decision to call GetOrchestrator() before submitting a segment if the payment params are expired.

Some thoughts on testing scenario 1:

Some thoughts on testing scenario 2:

If we're using the test-harness, it might be helpful to run these tests using GCP instead of running them locally to avoid any issues with local resources (used by the nodes we're spinning up).

kyriediculous commented 4 years ago

Yup, totally agree on testing the cases you mentioned. I'm wondering what an easy path is to test the added latency caused by an OrchestratorInfo refresh as described in scenario 2.

I opened #1383, which implements this spec (after I edit it) as well as the OrchestratorInfo refresh. Could you take a look at this as well, @j0sh, and give your opinion on retrying vs. background refreshing for all Os? Thanks!