We can simply extend the current TicketParams protobuf message with a new field
If the expiration block also applies to the price then it might make sense to add the field to OrchestratorInfo
i.e.
```protobuf
message OrchestratorInfo {
  ...
  int64 payment_params_expiration_block;
}
```
A set of TicketParams T would have to remain valid until the next advertised set of TicketParams T+1 is able to be received by the Broadcaster so that the Broadcaster is given ample time to update without disrupting transcoding due to minor latency discrepancies.
As we discussed on our call, another factor to consider in setting the expiration time is the lack of synchronization of the view of the latest block for B & O - the two parties may not always have the same view of the latest block. An expiration time of >= 2 blocks should help with this.
To keep track of previously advertised ticket parameters we can create a mapping on the recipient to replace the current recipient.invalidRands map.
As mentioned on our call, we don't need to do any additional bookkeeping as long as we include the `expirationBlock` as an input to the HMAC along with the ticket `faceValue` and `winProb`. The `recipientRand` generated by O will not be the pre-image for the `recipientRandHash` included in the ticket sent by B if O did not previously advertise the `faceValue` and `winProb` of the ticket with the provided `expirationBlock`.
Not including PriceInfo in this validation scheme could be fine as accounting should make sure an Orchestrator is paid fairly.
As we discussed on our call, we do need to validate the price or else the following scenario is possible:
Including the price as an input in the HMAC allows O to validate that it previously advertised the price and that it is not yet expired. B would expect O to accept the price included in a payment as long as the price is not yet expired.
Within the current version of the payment protocol which uses deterministic signatures
Slight correction: The current PM protocol actually does not use deterministic signatures since the ECDSA scheme used by the node is a randomized signature scheme. An example of a deterministic signature scheme would be RSA.
Instead of trying to redeem tickets in the queue when the maxFloat for a sender changes, we can run the loop upon discovery of a new block. The existing check for sufficient max float from the sender to redeem the ticket can then be extended to also check the ticket's expiration.
So this would turn the queue from a "retry queue" where we only add tickets if they need to be retried into a general "redemption queue" where we add all tickets that should be redeemed. I think this should work. The queue will need a way to access the max float (which is currently stored in the `SenderMonitor`) every time a new block is received. At the moment, the queue only receives the max float value when the max float changes as signaled by the `SenderMonitor`.
Tracking accounting balances per Broadcaster address instead of Manifest ID is still an open issue and area of discussion. To ensure sufficient safeguarding by the accounting mechanism it could be included in the scope of this project.
I think this can be considered separately from this project.
If the expiration block also applies to the price then it might make sense to add the field to OrchestratorInfo i.e.
Correct, this was a consideration made due to the section about excluding price from the equation.
the two parties may not always have the same view of the latest block. An expiration time of >= 2 blocks should help with this.
A `blockPolling` interval smaller than 1/2 the block time could also help with synchronization issues: the faster we retry after missing a block, the sooner the views are back in sync. The default is currently 5 seconds, so I think this should be fine overall.
Increasing the expiration time any further will impact the transaction cost overhead assigned to the ticket at creation time and the actual transaction cost at redemption time. We have to consider a good balance here and I think `lastSeenBlock <= creationBlock + 1` constituting non-expired would be a good fit as discussed.
The queue will need a way to access the max float (which is currently stored in the SenderMonitor) every time a new block is received. At the moment, the queue only receives the max float value when the max float changes as signaled by the SenderMonitor.
Would it though? I see two options here.
1. Remove the maxFloat check in the `ticketQueue` consume loop when trying to send a ticket into the redeem channel; we do this check in `recipient.redeemWinningTicket` anyway and the ticket is re-added to the queue if the float isn't sufficient.
2. Call `maxFloat` and signal it to the queue before adding a ticket to the queue.

I like option 1 better but we do change the purpose of the queue as you mentioned. It also removes the need to do any additional state tracking on the `ticketQueue`.
Example:
I think this can be considered separately from this project.
Gotcha
The following is a copy/paste from discord but it's a consideration I'm having currently as well:
Currently whenever we encounter a `recipient.ReceiveTicket()` error in `orch.ProcessPayment()` we will:
For bullet point 2, this case no longer exists: after removing acceptable/unacceptable errors, if `recipient.ReceiveTicket()` returns an error the ticket will never be winning either. Previously we would check in `recipient.acceptTicket()` whether an error was acceptable.
For bullet point 1, however, there is the question of whether we should still continue with `recipient.ReceiveTicket()` for each ticket in the payment if the first one returns an error. The only case in which this error wouldn't be caused by faulty `TicketParams` (I think) is signature verification. For all other cases we might as well stop trying to receive tickets. But then again, I wonder in what case we would have faulty signatures for only a couple of tickets in the batch, assuming nodes are following protocol.
remove the maxFloat check in the ticketQueue consume loop when trying to send a ticket into the redeem channel , we do this check in recipient.redeemWinningTicket anyway and the ticket is re-added to the queue if the float isn't sufficient.
Good point. This sounds good to me.
For bullet point 2, this case no longer exists
We can still receive a winning ticket even if an error is returned if we check the param expiration after running the basic ticket validity checks (via `Validator.ValidateTicket()`). As long as the basic ticket validity checks pass, the ticket can still be redeemable on-chain if it wins. So, it could make sense to do the param expiration check in `ReceiveTicket()` after the basic ticket validity checks to make sure that we always redeem any winning tickets received. If we do this, then the caller of `ReceiveTicket()` would still do something like:

```go
_, won, err := recipient.ReceiveTicket(...)
if won {
	recipient.RedeemWinningTicket(...)
}
if err != nil {
	// Handle error
}
```
For bullet point 1 however there is the question then if we should still continue with recipient.ReceiveTicket() for each ticket in the payment if the first one returns an error. The only case for which this error wouldn't be caused by faulty TicketParams (I think) is the signature verification.
While I agree that it seems unlikely that in a batch of tickets one signature could be invalid while the others are valid, since it's not impossible it seems reasonable to continue validating the other signatures.
The main scenario for continuing through the rest of the tickets in a batch when hitting an error is when the error is due to params expiration and the reason for continuing is to check if any of the tickets won.
The one scenario where it might make sense to not continue with the rest of the tickets in a batch is if the basic ticket validity checks in `Validator.ValidateTicket()` besides the signature check fail (for the reason mentioned above: the other non-signature checks are based off of param values shared across all tickets, so if a single ticket fails one of these checks the rest would as well). Handling this scenario would require either checking for a specific error returned by `Validator.ValidateTicket()` to determine whether we should stop receiving tickets early, or refactoring `Recipient` and `Validator` to check the shared param values of tickets first and then loop through the unique param values (i.e. sender nonce + signature) of tickets after. I think the second option would be preferred, but I don't think implementing that option to handle this scenario is urgent right now.
One comment: extremely short ticket validity periods (eg, 2 blocks or ~30 seconds) are likely to make reliability worse, absent further engineering effort. Consider the following scenario:
One way to mitigate this would be for the B to force a refresh for every expiration. But that introduces a level of chattiness into the networking protocol that I'm a bit uncomfortable with. Another option is to "rotate" among Os in the active set, but B loses the ability to minimize latency.
@j0sh very good point: whenever we switch orchestrators it's actually very likely that we received expired `TicketParams` since we only refresh when we don't have enough sessions.
We could force a refresh for the selected orchestrator only once if the ticketparams are expired. I'm not sure if that's a reasonable tradeoff to make here. The time for the request could be considered additional "warm-up time" for future selection algo updates since after that the B should just get updated params with each segment.
I think the second option would be preferred, but I don't think implementing that option to handle this scenario is urgent right now.

I would actually prefer the first option since we already import `pm` into `core/orchestrator.go` anyway, so we could return a `pm.ErrInvalidTicketSignature` (I don't think we need an error type to assert upstream here?). I think it might make sense to just include it given ticket batch sizes on mainnet.
So, it could make sense to do the param expiration check in ReceiveTicket() after the basic ticket validity checks to make sure that we always redeem any winning tickets received.
Gotcha 👍
From discord:
Nico: we would still want to create something like a blocknumberWatcher in eth/watchers we could then inject that abstracted in the pm package
Yondon: Do we need to create an additional type? Seems like we already have all the functionality that we need since the block watcher can be used to subscribe to block updates and the DB can be used to fetch the current block number.
I think using a blockwatch subscription would require the `pm` package to import the `blockwatch` package, as the type returned from the subscription channel is `[]*blockwatch.Events` and the block number is nested in `blockwatch.MiniHeader`.

The current structure of the `pm` package is to be low-level and it tries not to import other parts of the go-livepeer codebase; given this effort I'd prefer not introducing `blockwatch` into `pm`.
The current prototype reads all values from the DB: to check what the latest block number is, it simply queries the DB, and to send updates to the `ticketQueue` it polls the DB and sends updates into a channel.
Implementing a `BlockNumWatcher` would allow `sender`, `recipient` and `ticketQueue` to read from memory.
extremely short ticket validity periods (eg, 2 blocks or ~30 seconds) are likely to make reliability worse, absent further engineering effort.
Yeah good point @j0sh - thanks for bringing this up. Since longer validity periods don't solve the problem (if the validity period is N blocks and B sticks with the same O for N blocks then B still has to deal with expiring params for its other sessions), looks like we'll need a way for B to refresh the payment params for Os in its working set. This would be an additional refresh mechanism on top of the existing payment param refresh mechanisms used by O (polling every hour to update cached payment params in the DB and updating a session's payment params with those returned by O in a response).
The candidate mechanisms proposed thus far are:
1. Check if the payment params are expired before submitting a segment using a session. If they are expired, first call `GetOrchestrator()` and update the session with the returned payment params.
2. At a regular interval, loop through the sessions in the working set and for each session with expired payment params, call `GetOrchestrator()` to update the session's payment params.
3. Rotate through Os in the working set so that B receives up-to-date payment params with each response.
I lean towards trying out option 1 right now because it is the simplest and if the extra network round trip for the first segment submitted when failing over to a session with expired payment params turns out to be a significant problem then we can consider option 2.
If we implement option 1 we should handle expiration checks in a different manner depending on when they are done relative to calling `GetOrchestrator()`.

The last two points are needed or else B could end up in the following infinite loop: call `GetOrchestrator()` because the expiration check failed, then call `GetOrchestrator()` again because the expiration check failed, and so on.

I would actually prefer the first option since we already import `pm` into `core/orchestrator.go` anyway, so we could return a `pm.ErrInvalidTicketSignature`
We would probably want to keep iterating through the tickets if we encounter an invalid signature, but stop iterating through the tickets if we encounter any of the other errors in `Validator.ValidateTicket()` (since they would result in tickets that are not redeemable and those params are shared with the rest of the tickets in the batch). I think this discussion is outside of the scope of this project though, so I suggest we move it to a separate issue.
The current structure of the pm package is to be low-level and tries to not import other parts of the go-livepeer codebase, given this effort I'd prefer not introducing blockwatch into pm.
Instead of tightly coupling the delayed ticket redemption with the block time by relying on blocks received by the block watcher, what about just using a fixed time interval for pulling tickets off the queue? In practice, the fixed time interval can be an approximation of the block time, but it doesn't need to be. A time interval that is slightly longer than the block time might actually reduce # of required operations in certain circumstances - there is no point in pulling a ticket off the queue with each block if there are pending redemption txs that cause the sender's max float to be insufficient to cover the face value of the next ticket because the ticket is just going to be re-added to the queue. Pulling a ticket off the queue after a period of time that exceeds the block time shouldn't be a big deal either except for the scenario when we're close to the end of a round, but as we've already mentioned in this discussion we'll eventually need special case logic for that anyway.
One way to approach this would be:

- Change `SignalMaxFloat()` to accept zero params; it just uses a `struct{}` chan to signal to the queue loop to pull a ticket from the queue.
- At a fixed time interval in `SenderMonitor.startCleanupLoop()` (maybe rename this function as well to reflect that it runs cleanup operations and ticket queue operations), call `SignalMaxFloat()`
on each of their queues.

@kyriediculous Given that the original motivation for this project was to make sure that payment errors do not degrade reliability, I think we should document a plan for testing that the feature described by this spec eliminates payment errors when B & O follow the payment param protocol described in the spec.
We should test the following reliability focused scenarios:
B sends > 3 concurrent streams into a single O and the gas price returned by `eth_gasPrice` changes, causing O to create new payment params. B's reserve should be high enough such that B's max float always exceeds the face value set by O. Otherwise, the face value would not change even when the gas price changes because it would just be set to B's max float.
B's working set contains O1 & O2. B sends > 3 concurrent streams into O1 for a period of time that exceeds the validity period for payment params. Then, O1 crashes, leading B to fail over to O2. This is the scenario described by Josh earlier in the thread that led to the decision to call `GetOrchestrator()` before submitting a segment if the payment params are expired.
Some thoughts on testing scenario 1:
Some thoughts on testing scenario 2:
If we're using the test-harness, it might be helpful to run these tests using GCP instead of running them locally to avoid any issues with local resources (used by the nodes we're spinning up).
Yup, totally agree with testing the cases you mentioned. I'm wondering what an easy path is to test the added latency caused by an `OrchestratorInfo` refresh as described in scenario 2.
I opened #1383 that implements this spec (after I edit it) as well as the `OrchestratorInfo` refresh. Could you take a look at this as well @j0sh and give your opinion on the retry vs background refreshing for all Os? Thanks!
TicketParams Expiration

name: TicketParams expiration
about: Add an expiration block number to TicketParams
author: @kyriediculous (based on an original idea from @yondonfu)

Abstract
`OrchestratorInfo` is the data structure that is communicated over the wire from an Orchestrator to a Broadcaster. It contains two fields important for payments to work: pricing information and ticket parameters for the probabilistic micropayments protocol. [1] go_livepeer/net/lp_rpc.proto

Due to the client-side implementation of the payment protocol, this information is subject to regular change: to avoid dealing with gas price fluctuations on the Ethereum network when redeeming winning tickets, a transaction cost overhead is charged. Other reasons for change of `TicketParams` would be a winning ticket redemption on-chain or a manual change in the node configuration.

When `OrchestratorInfo` changes mid-stream a Broadcaster is unaware that the old parameters will no longer be honoured by an Orchestrator, resulting in a payment error and interruption of transcoding. The Orchestrator currently offers a grace period, but that could prove to be a game-able mechanic, allowing a Broadcaster to send any `TicketParams` while in the grace period.

We can remove the current grace period implementation and instead introduce the concept of an expiration time on `TicketParams`. This would:

- Give the Broadcaster protocol certainty that payments with non-expired parameters are honoured, and thus transcoding is not interrupted, if an Orchestrator is not deviating from the protocol.
- Give an Orchestrator protocol certainty that it will always receive payments with valid parameters and thus redeemable tickets, as long as the payment parameters are not expired and the Broadcaster is not deviating from the protocol.
- Prevent the transcoding workflow from being interrupted by payment errors if both nodes are behaving according to the protocol rules.
Motivation
The feature should increase transcoding reliability: payment errors should no longer occur during a happy flow, and it removes the issue of the number of concurrent streams from a Broadcaster exceeding the maximum acceptable error count. The expiration scheme is an improvement over the current implementation of the grace period, which introduces several issues of its own:

- The current grace period is set at 3 acceptable payment errors. When a Broadcaster sends in more concurrent streams than this, it will inevitably run over the error count when `OrchestratorInfo` changes, even when neither party deviates from the protocol [2].
- The Broadcaster is unaware whether a particular set of `OrchestratorInfo` will be honoured by the Orchestrator it is working with while still sending a payment along with its transcode request. This can be exploited by Orchestrators by setting large negative balances for any Broadcaster and a high ticket winning probability, allowing them to potentially redeem winning tickets without doing work.
- The grace period creates a new attack vector on the Orchestrator whereby a Broadcaster can get free transcoding:
Given this set of known issues there are two options:

Rather than trying to limit the attack surface of the grace period mechanism, which would involve adding more complexity to an already complex mechanism, we can spend slightly more effort building a more robust solution.

The expiration time mechanism is unaffected by the number of concurrent streams a Broadcaster operates, allowing it to immediately scale without additional implementation considerations.

A simpler solution will also make the `pm` package in `go-livepeer` more maintainable and accessible to new engineers.

The benefits mentioned above would outweigh the downsides: the extra engineering effort created by stripping out the grace period logic completely, as well as having to add a few additional bytes to transcoding responses sent over the wire with each segment by having an Orchestrator attach its current `TicketParams` to each response.

[2] https://github.com/livepeer/go-livepeer/issues/1075
Proposed Solution
From a high-level the proposed solution would look as follows:
- A new field `ExpirationTime` is added to the `OrchestratorInfo` message; this will contain a timestamp until which the current parameters will be valid.
- The Broadcaster can check whether the `OrchestratorInfo` contains a valid `ExpirationTime` during payment generation when creating its transcode request. At this point in the process the current implementation of `go-livepeer` already validates the price and ticket parameters. In case `OrchestratorInfo` is past expiration when generating a payment, the Broadcaster will request the latest `OrchestratorInfo` from the Orchestrator. If no valid `OrchestratorInfo` is provided upon this refresh, the Orchestrator will be removed from the working set.
- The Orchestrator can keep track of previously advertised sets of `OrchestratorInfo` that have not expired and cross-check received tickets and their expiration time to see whether it should accept or reject an inbound transcoding request. It can do this by binding ticket parameters to the `recipientRand` (HMAC) generation.
- To allow the Broadcaster to always have access to a non-expired set of `OrchestratorInfo`, the data should be sent back by the Orchestrator in each response to a transcode request.

Implementation Tasks and Considerations
Global clock
We'll need a logical clock to give participants in the system a globally synchronised view of time. We can use the block numbers of blocks in the Ethereum network to indicate expiration time.

If we use block numbers instead of block timestamps we don't need to alter the current `eth/blockwatch` package (it only saves the block number) or create an additional service that queries a remote Ethereum node for the block timestamp when a new block is found.

Since the block timestamp moves in lockstep with the block number (block timestamps are part of block validation to make sure they are within certain bounds), there should be no practical difference.
Extending OrchestratorInfo

We can simply extend the current `OrchestratorInfo` protobuf message with a new field `ExpirationBlock` with an ascending index. Broadcaster nodes that don't upgrade their node software will ignore the field.

Analogously we can adjust `pm.TicketParams` to include both `ExpirationBlock *big.Int` and `PricePerPixel *big.Rat` so we have access to these values inside the `sender` and `recipient`.

Default Expiration Time
A set of `TicketParams` T would have to remain valid until the next advertised set of `TicketParams` T+1 is able to be received by the Broadcaster, so that the Broadcaster is given ample time to update without disrupting transcoding due to minor latency discrepancies.

Since we'll be using block numbers, the expiration time will be at least the block time of the Ethereum network (currently ~12 seconds).

Given that an Orchestrator needs to respond within 8 seconds anyway, that puts the maximum possible time required for a Broadcaster to receive the updated `TicketParams` at `blockTime + 8 seconds`.
Using block numbers avoids fluctuations in block time, and the expiration time would have to be 2 blocks. As a result, `TicketParams` generated at block height `N` will be valid up until (but not including) block height `N+2`.

TicketParams validation

An orchestrator needs to make sure that a received payment comes with parameters that it has previously advertised.
To do this we can incorporate the ticket parameters when generating the recipientRand (as an HMAC).
By binding the ticket params to the HMAC we can check that the `recipientRandHash` from a received ticket is the same as one previously advertised: the Orchestrator reconstructs the `recipientRand` from the parameters provided with the ticket and checks it against the `recipientRandHash` on the ticket. Since the HMAC is generated with a secret only the Orchestrator knows, it is impossible for a Broadcaster to forge ticket parameters.

Reworking RoundsWatcher into a more generic TimeWatcher
The last seen block number needs to be read in three different places:

- the `sender` when generating a payment (this is part of the transcoding workflow)
- the `recipient` when generating ticket parameters and when validating received tickets
- the `ticketQueue` to check whether tickets created from expired `OrchestratorInfo` can be redeemed

The easiest path to achieve this is to extend the functionality of the `roundsWatcher`, which currently watches for the latest round number and the block hash of the block it was initialised in, to keep track of the last seen block number as well. Since the `roundsWatcher` already subscribes to the `blockwatcher` this can be as simple as keeping track of a single additional state variable.

Alternatively we could use the DB to query this information since it is already stored there, but this would require additional dependency injection as well as an alternative polling mechanism for the `ticketQueue` rather than relying on a channel (see Delay Ticket Redemption Until TicketParams Expiration).

Delay Ticket Redemption Until TicketParams Expiration

Currently winning tickets are redeemed on-chain instantly when possible upon receiving them, revealing the current `recipientRand` value in the commit-reveal scheme.

This opens up a new attack vector in the case of the `TicketParams` not yet being expired: the Orchestrator would still honour the revealed `recipientRand`, but it would be possible for a Broadcaster to grind out `ticket.senderNonces` that produce signatures over the ticket resulting in non-winning tickets only (recall that a ticket wins when `KECCAK256(sig, recipientRand) < winProb`).
Within the current version of the payment protocol, which uses ECDSA signatures and a commit-reveal scheme, the easiest solution to implement right now would be to delay winning ticket validation until after expiration of the `TicketParams` that constitute the ticket.

We could leverage the existing `ticketQueue` for this. A caveat here is that tickets would be stored and retried on a per-sender basis despite ticket expiration being global across senders. This should be ok however.

`ExpirationBlock` can be added to `pm.TicketParams` as well as `pm.Ticket` (when using `pm.NewTicket()` and `batch.Ticket()`), giving access to a ticket's expiration block in the `ticketQueue`.

Instead of trying to redeem tickets in the queue when the `maxFloat` for a sender changes, we can run the loop upon discovery of a new block. To achieve this we can create a subscription on `TimeWatcher` that allows us to pass in a sink channel (`chan<- *big.Int`) on which to receive updates.
The longer term solution would be to remove interactivity from the winning ticket selection protocol by moving from the ECDSA-based commit-reveal scheme to a non-interactive protocol such as Verifiable Random Functions. The unknown is whether the cost of a VRF implementation to be executed on the EVM is warranted.
Refresh TicketParams when failing over
It is possible, whenever the Broadcaster needs to fail over and select a new Orchestrator, that the `OrchestratorInfo` for the new Orchestrator has expired if we haven't refreshed the session recently. Consider the following scenario:

The quick solution here would be to just retry submitting the segment once, directly querying the selected Orchestrator for updated `OrchestratorInfo`. The latency added by this request should only be a couple hundred milliseconds at the maximum; otherwise the Orchestrator will likely be slow to respond with transcoded results as well.

In case the latency is found to be excessive in the first iteration of testing we can move to a background service that updates `OrchestratorInfo` for each orchestrator in the working set.

Testing Tasks and Considerations
- Assert no payment errors when Orchestrator and Broadcaster behave according to protocol rules
- Assert that an Orchestrator honours previously advertised, non-expired `TicketParams`
- Assert that the Orchestrator responds with a payment-related error when the `TicketParams` are either invalid or have expired
- Assert that the Broadcaster correctly fails over if an Orchestrator doesn't honour previously advertised, non-expired `TicketParams`
- Assert that the Broadcaster fails over when generating a payment when the Orchestrator advertises out of date `TicketParams`
- Assert that `OrchestratorInfo` is returned in every successful transcode response. It doesn't need to be returned if transcoding is not successful because the Broadcaster will fail over anyway.
- Assert that when a Broadcaster deviates from the protocol by using any price, it will run into an insufficient balance error after the first segment with a payment of 1 ticket with valid parameters
TicketParams
have expiredKnown Unknowns
Tracking accounting balances per Broadcaster address instead of Manifest ID is still an open issue and area of discussion. To ensure sufficient safeguarding by the accounting mechanism it could be included in the scope of this project.
The usage cost and security of current open-source VRF implementations is still an unknown. Neither do we know the engineering cost associated with writing a cost-efficient, secure low level implementation currently.
Switching from ECDSA signatures to VRF's would potentially require an additional security audit regardless of using an open-source or home tailored implementation as it is a significant change and is related heavily to the security of the payment protocol.
End-of-round optimisations get added to tech debt
Alternatives
Dynamic acceptable error count
Gracefully Retry on payment error
Additional Context
Original discussion