livepeer / go-livepeer

Official Go implementation of the Livepeer protocol
http://livepeer.org
MIT License

Optimize fetching transcoder pool #2328

Open leszko opened 2 years ago

leszko commented 2 years ago

Livepeer fetches all active orchestrators from the chain. This happens in the TranscoderPool() function, which is called in a number of scenarios:

Fetching orchestrators happens sequentially: each transcoder is fetched by calling the on-chain contract function GetNextTranscoderInPool(). This process already takes almost 2 minutes, and it will grow linearly with the number of active orchestrators.

We need to research and implement an optimization to this process. Some first ideas:

I suggest first researching the solution space, describing and reviewing the options, and, once we agree on a solution, doing the actual implementation.

RiccardoBiosas commented 2 years ago

After trying out the second approach (a smart contract to aggregate the queries on-chain), it seems that all the initial performance benefits of moving the transcoder-related queries from go-livepeer to an on-chain helper contract are lost once, in addition to traversing the doubly linked list to retrieve all the transcoder addresses, the helper contract forwards a separate call to the BondingManager for each transcoder address to fetch its reward/fee parameters (the call is implemented here in the go-livepeer codebase). Using an Arbitrum mainnet fork as a benchmark environment and the current number of transcoders (83), the execution time increases from 22s-28s for traversing the doubly linked list (implemented here in go-livepeer) and returning an array of all transcoder addresses to over 4 minutes for traversing the list and fetching all the transcoder-specific data. Since the only reason the go-livepeer TranscoderPool has to be sequential is that it needs to traverse the doubly linked list, perhaps the ideal solution would be:

What do you think? @leszko @yondonfu

As for the two other proposed approaches:

leszko commented 2 years ago

Thanks for the update @RiccardoBiosas. Good analysis. I think the approach is good: we fetch all the transcoders and then enrich the data with concurrent calls. A few questions/comments:

  1. If we skip the additional data and just fetch all transcoder addresses, did you compare the time of getting the whole transcoder pool using the go-livepeer code vs. using your helper contract? I'd be interested in this comparison and in how the time changes with the number of orchestrators. Then we can decide whether to:
    1. Keep the code we have in go-livepeer, but first fetch the transcoder pool and later concurrently enrich the data
    2. Use your helper contract and then concurrently enrich the data
    3. Do neither and find another solution (e.g. subgraph)
  2. I agree with the point about The Graph; I think it's a worse solution for the reasons you mentioned
  3. We'll probably do caching anyway, because we call client.TranscoderPool() in too many places in the code. You're right!

RiccardoBiosas commented 2 years ago

@leszko After a bit of testing, it seems that the performance difference between the on-chain and off-chain traversal of the doubly linked list is fairly negligible. On Arbitrum mainnet it currently takes 28s-30s for go-livepeer to fetch all the transcoder addresses, without fetching any additional information such as LastRewardRound, RewardCut, and FeeShare.

leszko commented 2 years ago

> After a bit of testing, it seems that the performance difference between the on-chain and off-chain traversal of the doubly linked list is fairly negligible. On Arbitrum mainnet it currently takes 28s-30s for go-livepeer to fetch all the transcoder addresses, without fetching any additional information such as LastRewardRound, RewardCut, and FeeShare.

Ok, could you also check how much time it would take if we had 1000 Os? Would the time increase linearly, i.e. to something around 10*30s = 5 min?

RiccardoBiosas commented 2 years ago
The following time measurements were performed on Rinkeby, and they only cover traversing the doubly linked list (i.e. calling getFirstTranscoderInPool to get the head and then calling getNextTranscoderInPool until the tail) without fetching any transcoder-specific data. Since on Rinkeby the transcoder pool size is 27, I repeated the operation of fetching the entire list of transcoders until a given cap (100, 1000) was reached.

| context | 10 O | 100 O | 1000 O |
| --- | --- | --- | --- |
| go-livepeer | 0m1.7s | 0m17s | 2m50s |
| onchain sm | 0m7.469s | 0m14.6s | 0m19.8s |
red-0ne commented 2 years ago

Have you considered using the eth_getStorageAt RPC call instead of regular smart contract calls? Reading storage slots directly avoids EVM execution overhead. It could be even better if there were direct access to the Arbitrum node's database.

leszko commented 2 years ago

Thanks @RiccardoBiosas for the analysis.

IMO 2m50s is quite long for 1000 Os, because it means that as soon as we grow to 1000 Os, the broadcaster startup time is a few minutes. In that case, I'd lean toward your initial idea and use a smart contract for it. 20s as an initial time sounds acceptable, and it does not grow linearly.