vegaprotocol / vega

A Go implementation of the Vega Protocol, a protocol for creating and trading derivatives on a fully decentralised network.
https://vega.xyz
GNU Affero General Public License v3.0

Investigate, understand, and fix snapshot performance #5311

Closed wwestgarth closed 2 years ago

wwestgarth commented 2 years ago

Spike Overview

In order to understand the issues with the performance of snapshots, we will investigate the issue seen in known environments, so that we have a plan for how to improve the performance, and possibly implement any quick wins.

Acceptance Criteria

How do we know when this spike is ready to either drop or move into technical tasks:

Additional Details (optional)

Environments where performance issues have been seen:

Known potential slow areas:

Possible methods to investigate:

Some data from a recent investigation:

#### snapshot took  {"time": 2.172512371}
The main offenders:
saving new snapshot {"time": 0.686214668}
banking.seen took   {"time": 0.704053724}
eventforwarder.all  {"time": 0.384269236}
pow.pow             {"time": 0.028102784}

The above four took a total of 1.802640412, which is roughly 83% of the total time.

related: https://github.com/vegaprotocol/vega/issues/4243

It may be worth considering alternative sorting algorithms for the engines that are known to have large data-sets to sort. https://github.com/twotwotwo/sorts and https://github.com/jfcg/sorty are examples and have been briefly investigated.

wwestgarth commented 2 years ago

Playing around with some quicker/multi-threaded sorting packages, it turns out that the sorting is only part of the slowness -- the calls to proto.Marshal() and then to vgcrypto.Hash() are also taking a fair chunk of time.

For banking.seen we have a speed up in sorting times:

sort.Slice()  sorty.SortSlice()
0.576144895   0.120757528

but the marshalling of the sorted array, and the hashing of that byte-string to be returned by GetHash(), each take ~0.1s, so even with the speed up about 2/3 of the time is spent outside of the sorting.

For evtforward.all:

sort.Slice()     sorty.SortSlice()
0.149291711      0.04777019

but again we have similar time spent of ~0.1s on both the marshalling and the hashing. For both banking and evtforward there are around 350,000 "things" being sorted and serialised.

So switching to a different sorting algorithm does help, but not as much as thought. To be able to shave off any more time we'll have to go down the route of spawning a goroutine for each snapshot-provider at the start of taking a snapshot, and read the results when they're done so that all the marshalling/hashing is done concurrently.
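The goroutine-per-provider approach described above could look roughly like the sketch below. It is not the change that was eventually merged: the `provider` and `result` types and `snapshotAll` are hypothetical, the serialisation is a trivial stand-in for proto.Marshal(), and SHA-256 stands in for vgcrypto.Hash().

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"sync"
)

// provider is a hypothetical stand-in for a snapshot provider.
type provider struct {
	Name string
	Keys []string
}

// result holds the serialised state and its hash for one provider.
type result struct {
	Name string
	Data []byte
	Hash [32]byte
}

// snapshot does the sort, serialise, and hash work for one provider.
func (p provider) snapshot() result {
	keys := append([]string(nil), p.Keys...) // copy so provider state is untouched
	sort.Strings(keys)
	var data []byte
	for _, k := range keys {
		data = append(data, k...)
		data = append(data, 0) // NUL separator between keys
	}
	return result{Name: p.Name, Data: data, Hash: sha256.Sum256(data)}
}

// snapshotAll starts one goroutine per provider and waits for all of them.
// Each goroutine writes only to its own index, so no locking is needed and
// the output order matches the input order however the goroutines finish.
func snapshotAll(providers []provider) []result {
	results := make([]result, len(providers))
	var wg sync.WaitGroup
	for i, p := range providers {
		wg.Add(1)
		go func(i int, p provider) {
			defer wg.Done()
			results[i] = p.snapshot()
		}(i, p)
	}
	wg.Wait()
	return results
}

func main() {
	providers := []provider{
		{Name: "banking.seen", Keys: []string{"tx3", "tx1", "tx2"}},
		{Name: "eventforwarder.all", Keys: []string{"evt2", "evt1"}},
		{Name: "pow.pow", Keys: []string{"block9"}},
	}
	for _, r := range snapshotAll(providers) {
		fmt.Printf("%-20s %d bytes, hash prefix %x\n", r.Name, len(r.Data), r.Hash[:4])
	}
}
```

Collecting results by index keeps the combined snapshot deterministic, which matters when every node has to arrive at the same hash. With this shape the wall-clock time approaches that of the slowest provider, not the sum over all providers, assuming enough cores are available.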

jeremyletang commented 2 years ago

Alright, yea concurrency seems like the only way to go. We didn't do it earlier just to keep things simple, but we should go for it now.

ze97286 commented 2 years ago

Closed by #5334 #5335 #5344 #5347 #5350 #5357