ledgerwatch / erigon-lib

Dependencies of Erigon project, rewritten from scratch and licensed under Apache 2.0

Improve performance of Compressor vis-à-vis CompressorSequential #278

Open AlexeyAkhunov opened 2 years ago

AlexeyAkhunov commented 2 years ago

Both can be found in the compress package. CompressorSequential was written for optimal performance in a single thread. Compressor (formerly known as ParallelCompressor) is used for prototypes and experiments and therefore aims to utilise maximum resources so that prototypes run faster. But maintaining two variants of the same code is error-prone. Aggregator (part of the Erigon2 prototype) has been switched to Compressor (the parallel compressor) and now runs slower. My suspicion is that the parallel compressor wastes a lot of time on dispatching work, on scheduling, and on the extra memory allocations needed to ensure thread safety. We would like to profile those areas and optimise them.
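One rough way to check the dispatch-overhead suspicion before reaching for a profiler is to time a deliberately cheap unit of work run inline against the same work sent word by word over a channel. The sketch below is not erigon-lib code: `work`, `runSequential`, `runParallel` and `makeWords` are hypothetical stand-ins, and the timings it prints will vary by machine.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// work is a hypothetical stand-in for compressing one word. It is kept
// deliberately cheap so that dispatch cost dominates the measurement.
func work(word []byte) int {
	n := 0
	for _, b := range word {
		n += int(b)
	}
	return n
}

// runSequential processes every word in the calling goroutine.
func runSequential(words [][]byte) int {
	total := 0
	for _, w := range words {
		total += work(w)
	}
	return total
}

// runParallel sends each word over a channel to a pool of workers,
// paying one channel send and one receive per word.
func runParallel(words [][]byte, workers int) int {
	in := make(chan []byte, 1024)
	partial := make([]int, workers) // one slot per worker, no sharing
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			for w := range in {
				partial[i] += work(w)
			}
		}(i)
	}
	for _, w := range words {
		in <- w
	}
	close(in)
	wg.Wait()
	total := 0
	for _, p := range partial {
		total += p
	}
	return total
}

// makeWords builds n tiny synthetic words.
func makeWords(n int) [][]byte {
	words := make([][]byte, n)
	for i := range words {
		words[i] = []byte{byte(i), byte(i >> 8), byte(i >> 16)}
	}
	return words
}

func main() {
	words := makeWords(1_000_000)

	start := time.Now()
	a := runSequential(words)
	seq := time.Since(start)

	start = time.Now()
	b := runParallel(words, 4)
	par := time.Since(start)

	fmt.Println("same result:", a == b, "sequential:", seq, "parallel:", par)
}
```

If the parallel variant comes out slower on such small items, that points at per-item dispatch rather than the compression work itself. A CPU profile of the real Compressor would still be needed to confirm where the time goes.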

For more context, in production, it is likely we will run the compressor in a SINGLE background thread. So it may not even need to spawn goroutines in that mode. Parallel mode would only be used for experiments and prototypes.
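A minimal sketch of what "not spawning goroutines in that mode" could look like, assuming a single entry point that takes a worker count. The function `process` and its signature are hypothetical and are not the existing Compressor API.

```go
package main

import (
	"fmt"
	"sync"
)

// process applies fn to every item and returns the results in input order.
// With workers <= 1 it runs inline in the calling goroutine: no channels,
// no goroutines, no synchronisation.
func process(items []string, workers int, fn func(string) string) []string {
	out := make([]string, len(items))
	if workers <= 1 {
		for i, it := range items {
			out[i] = fn(it)
		}
		return out
	}

	// Parallel mode: each goroutine fills its own contiguous range of out,
	// so no locking is needed and the result order never changes.
	chunk := (len(items) + workers - 1) / workers
	var wg sync.WaitGroup
	for start := 0; start < len(items); start += chunk {
		end := start + chunk
		if end > len(items) {
			end = len(items)
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			for i := start; i < end; i++ {
				out[i] = fn(items[i])
			}
		}(start, end)
	}
	wg.Wait()
	return out
}

func main() {
	mark := func(s string) string { return "[" + s + "]" }
	items := []string{"header", "body", "receipt"}
	fmt.Println(process(items, 1, mark)) // single-thread path
	fmt.Println(process(items, 4, mark)) // parallel path
}
```

With this shape, one code path serves both production and experiments, which would also address the concern about maintaining two variants.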

Beyond the Erigon2 prototype, the compressor is currently used to package block header and block body snapshots. The requirement there (as well as in the Erigon2 prototype) is that optimisations do not change the resulting compressed file. Also, regardless of the number of workers, the resulting compressed file should be the same. However, if we find an optimisation that requires a change of the file format, we will definitely consider it!
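The "same file regardless of number of workers" requirement can be sketched as a small check: tag each job with a sequence number, let workers finish in any order, and have a single collector write results strictly in sequence. Everything below is an assumption for illustration (`encode` is a trivial stand-in for per-word compression, and `compress` is not the real Compressor).

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sync"
)

type job struct {
	seq  int
	word []byte
}

type result struct {
	seq int
	out []byte
}

// encode is a hypothetical stand-in for compressing one word:
// a one-byte length prefix followed by the word itself.
func encode(word []byte) []byte {
	return append([]byte{byte(len(word))}, word...)
}

// compress encodes words on the given number of goroutines and writes the
// results in input order, whatever order the workers finish in.
func compress(words [][]byte, workers int) []byte {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan job)
	results := make(chan result)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				results <- result{seq: j.seq, out: encode(j.word)}
			}
		}()
	}

	// Feed jobs, then close results once every worker has finished.
	go func() {
		for i, w := range words {
			jobs <- job{seq: i, word: w}
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()

	// Collector: hold out-of-order results until their turn comes.
	var buf bytes.Buffer
	pending := map[int][]byte{}
	next := 0
	for r := range results {
		pending[r.seq] = r.out
		for {
			out, ok := pending[next]
			if !ok {
				break
			}
			buf.Write(out)
			delete(pending, next)
			next++
		}
	}
	return buf.Bytes()
}

func main() {
	words := [][]byte{[]byte("block"), []byte("header"), []byte("body"), []byte("snapshot")}
	ref := sha256.Sum256(compress(words, 1))
	for _, n := range []int{2, 4, 8} {
		same := sha256.Sum256(compress(words, n)) == ref
		fmt.Println(n, "workers, identical to 1 worker:", same)
	}
}
```

Comparing a hash of the output across worker counts, as `main` does here, is the kind of regression check that would catch an optimisation that accidentally changes the compressed file.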

AskAlexSharov commented 2 years ago

I just removed persistence of the dictionary file in https://github.com/ledgerwatch/erigon-lib/pull/283 but the performance issue still exists (I mean, this issue is still valid).

AskAlexSharov commented 2 years ago

Superstrings are now created immediately instead of being written to a file first, by https://github.com/ledgerwatch/erigon-lib/pull/284 . We still need to create the uncompressedFile file, because we need to read the data twice (for reducedict). The sequential compressor also does this. We need to add the same trick here as in ETL: create uncompressedFile only when the data is larger than etl.BufferOptimalSize.
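A rough sketch of that trick, under the assumption that it means "keep the data in memory and only create a file once it grows past a threshold". The `spillBuffer` type and its `limit` field are hypothetical. `limit` plays the role that etl.BufferOptimalSize would play in the real code, and this is not how ETL itself is implemented.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// spillBuffer keeps written data in memory until it would exceed limit,
// then moves everything to a temporary file. Hypothetical type.
type spillBuffer struct {
	limit int
	size  int64
	mem   bytes.Buffer
	file  *os.File
}

func (s *spillBuffer) Write(p []byte) (int, error) {
	if s.file == nil && s.mem.Len()+len(p) > s.limit {
		f, err := os.CreateTemp("", "uncompressed-*")
		if err != nil {
			return 0, err
		}
		if _, err := f.Write(s.mem.Bytes()); err != nil {
			f.Close()
			os.Remove(f.Name())
			return 0, err
		}
		s.mem.Reset()
		s.file = f
	}
	var n int
	var err error
	if s.file != nil {
		n, err = s.file.Write(p)
	} else {
		n, err = s.mem.Write(p)
	}
	s.size += int64(n)
	return n, err
}

// Spilled reports whether the data has been moved to a file.
func (s *spillBuffer) Spilled() bool { return s.file != nil }

// Reader returns a fresh reader over everything written so far. It can be
// called any number of times, which covers the "read the data twice" case.
func (s *spillBuffer) Reader() io.Reader {
	if s.file == nil {
		return bytes.NewReader(s.mem.Bytes())
	}
	// SectionReader uses ReadAt, so it does not disturb the write offset.
	return io.NewSectionReader(s.file, 0, s.size)
}

// ReadAll is a convenience wrapper around Reader.
func (s *spillBuffer) ReadAll() ([]byte, error) { return io.ReadAll(s.Reader()) }

// Close removes the temporary file, if one was created.
func (s *spillBuffer) Close() error {
	if s.file == nil {
		return nil
	}
	name := s.file.Name()
	err := s.file.Close()
	if rmErr := os.Remove(name); err == nil {
		err = rmErr
	}
	return err
}

func main() {
	buf := &spillBuffer{limit: 16} // tiny limit, for demonstration only
	defer buf.Close()

	buf.Write([]byte("small"))
	fmt.Println("spilled after small write:", buf.Spilled())

	buf.Write([]byte(" but now it grows past the limit"))
	fmt.Println("spilled after large write:", buf.Spilled())

	first, _ := buf.ReadAll()
	second, _ := buf.ReadAll()
	fmt.Println("two passes agree:", bytes.Equal(first, second))
}
```

For small inputs this avoids creating a file at all, while large inputs keep the current behaviour of two passes over a file on disk.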

The performance issue still exists (I mean, this issue is still valid).

AskAlexSharov commented 2 years ago

Related to https://github.com/ledgerwatch/erigon-lib/pull/302

AskAlexSharov commented 2 years ago

Related to https://github.com/ledgerwatch/erigon-lib/pull/651