Open AlexeyAkhunov opened 2 years ago
I just removed persistence of the dictionary file in https://github.com/ledgerwatch/erigon-lib/pull/283. But the performance issue still exists (I mean this issue is still valid).
Superstrings are now created immediately instead of being written to a file first: https://github.com/ledgerwatch/erigon-lib/pull/284. We still need to create the uncompressedFile, because we need to read the data twice (for reducedict). The sequential compressor does this as well. We need to add the same trick here as in ETL: create uncompressedFile only when the data is larger than etl.BufferOptimalSize.
The performance issue still exists (I mean this issue is still valid).
Both compressors can be found in the compress package. CompressorSequential has been written for optimal performance in a single thread. Compressor (formerly known as ParallelCompressor) is used for prototypes and experiments and therefore aims at utilising maximum resources to run prototypes faster. But maintaining two variants of the same code is error-prone. Aggregator (part of the Erigon2 prototype) has been switched to Compressor (the parallel compressor) and now it runs slower. My suspicion is that the parallel compressor is wasting a lot of time on dispatching work, on scheduling, and on the extra memory allocations needed to ensure thread safety. We would like to profile those areas and optimise them.
For more context, in production it is likely that we will run the compressor in a SINGLE background thread. So it may not even need to spawn goroutines in that mode. Parallel mode would only be used for experiments and prototypes.
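One way a single code path could serve both modes is sketched below, assuming the per-word work can be expressed as a pure function. The function processWords and its parameters are hypothetical and are not taken from the compress package. With one worker everything runs inline in the calling goroutine, so no goroutines, channels, or per-item dispatch are involved; with more workers each goroutine receives one contiguous chunk, which keeps dispatch cost to a single handoff per worker.

```go
package main

import "sync"

// processWords (hypothetical) applies fn to every word and returns the
// results in input order. fn must be safe for concurrent use when workers > 1.
func processWords(words [][]byte, workers int, fn func([]byte) []byte) [][]byte {
	out := make([][]byte, len(words))

	// Single-threaded mode: run inline, no goroutines or synchronisation.
	if workers <= 1 {
		for i, w := range words {
			out[i] = fn(w)
		}
		return out
	}

	// Parallel mode: one contiguous chunk per worker. Each goroutine writes
	// only to its own slice indices, so no locking is needed.
	chunk := (len(words) + workers - 1) / workers
	var wg sync.WaitGroup
	for start := 0; start < len(words); start += chunk {
		end := start + chunk
		if end > len(words) {
			end = len(words)
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			for i := start; i < end; i++ {
				out[i] = fn(words[i])
			}
		}(start, end)
	}
	wg.Wait()
	return out
}
```

Because results are written by index, the output order does not depend on the number of workers. Whether static chunking balances load well enough for real data is something profiling would have to confirm.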
Beyond the Erigon2 prototype, the compressor is currently used to package block header and block body snapshots. The requirement there (as well as in the Erigon2 prototype) is that optimisations do not change the resulting compressed file. Also, regardless of the number of workers, the resulting compressed file should be the same. However, if we find an optimisation that requires a change of the file format, we will definitely consider it!
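The "same file regardless of number of workers" requirement can be guarded with a small check like the sketch below. The compressFunc type is a hypothetical stand-in for whatever entry point produces the compressed bytes; it is not the actual signature used in erigon-lib. The check compares SHA-256 digests of the output for each worker count against the first one.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// compressFunc is a hypothetical stand-in for the compressor entry point:
// it takes the input and a worker count and returns the compressed bytes.
type compressFunc func(data []byte, workers int) ([]byte, error)

// checkDeterministic runs compress once per worker count and returns an
// error if any output digest differs from the digest of the first run.
func checkDeterministic(compress compressFunc, data []byte, workerCounts []int) error {
	var want [sha256.Size]byte
	for i, n := range workerCounts {
		out, err := compress(data, n)
		if err != nil {
			return fmt.Errorf("workers=%d: %w", n, err)
		}
		got := sha256.Sum256(out)
		if i == 0 {
			want = got
			continue
		}
		if got != want {
			return fmt.Errorf("workers=%d: output differs from workers=%d", n, workerCounts[0])
		}
	}
	return nil
}
```

Running the same check against a digest recorded before an optimisation would also cover the first requirement, that optimisations do not change the resulting file.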