linera-io / linera-protocol

Main repository for the Linera protocol
Apache License 2.0
112 stars 94 forks source link

Implement multiple connection points for the `storage_service` #2181

Closed MathieuDutSik closed 1 week ago

MathieuDutSik commented 1 week ago

Motivation

Detailed printing of latencies for the end-to-end tests revealed that for a sequence of read_value_bytes statements, the latency increased from a small value of less than a microsecond to a maximum of 26 microseconds. That is inadequate.

Proposal

The circumstance of this high latency is due to two factors:

One first attempt at correcting the problem is to build a connection directly when needed in each call. While this decreases the maximum runtime from 26 milliseconds to about 10 milliseconds, this increases the minimum runtime since a connection has to be established and this takes about 10 milliseconds. And his creates some timeout errors.

Therefore, the solution was to have a vector of connections that is extended when all connections are used. This works to get a good speed. The access to the list of connections takes 5 microseconds when present which is more than I would expect from the cost of locks but is consistent.

The construction with

    clients: Arc<Mutex<Vec<Arc<RwLock<StoreProcessorClient<Channel>>>>>>,

is necessary for the following reasons.

The position of the semaphore is changed as well.

In the end, what we get is that the calls take about 2-5 milliseconds at worst, which is better. But we should note that when doing an isolated operation the runtime is less than a millisecond. This is for example the case of find_keys_by_prefix operations.

More investigation is needed of timeouts as some excessive timeouts for write_batch are apparent in logs.

Test Plan

The CI is the goal to optimize. When run locally, there are improvements to the speed of the individual calls. It is possible that some timeouts are eliminated by it. However, when running locally we still have the error on the processInbox for the matching_engine.

Therefore, the worst case of this PR is that it does not change the CI behavior and the best case is that it reduces the occurrences of timeouts in the runs.

Release Plan

Not relevant.

Links