Closed: roman-khimov closed this 1 month ago
The tree service was used as a separate unit (no sharding, no other background processes), and network communications (fetching the networkmap, containers, etc.) were mocked. As a loader, I added some k6 scripts and extensions: https://github.com/nspcc-dev/neofs-node/tree/feat/tree-cmd https://github.com/nspcc-dev/xk6-neofs/tree/feat/tree-loader
The placement policy was REP 4, the "container" was mocked, and 4 distributed tree instances on 4 bare-metal machines with SSDs were used.
This tree service config was used (sync_interval was either 5 mins (2-3 times per test) or turned off, see below):

```yaml
tree:
  enabled: true
  cache_size: 15
  replication_worker_count: 1000
  replication_channel_capacity: 1000
  replication_timeout: 5s
  sync_interval: 5m # depends on tests
pilorama:
  max_batch_delay: 5ms
  max_batch_size: 100
  path: /tmp/tree.test
```
Every run had a "target" rate, meaning k6 tried to execute exactly that number of operations every second. The network deadline was 5s, so the maximum number of VUs (virtual users, a k6 term) was 5 times the "target" RPS; at any moment there could be from 0 to N*5 requests in progress, where N is the "target" RPS.
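For illustration, a k6 scenario matching this description would use the `constant-arrival-rate` executor with `maxVUs` sized at 5x the target rate; the helper below is a hypothetical sketch (not the actual xk6-neofs script), with names chosen for this example:

```javascript
// Sketch of a k6 scenario config: start exactly `targetRPS` iterations per
// second; with a 5s network deadline, at most targetRPS*5 requests can be
// in flight at once, so that bounds maxVUs.
function treeScenario(targetRPS, duration) {
  return {
    executor: 'constant-arrival-rate',
    rate: targetRPS,          // iterations started per timeUnit
    timeUnit: '1s',
    duration: duration,
    preAllocatedVUs: targetRPS,
    maxVUs: targetRPS * 5,    // 5s deadline => up to 5 seconds of requests in flight
  };
}

// e.g. a 2500 RPS run would allow up to 12500 concurrent requests
const scenario = treeScenario(2500, '30m');
console.log(scenario.maxVUs); // 12500
```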
Two types of load were present: the first was a "system" load (the ADD tree operation, 10 RPS in every test) that wrote to a "system" tree (multipart uploads, lock operations, etc. for the S3 GW), and the second was a "user" load (the ADD_BY_PATH tree operation, a variable number of operations in every test) that wrote to a "version" tree (regular object PUT for the S3 GW). Every tree node had 7 meta fields and a unique path.
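To make the per-iteration payload concrete, here is an illustrative sketch (not the actual xk6-neofs code; the field names are invented for this example) of generating a unique path plus 7 meta fields per operation:

```javascript
// Hypothetical payload generator: each iteration gets a unique path
// (derived from the VU id and iteration counter) and 7 meta fields,
// mirroring the test description above.
function makeTreePayload(vuID, iter) {
  const path = `load/${vuID}/${iter}`; // unique per VU + iteration
  const meta = {};
  for (let i = 0; i < 7; i++) {
    meta[`attr${i}`] = `val-${vuID}-${iter}-${i}`;
  }
  return { path, meta };
}

const p = makeTreePayload(1, 42);
console.log(p.path, Object.keys(p.meta).length); // load/1/42 7
```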
Also, there were two types of tests in terms of tree synchronization: background sync hurts the results so badly that I decided to turn it off, although in general that is not possible in real deployments, since this is the mechanism that restores missing operations in local logs after any graceful or unexpected downtime (https://github.com/nspcc-dev/neofs-node/pull/2161, https://github.com/nspcc-dev/neofs-node/pull/2165). Every "sync on" load ran for 15 mins, every "sync off" load for 30 mins.
| Avg latency (sync on / sync off) | Errors (sync on / sync off) |
| --- | --- |
| 13.70 ms | no |
| 13.39 ms / 13.43 ms | no / no |
| 2094.34 ms / 19.47 ms | yes, <1% / no |
| 1202.18 ms / 19.79 ms | yes, <1% / no |
| 4550.52 ms / 1081.33 ms | yes, 84% / no |
| 4082.70 ms / 1193.95 ms | yes, 67% / no |
Oh, version 2.2.0 of the reporter is broken. The numbers are correct but placed in the wrong fields... Will rerun.
UPD: done.
I have done 1h tests.

- 1000 RPS: Avg: 13.93ms, Errs: NO (summary1000_nosync_1h.html.pdf)
- 2500 RPS: Avg: 85.35ms, Errs: NO (summary2500_nosync_1h.html.pdf)
- NOTE: 55 minutes were OK and close to the 1000 RPS results, but in the last 5 minutes I saw degradation, so I would not call this a stable load (see p95). Avg: 536.93ms, Errs: NO
@roman-khimov, I saw

> push ops via a single node

Is it critical? I pushed requests through different nodes.
Try it with a single one. Multi-node is interesting as well, but it's a more complex scenario.
That is a little more problematic:
Avg: 4776.86ms, Errs: 91%
Seems like a dead end to me, nothing we can reuse from it.
**Is your feature request related to a problem? Please describe.**
I'm always frustrated when we don't know what to expect from some critical components of our system. Tree service is like that. We kinda know about #1734, but we don't know exact numbers.
**Describe the solution you'd like**
Create an environment for tree service testing (it can be separate from the node, or not). Measure single-node ops/s (typical AddByPath) and delays. Add more nodes (up to 4 of them), push ops via a single node, see how they spread and what the throughput and latency are. Repeat on some real hw.