Closed: MaxGelbakhiani closed this issue 7 months ago.
pprof analysis results:
alloc_space=82TB, inuse_space=39GB
alloc_space=156GB, inuse_space=12GB
So the weak spot is the object slicer; we should optimize it.
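As a side note on reading those numbers: pprof's alloc_space counts every allocation ever made, while inuse_space counts only the live heap, which is why 82 TB of alloc_space can coexist with 39 GB in use. A minimal sketch of the same churn pattern (the `churn` helper and the 1 MiB size are made up for illustration):

```go
package main

import (
	"fmt"
	"runtime"
)

var sink byte // prevents the compiler from eliding the allocations below

// churn allocates n short-lived 1 MiB buffers and reports whether the
// cumulative allocation counter now exceeds the live heap size -- the
// alloc_space vs inuse_space gap seen in the profiles above.
func churn(n int) bool {
	for i := 0; i < n; i++ {
		buf := make([]byte, 1<<20) // dropped right after use, like a per-put slicer buffer
		buf[0] = byte(i)
		sink += buf[0]
	}
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.TotalAlloc > m.HeapAlloc
}

func main() {
	fmt.Println(churn(1000)) // cumulative allocations dwarf live heap
}
```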
Kind of scary for a 1 KB object payload. I would say it is a mistake: such a memory load is hard to handle with any memory capacity if we require 64/128 MB per object put.
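One common mitigation for this kind of per-put allocation is to reuse a bounded buffer from a `sync.Pool` and stream the payload through it instead of buffering tens of megabytes per object. A hypothetical sketch, not the actual neofs-node slicer code (`putObject`, `bufPool`, and the 1 MiB chunk size are all made up here):

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool reuses one bounded buffer across object puts instead of
// allocating a fresh 64/128 MB buffer per request.
var bufPool = sync.Pool{
	New: func() any { return make([]byte, 1<<20) }, // 1 MiB, illustrative size
}

// putObject slices the payload through a pooled buffer, chunk by chunk,
// and returns the number of chunks produced.
func putObject(payload []byte) int {
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf)
	chunks := 0
	for off := 0; off < len(payload); off += len(buf) {
		end := off + len(buf)
		if end > len(payload) {
			end = len(payload)
		}
		copy(buf, payload[off:end]) // a real slicer would send this chunk out
		chunks++
	}
	return chunks
}

func main() {
	fmt.Println(putObject(make([]byte, 3<<20))) // 3 MiB payload -> 3 chunks
}
```

With this pattern the steady-state heap cost is one pooled buffer per concurrent put, regardless of how many objects pass through.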
@carpawell thanks for your support. I know about this problem and am currently working on a workaround for it.
@MaxGelbakhiani, can you please share the RPS (object puts per second, I mean) from your test run?
For this exact test run I don't have final results showing the rate and RPS, as the node failed. But they should be similar to these numbers, which I got from another test run with the same setup:
data_received............: 0 B 0 B/s
data_sent................: 571 MB 952 kB/s
iteration_duration.......: avg=107.56ms min=3.35µs med=93.16ms max=1.24s p(90)=175.32ms p(95)=211.38ms
iterations...............: 557700 929.300612/s
neofs_obj_put_duration...: avg=107.12ms min=25.89ms med=92.71ms max=1.24s p(90)=174.79ms p(95)=210.83ms
neofs_obj_put_total......: 557700 929.300612/s
vus......................: 100 min=100 max=100
vus_max..................: 100 min=100 max=100
@MaxGelbakhiani could you please test the same scenario on:
2af7692074615fb139a358bf439aa0981beaea7c
2d67e380a3dc536b1f5e44df6242357a85374af2
With the provided builds and the same test case with 1 KB object load, we have the following performance metrics:
data_received............: 0 B 0 B/s
data_sent................: 1.1 GB 1.8 MB/s
iteration_duration.......: avg=57.69ms min=4.62µs med=48.44ms max=842.1ms p(90)=91.5ms p(95)=110ms
iterations...............: 1039606 1732.480333/s
neofs_obj_put_duration...: avg=57.23ms min=11.64ms med=47.97ms max=841.65ms p(90)=91.01ms p(95)=109.51ms
neofs_obj_put_total......: 1039606 1732.480333/s
vus......................: 100 min=100 max=100
vus_max..................: 100 min=100 max=100
running (10m00.1s), 000/100 VUs, 1039606 complete and 0 interrupted iterations
write ✓ [ 100% ] 100 VUs 10m0s
No OOMs during 10 min load.
Profiles were captured every 5 seconds: 19_Dec_OOM_issue_2686_GRPC_1Kb_REP-3_Containers=50_Objects=0_Endpoints=1_Readers=0_Writers=100_Duration=10m.zip
If necessary, I can extend this test case beyond 10 minutes.
Ran a 5-hour test yesterday with the neofs-node@2d67e380a3d binary. The run completed successfully with the following results:
data_received............: 0 B 0 B/s
data_sent................: 22 GB 1.2 MB/s
iteration_duration.......: avg=83.94ms min=4.85µs med=67.41ms max=1m0s p(90)=125.11ms p(95)=162.09ms
iterations...............: 21437214 1190.947174/s
neofs_obj_put_duration...: avg=83.56ms min=11.44ms med=67.03ms max=52.92s p(90)=124.72ms p(95)=161.7ms
neofs_obj_put_fails......: 1 0.000056/s
neofs_obj_put_total......: 21437214 1190.947174/s
vus......................: 100 min=100 max=100
vus_max..................: 100 min=100 max=100
running (5h00m00.1s), 000/100 VUs, 21437214 complete and 0 interrupted iterations
write ✓ [ 100% ] 100 VUs 5h0m0s
No OOMs during 5 hours.
Profiles were captured every 2 minutes: 5_hours_profiles_part_1.zip 5_hours_profiles_part_2.zip
According to the results (profiles and nodes that stayed alive), we can consider the fix working and resolving this particular existing problem.
With the fix, we may also observe other weak spots:
> with fix

What fix?
with any
I mean, can you provide a PR link or at least a branch name?
Stick to the revisions; they are mentioned everywhere.
One force push and the revision is gone. I am not asking for myself (I have found everything I need); I am asking for the sake of issue history and reproducibility.
These tests were purely experimental; I don't recommend trying to reproduce them because we haven't recorded the cluster setup.
A PR is coming.
@MaxGelbakhiani can we please re-test https://github.com/nspcc-dev/neofs-node/issues/2686#issuecomment-1863225737 with the following revisions:
neofs-node@c9ddb541cb1ce0b4f72d6247f67dd2b2da0bd264
neofs-node@2e40379bd4905a2de88a86cda837e7ed7a11979e
with 1 KB objects. Memory profiles are still needed, of course.
Fixed by #2719?
AFAIK @MaxGelbakhiani tested #2719 on nodes with much less RAM and the OOM didn't happen, right?
A test run with #2719 was performed on nodes with 64 GB of RAM. The 20-minute run ended with a performance boost and no OOMs during the test.
OOM during single-node load test with 1 KB objects
Steps to Reproduce
As a result, I got an OOM: the neofs-node process was killed and restarted.
Environment
The setup contains 22 nodes with 1 TB of RAM.
Syslog and pprof output are attached. Profiles were gathered at 30-second intervals during the load. OOM allocs.zip OOM_journal.log