nspcc-dev / neofs-node

NeoFS is a decentralized distributed object storage integrated with the Neo blockchain
https://fs.neo.org
GNU General Public License v3.0

Got OOM on 2 neofs-storage nodes during k6 load performance run #2265

Closed. MaxGelbakhiani closed this issue 1 year ago.

MaxGelbakhiani commented 1 year ago

Error state

Nodes A and B in the cluster experienced OOM:

```
Feb 20 12:44:43 a kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/neofs-storage.service,task=neofs-node,pid=4300,uid=996
Feb 20 12:44:43 a kernel: Out of memory: Killed process 4300 (neofs-node) total-vm:68038980kB, anon-rss:64929772kB, file-rss:0kB, shmem-rss:0kB, UID:996 pgtables:128188kB oom_score_adj:0
Feb 20 12:44:43 a systemd[1]: neofs-storage.service: A process of this unit has been killed by the OOM killer.
```
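If it happens again, a heap profile taken while memory is growing should show what is holding it. A rough sketch, assuming the pprof server is enabled in the node config and listens on `localhost:6060` (both are assumptions; adjust to the actual config):

```bash
# Assumes pprof is enabled in the storage node config and serves on localhost:6060.
# Take a heap snapshot from the running neofs-node:
curl -s -o heap.pb.gz http://localhost:6060/debug/pprof/heap

# Show the top in-use allocations (needs a local Go toolchain):
go tool pprof -top heap.pb.gz
```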

The nodes finally got stuck with the following messages repeating in syslog:

```
Feb 21 10:47:10 a neofs-node[4763]: 2023-02-21T10:47:10.970+0100 warn engine/engine.go:142 could not select objects from shard {"shard_id": "GzmHQR6M4xiRUeuNKfajef", "error": "shard is in degraded mode"}
Feb 21 10:47:10 a neofs-node[4763]: 2023-02-21T10:47:10.970+0100 warn engine/engine.go:142 could not select objects from shard {"shard_id": "LZ1SHWnvk3vL1Y1PGVRhgJ", "error": "shard is in degraded mode"}
```
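For reference, this is roughly how the affected shards can be inspected and, once the underlying problem is resolved, switched back to read-write through the control service (a sketch; the control endpoint and wallet path below are placeholders for whatever the node config uses):

```bash
# Placeholders: substitute the node's actual control endpoint and wallet.
neofs-cli control shards list \
    --endpoint localhost:8091 --wallet /path/to/node-wallet.json

# Once the cause is fixed, try to move a degraded shard back to read-write:
neofs-cli control shards set-mode \
    --endpoint localhost:8091 --wallet /path/to/node-wallet.json \
    --id GzmHQR6M4xiRUeuNKfajef --mode read-write
```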

I also noticed "can't transfer assets" from neofs-ir:

```
Feb 21 11:17:10 a neofs-ir[5437]: 2023-02-21T11:17:10.759+0100 error innerring/settlement.go:240 basic income: could not send transfer {"sender": "Nge3U4wJpDGK2BWGfH5VcZ5PAbC6Ro7GHY", "recipient": "NL1H4he1ggRajT1ZVqn23zveMRaH9jDvfh", "amount (GASe-12)": "2", "details": "414702000000000000", "error": "could not invoke method (transferX): neofs error: chain/client: contract execution finished with state FAULT; exception: at instruction 1807 (THROW): unhandled exception: \"can't transfer assets\""}
```

Testbed details

The NeoFS cluster consists of 4 nodes. Before the test I had some data uploaded to NeoFS, but I cleared the container list via neofs-cli and created new containers. The nodes were rebooted before the test run.

Steps to Reproduce

The setup was as follows:

I ran the preset script to create 500 containers and preload objects for later use in the test.

100 MB objects:

```
./scenarios/preset/preset_grpc.py --size 102400 --containers 500 --out ~/load_test_mixed.json --endpoint a.load.nspcc.ru:8080,b.load.nspcc.ru:8080,c.load.nspcc.ru:8080,d.load.nspcc.ru:8080 --preload_obj 10 --wallet scenarios/files/wallet.json --config ~/neofs-cli_cfg.yaml
```

32 MB objects in the same containers:

```
date;./scenarios/preset/preset_grpc.py --size 32768 --containers 100 --update ~/load_test_mixed.json --out ~/load_test_mixed.json --endpoint a.load.nspcc.ru:8080,b.load.nspcc.ru:8080,c.load.nspcc.ru:8080,d.load.nspcc.ru:8080 --preload_obj 10 --wallet scenarios/files/wallet.json --config ~/neofs-cli_cfg.yaml
```

512 KB objects in the same containers:

```
date;./scenarios/preset/preset_grpc.py --size 512 --containers 350 --update ~/load_test_mixed.json --out ~/load_test_mixed.json --endpoint a.load.nspcc.ru:8080,b.load.nspcc.ru:8080,c.load.nspcc.ru:8080,d.load.nspcc.ru:8080 --preload_obj 10 --wallet scenarios/files/wallet.json --config ~/neofs-cli_cfg.yaml
```

During the preload I got the following errors:

```
> Upload objects for container 3V1WwdaAVzawp769GG91JVXnkbHmP4EBkmn6Nk7HoDUe
Object has not been uploaded: can't create API client: can't init SDK client: open gRPC connection: gRPC dial: context deadline exceeded
Object has not been uploaded: can't create API client: can't init SDK client: open gRPC connection: gRPC dial: context deadline exceeded
Object has not been uploaded: can't create API client: can't init SDK client: open gRPC connection: gRPC dial: context deadline exceeded
Object has not been uploaded: rpc error: client failure: context deadline exceeded
Object has not been uploaded: rpc error: client failure: context deadline exceeded
Object has not been uploaded: rpc error: client failure: context deadline exceeded
Object has not been uploaded: rpc error: client failure: context deadline exceeded
Object has not been uploaded: rpc error: client failure: context deadline exceeded
Object has not been uploaded: rpc error: client failure: context deadline exceeded
Object has not been uploaded: rpc error: client failure: context deadline exceeded
Upload objects for container 3V1WwdaAVzawp769GG91JVXnkbHmP4EBkmn6Nk7HoDUe: Completed Upload
```

And the xk6 run failed with the same error:

```
date;./k6 run -e DURATION=600 -e WRITERS=20 -e READERS=20 -e DELETERS=10 -e DELETE_AGE=10 -e REGISTRY_FILE=~/registry_mixed_load_test.json -e WRITE_OBJ_SIZE=512 -e GRPC_ENDPOINTS=a.load.nspcc.ru:8080,b.load.nspcc.ru:8080,c.load.nspcc.ru:8080,d.load.nspcc.ru:8080 -e PREGEN_JSON=~/load_test_mixed.json scenarios/grpc.js
Tue Feb 21 05:23:45 CET 2023

         /\      |‾‾| /‾‾/   /‾‾/
    /\  /  \     |  |/  /   /  /
   /  \/    \    |     (   /   ‾‾\
  /          \   |  |\  \ |  (‾)  |
 / __________ \  |__| \__\ \_____/ .io

  execution: local
     script: scenarios/grpc.js
     output: -

  scenarios: (100.00%) 3 scenarios, 50 max VUs, 10m5s max duration (incl. graceful stop):

ERRO[0007] GoError: dial endpoint: open gRPC connection: gRPC dial: context deadline exceeded
        delete at reflect.methodValueCall (native)
        read at file:///home/maxim/xk6-neofs/scenarios/grpc.js:20:141(145)  hint="error while initializing VU #32 (script exception)"

Init   [===================>------------------] 27/50 VUs initialized
delete [--------------------------------------]
read   [--------------------------------------]
write  [--------------------------------------]
maxim@l:~/xk6-neofs$
```

Your Environment

```
root@a ~ # neofs-node --version
NeoFS Storage node
Version: v0.35.0
GoVersion: go1.17.13
root@a ~ # neofs-ir --version
NeoFS Inner Ring node
Version: v0.35.0
GoVersion: go1.17.13
root@a ~ # neogo --version
NeoGo
Version: 0.100.1
GoVersion: go1.19.4
root@a ~ #
```

All four nodes have the same configuration, with neogo, neofs-node, and neofs-ir running on Linux a 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux.

The following archives are enclosed containing logs and node config:

```
├── a.load.zip
│   ├── a.load-config.yml
│   ├── dmesg-a.load.nspcc.ru.log
│   └── journalctl-current-boot-a.load.nspcc.ru.log
├── b.load.zip
│   ├── b.load-config.yml
│   ├── dmesg-b.load.nspcc.ru.log
│   └── journalctl-current-boot-b.load.nspcc.ru.log
├── c.load.zip
│   ├── c.load-config.yml
│   ├── dmesg-c.load.nspcc.ru.log
│   └── journalctl-current-boot-c.load.nspcc.ru.log
└── d.load.zip
    ├── d.load-config.yml
    ├── dmesg-d.load.nspcc.ru.log
    └── journalctl-current-boot-d.load.nspcc.ru.log
```

Links for download: a.load.zip, b.load.zip, c.load.zip, d.load.zip

MaxGelbakhiani commented 1 year ago

Yesterday I also tried to switch `shard_ro_error_threshold` from 30 to 0 (turning it off), but I can't say it changed the system behavior much in this case.
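For reference, a sketch of where that setting lives, assuming the stock config layout (the section name and the need for a restart are assumptions based on the example config; the attached *.load-config.yml files are authoritative):

```bash
# Assumption: the threshold is a top-level "storage" option in the node config, e.g.
#
#   storage:
#     shard_ro_error_threshold: 0   # 0 disables moving shards to read-only on errors
#
# and the storage service has to be restarted to pick up the change:
systemctl restart neofs-storage.service
```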

roman-khimov commented 1 year ago

How many replicas were used in this test? It looks a lot like #2300 to me.
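For reference, the replica count here is the REP value in the placement policy the containers were created with. A hypothetical example only, not necessarily the policy the preset script used in this test:

```bash
# Hypothetical illustration: REP 2 would keep two replicas of every object
# across the 4-node cluster; the real policy used by preset_grpc.py may differ.
neofs-cli container create \
    --rpc-endpoint a.load.nspcc.ru:8080 \
    --wallet scenarios/files/wallet.json \
    --policy 'REP 2 IN X CBF 1 SELECT 2 FROM * AS X' \
    --basic-acl public-read-write --await
```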

roman-khimov commented 1 year ago

I think we can no longer reproduce it, and we'll have a better memory profile soon.