Closed MaxGelbakhiani closed 1 year ago
Yesterday I also tried to switch shard_ro_error_threshold from 30 to 0 (turning it off), but I can't say it changes the system behavior much in this case.
How many replicas were used in this test? It looks a lot like #2300 to me.
I think we can no longer reproduce it, and we'll have a better memory profile soon.
Error state
Nodes A and B in the cluster experienced OOM:
Feb 20 12:44:43 a kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/neofs-storage.service,task=neofs-node,pid=4300,uid=996
Feb 20 12:44:43 a kernel: Out of memory: Killed process 4300 (neofs-node) total-vm:68038980kB, anon-rss:64929772kB, file-rss:0kB, shmem-rss:0kB, UID:996 pgtables:128188kB oom_score_adj:0
Feb 20 12:44:43 a systemd[1]: neofs-storage.service: A process of this unit has been killed by the OOM killer.
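To put the OOM-killer counters into more readable units, here is a minimal sketch that parses the "Killed process" line and converts the kB values to GiB. The helper name `parse_oom_kill` is hypothetical and only for illustration; the sample line is taken from the log above.

```python
import re

# Sample line copied from the kernel log above (trailing fields omitted).
OOM_LINE = (
    "Out of memory: Killed process 4300 (neofs-node) "
    "total-vm:68038980kB, anon-rss:64929772kB, file-rss:0kB, shmem-rss:0kB"
)


def parse_oom_kill(line):
    """Return pid, process name and memory counters (in GiB) from an OOM-killer line.

    Hypothetical helper for illustration; returns None if the line does not match.
    """
    head = re.search(r"Killed process (\d+) \(([^)]+)\)", line)
    if head is None:
        return None
    counters = {
        name: int(kb) / (1024 * 1024)  # kB -> GiB
        for name, kb in re.findall(r"([a-z-]+):(\d+)kB", line)
    }
    return {"pid": int(head.group(1)), "name": head.group(2), "gib": counters}


if __name__ == "__main__":
    info = parse_oom_kill(OOM_LINE)
    for name, gib in info["gib"].items():
        print(f"{name}: {gib:.2f} GiB")
```

For the line above this gives roughly 61.9 GiB of anonymous RSS for neofs-node at the moment it was killed.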
The node finally got stuck, with the following messages in syslog:
Feb 21 10:47:10 a neofs-node[4763]: 2023-02-21T10:47:10.970+0100 warn engine/engine.go:142 could not select objects from shard {"shard_id": "GzmHQR6M4xiRUeuNKfajef", "error": "shard is in degraded mode"}
Feb 21 10:47:10 a neofs-node[4763]: 2023-02-21T10:47:10.970+0100 warn engine/engine.go:142 could not select objects from shard {"shard_id": "LZ1SHWnvk3vL1Y1PGVRhgJ", "error": "shard is in degraded mode"}
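To see which shards are affected across a long journal, the warnings can be reduced to a list of distinct shard IDs. This is a rough sketch that assumes the message format shown above (a JSON object at the end of each line); `degraded_shards` is a hypothetical helper, not part of any NeoFS tooling.

```python
import json
import re

# Two sample lines copied from the syslog excerpt above.
LOG_LINES = [
    'Feb 21 10:47:10 a neofs-node[4763]: 2023-02-21T10:47:10.970+0100 warn '
    'engine/engine.go:142 could not select objects from shard '
    '{"shard_id": "GzmHQR6M4xiRUeuNKfajef", "error": "shard is in degraded mode"}',
    'Feb 21 10:47:10 a neofs-node[4763]: 2023-02-21T10:47:10.970+0100 warn '
    'engine/engine.go:142 could not select objects from shard '
    '{"shard_id": "LZ1SHWnvk3vL1Y1PGVRhgJ", "error": "shard is in degraded mode"}',
]


def degraded_shards(lines):
    """Return the sorted, de-duplicated shard IDs reported as degraded."""
    shards = set()
    for line in lines:
        match = re.search(r"could not select objects from shard\s+(\{.*\})\s*$", line)
        if match is None:
            continue  # not one of the warnings we are looking for
        fields = json.loads(match.group(1))
        if fields.get("error") == "shard is in degraded mode":
            shards.add(fields["shard_id"])
    return sorted(shards)


if __name__ == "__main__":
    for shard_id in degraded_shards(LOG_LINES):
        print(shard_id)
```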
I also noticed "can't transfer assets" from neofs-ir:
Feb 21 11:17:10 a neofs-ir[5437]: 2023-02-21T11:17:10.759+0100 error innerring/settlement.go:240 basic income: could not send transfer {"sender": "Nge3U4wJpDGK2BWGfH5VcZ5PAbC6Ro7GHY", "recipient": "NL1H4he1ggRajT1ZVqn23zveMRaH9jDvfh", "amount (GASe-12)": "2", "details": "414702000000000000", "error": "could not invoke method (transferX): neofs error: chain/client: contract execution finished with state FAULT; exception: at instruction 1807 (THROW): unhandled exception: \"can't transfer assets\""}
Testbed details
The NeoFS cluster consists of 4 nodes. Before the test I had some data uploaded to NeoFS, but I cleared the container list via neofs-cli and created new containers. The nodes were rebooted before the test run.
Steps to Reproduce
The setup was as follows:
I ran the preset script to create 500 containers and preload objects for later use in the test.
100 MB objects:
./scenarios/preset/preset_grpc.py --size 102400 --containers 500 --out ~/load_test_mixed.json --endpoint a.load.nspcc.ru:8080,b.load.nspcc.ru:8080,c.load.nspcc.ru:8080,d.load.nspcc.ru:8080 --preload_obj 10 --wallet scenarios/files/wallet.json --config ~/neofs-cli_cfg.yaml
32 MB objects in the same containers:
date;./scenarios/preset/preset_grpc.py --size 32768 --containers 100 --update ~/load_test_mixed.json --out ~/load_test_mixed.json --endpoint a.load.nspcc.ru:8080,b.load.nspcc.ru:8080,c.load.nspcc.ru:8080,d.load.nspcc.ru:8080 --preload_obj 10 --wallet scenarios/files/wallet.json --config ~/neofs-cli_cfg.yaml
512 KB objects in the same containers:
date;./scenarios/preset/preset_grpc.py --size 512 --containers 350 --update ~/load_test_mixed.json --out ~/load_test_mixed.json --endpoint a.load.nspcc.ru:8080,b.load.nspcc.ru:8080,c.load.nspcc.ru:8080,d.load.nspcc.ru:8080 --preload_obj 10 --wallet scenarios/files/wallet.json --config ~/neofs-cli_cfg.yaml
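For a sense of scale, the three preset runs above can be turned into an estimate of the total preloaded payload. This is a back-of-the-envelope sketch under explicit assumptions: `--size` is taken to be in KB (consistent with the 100 MB / 32 MB / 512 KB labels), `--preload_obj` is taken to be per container, and each run is assumed to upload to the number of containers given by `--containers`. Replication is not included, since the replica count is not stated in this report.

```python
KIB = 1024

# (--size in KB, --containers, --preload_obj) for the three preset runs above.
PRESET_RUNS = [
    (102400, 500, 10),
    (32768, 100, 10),
    (512, 350, 10),
]


def preload_bytes(runs):
    """Total payload in bytes, assuming --preload_obj objects per container."""
    return sum(
        size_kb * KIB * containers * objects
        for size_kb, containers, objects in runs
    )


if __name__ == "__main__":
    total = preload_bytes(PRESET_RUNS)
    print(f"{total / 1024**3:.2f} GiB before replication")
```

Under these assumptions the preload alone is on the order of 520 GiB of object payload, dominated by the 100 MB objects.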
During the preload I got the following errors: ` > Upload objects for container 3V1WwdaAVzawp769GG91JVXnkbHmP4EBkmn6Nk7HoDUe
And xk6 run failed with the same error exception: ` date;./k6 run -e DURATION=600 -e WRITERS=20 -e READERS=20 -e DELETERS=10 -e DELETE_AGE=10 -e REGISTRY_FILE=~/registry_mixed_load_test.json -e WRITE_OBJ_SIZE=512 -e GRPC_ENDPOINTS=a.load.nspcc.ru:8080,b.load.nspcc.ru:8080,c.load.nspcc.ru:8080,d.load.nspcc.ru:8080 -e PREGEN_JSON=~/load_test_mixed.json scenarios/grpc.js
Tue Feb 21 05:23:45 CET 2023
execution: local
script: scenarios/grpc.js
output: -
scenarios: (100.00%) 3 scenarios, 50 max VUs, 10m5s max duration (incl. graceful stop):
ERRO[0007] GoError: dial endpoint: open gRPC connection: gRPC dial: context deadline exceeded delete at reflect.methodValueCall (native) read at file:///home/maxim/xk6-neofs/scenarios/grpc.js:20:141(145) hint="error while initializing VU #32 (script exception)"
Init [===================>------------------] 27/50 VUs initialized
delete [--------------------------------------]
read [--------------------------------------]
write [--------------------------------------]
maxim@l:~/xk6-neofs$`
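Since the failure above is a dial timeout, a quick first check is whether each gRPC endpoint accepts TCP connections at all. The sketch below uses only the Python standard library; `is_reachable` is a hypothetical helper. A successful TCP connect does not prove the gRPC service is healthy; it only helps separate "port closed or node down" from other causes.

```python
import socket

# Endpoints taken from the GRPC_ENDPOINTS value in the k6 command above.
ENDPOINTS = [
    "a.load.nspcc.ru:8080",
    "b.load.nspcc.ru:8080",
    "c.load.nspcc.ru:8080",
    "d.load.nspcc.ru:8080",
]


def is_reachable(endpoint, timeout=3.0):
    """Return True if a TCP connection to 'host:port' succeeds within timeout."""
    host, _, port = endpoint.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False
```

For example, `[ep for ep in ENDPOINTS if not is_reachable(ep)]` would list the endpoints that do not accept connections from the load-generator host.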
Your Environment
root@a ~ # neofs-node --version
NeoFS Storage node
Version: v0.35.0
GoVersion: go1.17.13
root@a ~ # neofs-ir --version
NeoFS Inner Ring node
Version: v0.35.0
GoVersion: go1.17.13
root@a ~ # neogo --version
NeoGo
Version: 0.100.1
GoVersion: go1.19.4
root@a ~ #
All four nodes have the same configuration, with neogo, neofs-node and neofs-ir running on:
Linux a 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
The following archives are enclosed containing logs and node config:
├── a.load.zip
│   ├── a.load-config.yml
│   ├── dmesg-a.load.nspcc.ru.log
│   └── journalctl-current-boot-a.load.nspcc.ru.log
├── b.load.zip
│   ├── b.load-config.yml
│   ├── dmesg-b.load.nspcc.ru.log
│   └── journalctl-current-boot-b.load.nspcc.ru.log
├── c.load.zip
│   ├── c.load-config.yml
│   ├── dmesg-c.load.nspcc.ru.log
│   └── journalctl-current-boot-c.load.nspcc.ru.log
├── d.load.zip
│   ├── d.load-config.yml
│   ├── dmesg-d.load.nspcc.ru.log
│   └── journalctl-current-boot-d.load.nspcc.ru.log
Links for download: a.load.zip
b.load.zip
c.load.zip
d.load.zip