roman-khimov opened this issue 1 month ago
We have had nonblocking pools for years already: https://github.com/nspcc-dev/neofs-node/blob/7365c7c5e6afb4f4a29fd75cfc17dfb6bb971951/pkg/innerring/processors/netmap/processor.go#L125
The same as in https://github.com/nspcc-dev/neofs-node/issues/2871, I have no idea why it is done this way. Skipping operations should only happen in places where it is safe to lose some info/calls, and I cannot even recall such places for us, but still:
▶ grep -rni "ants.WithNonblocking" *
cmd/neofs-node/config.go:742: optNonBlocking := ants.WithNonblocking(true)
pkg/morph/event/listener.go:592: pool, err := ants.NewPool(poolCap, ants.WithNonblocking(true))
pkg/local_object_storage/engine/shards.go:120: pool, err := ants.NewPool(int(e.shardPoolSize), ants.WithNonblocking(true))
pkg/local_object_storage/engine/engine_test.go:87: pool, err := ants.NewPool(10, ants.WithNonblocking(true))
pkg/innerring/processors/netmap/processor.go:125: pool, err := ants.NewPool(p.PoolSize, ants.WithNonblocking(true))
pkg/innerring/processors/settlement/processor.go:60: pool, err := ants.NewPool(o.poolSize, ants.WithNonblocking(true))
pkg/innerring/processors/audit/processor.go:103: pool, err := ants.NewPool(ProcessorPoolSize, ants.WithNonblocking(true))
pkg/innerring/processors/balance/processor.go:64: pool, err := ants.NewPool(p.PoolSize, ants.WithNonblocking(true))
pkg/innerring/processors/reputation/processor.go:68: pool, err := ants.NewPool(p.PoolSize, ants.WithNonblocking(true))
pkg/innerring/processors/container/processor.go:78: pool, err := ants.NewPool(p.PoolSize, ants.WithNonblocking(true))
pkg/innerring/processors/alphabet/processor.go:70: pool, err := ants.NewPool(p.PoolSize, ants.WithNonblocking(true))
pkg/innerring/processors/governance/processor.go:109: pool, err := ants.NewPool(ProcessorPoolSize, ants.WithNonblocking(true))
pkg/innerring/processors/neofs/processor.go:102: pool, err := ants.NewPool(p.PoolSize, ants.WithNonblocking(true))
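For context, here is a minimal standalone sketch (assuming the github.com/panjf2000/ants/v2 API; it is not code from this repo) of what `WithNonblocking(true)` means in practice: `Submit` fails with `ants.ErrPoolOverload` the moment every worker is busy, so unless the caller retries, the event is simply lost, while a pool without that option just waits for a free worker.

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"github.com/panjf2000/ants/v2"
)

func main() {
	// Nonblocking pool, the way the processors create it now.
	nb, _ := ants.NewPool(1, ants.WithNonblocking(true))
	defer nb.Release()

	_ = nb.Submit(func() { time.Sleep(time.Second) }) // occupies the only worker
	if err := nb.Submit(func() {}); errors.Is(err, ants.ErrPoolOverload) {
		// This is the dropped event: the caller gets an error right away.
		fmt.Println("second task rejected:", err)
	}

	// Default (blocking) pool: Submit waits until a worker frees up.
	bl, _ := ants.NewPool(1)
	defer bl.Release()

	_ = bl.Submit(func() { time.Sleep(time.Second) })
	start := time.Now()
	_ = bl.Submit(func() {}) // blocks for roughly a second, nothing is lost
	fmt.Println("second task accepted after", time.Since(start))
}
```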
The only real problem I remember is the fear of blocking the neo-go client: handling many notifications requires additional RPC requests, so if we get blocked while making a new RPC, we cannot finish handling the notification.
Is your feature request related to a problem? Please describe.
I'm always frustrated when nodes can't enter the netmap for no real reason.
The way pools are configured for various events, this can happen easily. N nodes simultaneously trying to enter the netmap (or refreshing their presence!) can create a transaction spike that just can't be processed currently, even though it's a small number of events (like a hundred).
Describe the solution you'd like
Queues, blocking, or even dropping the pool completely, since it doesn't make much sense to me in this context.
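As one possible shape of the "queue" option, here is a hypothetical sketch (names like `eventQueue` and the sizes are illustrative, not neofs-node code): events are buffered in a channel and handled in order by a fixed set of workers, so the listener goroutine only does a non-blocking channel send and is never stuck behind a handler that is waiting on its own RPC calls; drops happen only on genuine queue overflow rather than "all workers momentarily busy".

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type Event struct{ Name string }

type Handler func(Event)

// eventQueue buffers events and feeds them to a fixed set of workers.
type eventQueue struct {
	ch chan Event
	wg sync.WaitGroup
}

func newEventQueue(workers, capacity int, h Handler) *eventQueue {
	q := &eventQueue{ch: make(chan Event, capacity)}
	q.wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer q.wg.Done()
			for e := range q.ch {
				h(e) // the handler may block on its own RPC calls; the listener is unaffected
			}
		}()
	}
	return q
}

// Push never blocks the caller (the event listener); it only reports a drop
// when the queue itself overflows.
func (q *eventQueue) Push(e Event) bool {
	select {
	case q.ch <- e:
		return true
	default:
		return false
	}
}

// Close stops accepting events and waits for the workers to drain the queue.
func (q *eventQueue) Close() {
	close(q.ch)
	q.wg.Wait()
}

func main() {
	q := newEventQueue(2, 100, func(e Event) {
		time.Sleep(10 * time.Millisecond) // stand-in for the extra RPC work
		fmt.Println("processed", e.Name)
	})
	for i := 0; i < 10; i++ {
		q.Push(Event{Name: fmt.Sprintf("AddPeer#%d", i)})
	}
	q.Close()
}
```

Simply creating the existing pools without `WithNonblocking(true)` would give similar blocking behavior, with the caveat from above: nothing must block the goroutine that the neo-go client needs to make progress on its own RPC responses.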
Describe alternatives you've considered
Raising the worker count is not a real solution.
Additional context
A test with 100+ nodes.