nspcc-dev / neofs-node

NeoFS is a decentralized distributed object storage integrated with the Neo blockchain
https://fs.neo.org
GNU General Public License v3.0

Rework pools for contract events #2961

Open roman-khimov opened 1 month ago

roman-khimov commented 1 month ago

Is your feature request related to a problem? Please describe.

I'm always frustrated when nodes can't enter the netmap for no real reason.

2024-10-03T15:33:41.721Z    warn    netmap/handlers.go:56   netmap worker pool drained  {"capacity": 10}

The way pools are configured for various events, this can happen easily. N nodes simultaneously trying to enter the netmap (or refreshing their presence!) can create a transaction spike that simply can't be processed currently, even though the number of events is small (around a hundred).
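A minimal sketch (assuming github.com/panjf2000/ants/v2, not actual NeoFS code) of why a spike gets dropped: with ants.WithNonblocking(true), Submit fails immediately with ants.ErrPoolOverload once all workers are busy, so the event is lost rather than queued.

package main

import (
	"fmt"
	"time"

	"github.com/panjf2000/ants/v2"
)

func main() {
	// Same shape as the node's pools: capacity 10, nonblocking.
	pool, err := ants.NewPool(10, ants.WithNonblocking(true))
	if err != nil {
		panic(err)
	}
	defer pool.Release()

	dropped := 0
	for i := 0; i < 100; i++ { // a spike of ~100 near-simultaneous events
		err := pool.Submit(func() {
			time.Sleep(50 * time.Millisecond) // simulate handling one notification
		})
		if err != nil {
			// Nonblocking Submit returns ants.ErrPoolOverload instead of waiting,
			// so most of the 100 events are silently dropped here.
			dropped++
		}
	}
	fmt.Println("dropped events:", dropped)
}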

Describe the solution you'd like

Queues, blocking, or even dropping the pool completely, since it doesn't make much sense to me in this context.
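A minimal sketch of the blocking variant, assuming ants stays in place (capacities here are illustrative, not the node's defaults): without ants.WithNonblocking, Submit parks the caller until a worker is free, so the same spike is processed in full; ants.WithMaxBlockingTasks can bound how many submitters are allowed to wait.

package main

import (
	"log"
	"time"

	"github.com/panjf2000/ants/v2"
)

func main() {
	// Blocking pool: at most 1000 callers may wait in Submit; nothing is dropped
	// until that limit is hit.
	pool, err := ants.NewPool(10, ants.WithMaxBlockingTasks(1000))
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Release()

	for i := 0; i < 100; i++ {
		// Submit blocks until a worker frees up instead of returning an error,
		// so the whole spike of 100 events gets handled.
		if err := pool.Submit(func() { time.Sleep(50 * time.Millisecond) }); err != nil {
			log.Println("submit failed:", err) // only happens past the blocking limit
		}
	}
}

The trade-off is that a blocking Submit propagates back pressure to whoever delivers the events, which is the concern raised further down in this thread.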

Describe alternatives you've considered

Raising the worker count is not a real solution.

Additional context

A test with 100+ nodes.

carpawell commented 1 month ago

We have had nonblocking pools for years already: https://github.com/nspcc-dev/neofs-node/blob/7365c7c5e6afb4f4a29fd75cfc17dfb6bb971951/pkg/innerring/processors/netmap/processor.go#L125

The same as in https://github.com/nspcc-dev/neofs-node/issues/2871, I have no idea why it is done this way. Skipping operations should only be done in places where it is safe to lose some info/calls. I cannot even remember such places for us, but still:

▶ grep -rni "ants.WithNonblocking" *
cmd/neofs-node/config.go:742:   optNonBlocking := ants.WithNonblocking(true)
pkg/morph/event/listener.go:592:        pool, err := ants.NewPool(poolCap, ants.WithNonblocking(true))
pkg/local_object_storage/engine/shards.go:120:  pool, err := ants.NewPool(int(e.shardPoolSize), ants.WithNonblocking(true))
pkg/local_object_storage/engine/engine_test.go:87:              pool, err := ants.NewPool(10, ants.WithNonblocking(true))
pkg/innerring/processors/netmap/processor.go:125:       pool, err := ants.NewPool(p.PoolSize, ants.WithNonblocking(true))
pkg/innerring/processors/settlement/processor.go:60:    pool, err := ants.NewPool(o.poolSize, ants.WithNonblocking(true))
pkg/innerring/processors/audit/processor.go:103:        pool, err := ants.NewPool(ProcessorPoolSize, ants.WithNonblocking(true))
pkg/innerring/processors/balance/processor.go:64:       pool, err := ants.NewPool(p.PoolSize, ants.WithNonblocking(true))
pkg/innerring/processors/reputation/processor.go:68:    pool, err := ants.NewPool(p.PoolSize, ants.WithNonblocking(true))
pkg/innerring/processors/container/processor.go:78:     pool, err := ants.NewPool(p.PoolSize, ants.WithNonblocking(true))
pkg/innerring/processors/alphabet/processor.go:70:      pool, err := ants.NewPool(p.PoolSize, ants.WithNonblocking(true))
pkg/innerring/processors/governance/processor.go:109:   pool, err := ants.NewPool(ProcessorPoolSize, ants.WithNonblocking(true))
pkg/innerring/processors/neofs/processor.go:102:        pool, err := ants.NewPool(p.PoolSize, ants.WithNonblocking(true))

The only real problem I remember is the fear of blocking the neo-go client: many notification handlers require additional RPC requests, so if we are blocked on making a new RPC, we cannot finish handling the notification.
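One way to address that fear while still not losing events, as a minimal sketch rather than the NeoFS implementation (Event, handle and the 1024 queue size are placeholders): put notifications into a bounded channel that a separate goroutine drains, so the listener returns immediately and any RPC calls made during handling block only the consumer.

package main

import (
	"errors"
	"log"
	"time"
)

// Event is a stand-in for a parsed contract notification.
type Event struct{ Name string }

type dispatcher struct {
	queue chan Event
}

// newDispatcher starts a single consumer goroutine; it may block on extra
// neo-go RPC requests without ever stalling the listener.
func newDispatcher(queueSize int, handle func(Event)) *dispatcher {
	d := &dispatcher{queue: make(chan Event, queueSize)}
	go func() {
		for ev := range d.queue {
			handle(ev)
		}
	}()
	return d
}

// Enqueue is what the listener calls: it returns immediately and only fails
// when the bounded queue itself overflows, which is easy to log and alarm on.
func (d *dispatcher) Enqueue(ev Event) error {
	select {
	case d.queue <- ev:
		return nil
	default:
		return errors.New("event queue overflow")
	}
}

func main() {
	d := newDispatcher(1024, func(ev Event) { log.Println("handled", ev.Name) })
	_ = d.Enqueue(Event{Name: "AddPeer"})
	time.Sleep(100 * time.Millisecond) // let the consumer run in this toy example
}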