nspcc-dev / neofs-node

NeoFS is a decentralized distributed object storage integrated with the Neo blockchain
https://fs.neo.org
GNU General Public License v3.0

API clients cache audit #2701

Open carpawell opened 8 months ago

Is your feature request related to a problem? Please describe.

I'm always frustrated when we need to figure out why some unexpected replication (or network communication in general) has started in our networks. To support failover scenarios, we added a reconnection interval that stops a node from reconnecting for the next 30s. That is, if a connection fault happens, the node stops communicating with the remote node and treats its objects as unavailable. Moreover, it immediately drops the connection, so other operations running through the underlying shared client are not allowed to finish.

Describe the solution you'd like

I do not have a ready-to-go solution but have some thoughts to try:

  1. allow more reconnection attempts before doing a huge communication stop (seconds of unavailability)
  2. try to measure how important failover scenarios are for us now (we have no load tests and no numbers, and the current solution is just a legacy built on the assumption that "a dead node may drop the performance more than you want")
  3. refactor that damn dynamic thing that relies on the interface cast; I believe there should not be such a fix: https://github.com/nspcc-dev/neofs-node/pull/2249
  4. use really big timeouts for background operations, like 15s or more. If a node cannot handle replication requests for 15s, that is a good reason to choose another node and consider the storage node dead; but if we were dialing for something like 10s (for whatever reason), that is OK for replication, which is just a background optimization
  5. ...

Describe alternatives you've considered

Caching and reconnection intervals may not be the cause of the network communication problems. I cannot find any proof yet, but debugging is in progress.

Additional context

https://github.com/nspcc-dev/neofs-node/pull/2694#discussion_r1433853216