orientechnologies / orientdb

OrientDB is the most versatile DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models in one Multi-Model product. OrientDB can run distributed (Multi-Master), supports SQL, ACID Transactions, Full-Text indexing and Reactive Queries.
https://orientdb.dev
Apache License 2.0

Distributed database messaging hangs when storage in INTERNAL_ERROR state #9815

Closed by timw 2 years ago

timw commented 2 years ago

OrientDB Version: 3.1.17

When the storage in a distributed database is placed into an internal error state, any subsequent mutation request causes message processing in ODistributedDatabaseImpl to hang (i.e. be suspended indefinitely) for messages from the sending node.

We have run into at least 2 scenarios that trigger this behaviour:

Actual behavior

The steps required to hang the messaging are:

These steps occur on the first failure and also on subsequent failures (which throw an exception wrapping the original failure). In my testing, the symptoms of the hang become visible after the second mutation attempt (i.e. after the first one has failed), but I haven't debugged why they don't appear after the first.

On the sending side, the message (and every subsequent message) fails with timeout errors (e.g. gossip timeouts), and the failed node must be restarted to resolve the issue.

Steps to reproduce

I can reproduce this 100% with an instrumented build that simulates an unexpected exception during a storage operation (e.g. the AlreadyClosedException reported in #9814), controlling activation on one node in the distributed deployment using environment variables.

Expected behavior

There are a variety of ways the primary symptoms could be fixed, but we wanted to discuss with the maintainers before implementing one or the other.

We could simply check whether the storage is in an internal error state at the start of any installDatabase call. This would fix the immediate problem, but it wouldn't improve the overall situation, and it might miss other cases where unexpected exceptions are thrown during an installDatabase call. Alternatively, we could catch OStorageException and resume message handling (with similar results to the above), or simply resume message handling in a finally block in installDatabase (more thorough, but it's unclear from the code whether that's safe/sensible).
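To make the "resume message handling in a finally" option concrete, here is a minimal sketch of the pattern. It is not OrientDB code: `InstallSketch`, `MessageProcessor`, `StorageStatus`, `InstallStep` and this `installDatabase` signature are all hypothetical stand-ins, and it assumes that the install path suspends message processing before touching the storage. It only shows that an unexpected exception from a storage in an error state no longer leaves processing suspended.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class InstallSketch {

    /** Hypothetical stand-in for the storage status; not OrientDB's real type. */
    public enum StorageStatus { OPEN, INTERNAL_ERROR }

    /** Hypothetical stand-in for the per-database message processing. */
    public static final class MessageProcessor {
        private final AtomicBoolean suspended = new AtomicBoolean(false);
        public void suspend() { suspended.set(true); }
        public void resume() { suspended.set(false); }
        public boolean isSuspended() { return suspended.get(); }
    }

    /** Hypothetical install work that may throw unexpectedly. */
    public interface InstallStep {
        void run(StorageStatus status) throws Exception;
    }

    /**
     * Suspends message processing, runs the install step, and always resumes
     * processing, even if the step throws. Returns true if the step succeeded.
     */
    public static boolean installDatabase(
            MessageProcessor processor, StorageStatus status, InstallStep step) {
        processor.suspend();
        try {
            step.run(status);
            return true;
        } catch (Exception e) {
            // The failure is reported to the caller as 'false' in this sketch;
            // real code would log it and decide how to recover.
            return false;
        } finally {
            // Runs on success and on failure, so the processor is never left suspended.
            processor.resume();
        }
    }

    public static void main(String[] args) {
        MessageProcessor processor = new MessageProcessor();
        // Simulated step: fails when the storage is in the error state.
        InstallStep step = status -> {
            if (status == StorageStatus.INTERNAL_ERROR) {
                throw new IllegalStateException("simulated storage failure");
            }
        };
        boolean ok = installDatabase(processor, StorageStatus.INTERNAL_ERROR, step);
        System.out.println("install ok: " + ok + ", suspended: " + processor.isSuspended());
    }
}
```

Whether unconditionally resuming is actually safe in the real installDatabase is exactly the open question raised above; the sketch only illustrates the control flow.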

A broader question is whether the storage moving to an INTERNAL_ERROR state should take the entire node offline in a distributed database.

tglman commented 2 years ago

Hi,

A storage going into INTERNAL_ERROR while running in a distributed setup should trigger a full sync, which is more or less what is happening. And yes, the error should be ignored/skipped during the install phase: the storage should be closed, removed, and replaced by another one coming from another node.

Regards

tglman commented 2 years ago

Hi,

This should be fixed and released in both 3.1.x series and 3.2.x series.

Regards