Distributed database messaging hangs when storage is in INTERNAL_ERROR state

timw commented 2 years ago

OrientDB Version: 3.1.17

When the storage in a distributed database is placed into an internal error state, any subsequent mutation request will cause the message processing in ODistributedDatabaseImpl to hang (i.e. be indefinitely suspended) for processing messages from the sending node.

We have run into at least 2 scenarios that trigger this behaviour:

- When a Lucene full text index is corrupted by an unclean server shutdown (storage goes to INTERNAL_ERROR on startup)
- When an unexpected exception is thrown during database operations (e.g. an index engine update) that causes the storage to go to INTERNAL_ERROR state.

Actual behavior

The steps required to hang the messaging are:

- the 2pc commit is begun in ODatabaseDocumentDistributed#commit2pc
- at some point, an unexpected exception is thrown in the storage/indexing, which as a side-effect causes the storage to be set into the INTERNAL_ERROR state
- this exception is caught and an installDatabase invocation is scheduled
- at some point later ODistributedAbstractPlugin#installDatabase is invoked, and suspends message processing by calling ODistributedDatabaseImpl#suspend
  - at some point the storage is checked to see if it is in INTERNAL_ERROR state (OAbstractPaginatedStorage#checkErrorState)
  - the storage is in an error state, and so throws an OStorageException
  - installDatabase fails to handle the exception, and fails to resume message processing
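
To make the failure mode concrete, here is a minimal sketch of the pattern described in the steps above. It does not use any OrientDB classes: `MessageGate`, `ErrorStateStorage` and `LeakyInstaller` are hypothetical stand-ins for ODistributedDatabaseImpl, the storage, and installDatabase respectively, and `IllegalStateException` stands in for OStorageException. It only illustrates how an exception thrown between suspend and resume leaves message processing suspended.

```java
// Hypothetical stand-ins only; none of these classes are OrientDB APIs.
import java.util.concurrent.atomic.AtomicBoolean;

/** Plays the role of the per-database message processor. */
class MessageGate {
    private final AtomicBoolean suspended = new AtomicBoolean(false);

    void suspend() { suspended.set(true); }
    void resume() { suspended.set(false); }
    boolean isSuspended() { return suspended.get(); }
}

/** Plays the role of a storage that may be in the INTERNAL_ERROR state. */
class ErrorStateStorage {
    private final boolean inInternalError;

    ErrorStateStorage(boolean inInternalError) {
        this.inInternalError = inInternalError;
    }

    /** Throws if the storage is in the error state (stand-in for checkErrorState). */
    void checkErrorState() {
        if (inInternalError) {
            throw new IllegalStateException("storage is in INTERNAL_ERROR state");
        }
    }
}

/** Plays the role of installDatabase, with the suspected flaw. */
class LeakyInstaller {
    static void installDatabase(MessageGate gate, ErrorStateStorage storage) {
        gate.suspend();
        // If this throws, the resume() call below is never reached.
        storage.checkErrorState();
        gate.resume();
    }
}
```

Calling `LeakyInstaller.installDatabase(gate, new ErrorStateStorage(true))` throws, and `gate.isSuspended()` stays true afterwards, which corresponds to the hang seen by the sending node.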

These steps occur on the first failure, and also on subsequent failures (which throw an exception wrapping the original failure). In my testing, the symptoms of the hang are visible after the second mutation attempt (after the first one failed), but I haven't debugged to see why they don't occur after the first.

On the sending side, the message (and all subsequent messages) fail with timeout errors (e.g. gossip timeouts) and the failed node must be restarted to resolve the issue.

Steps to reproduce

I can reproduce this 100% with an instrumented build that simulates an unexpected exception during a storage operation (e.g. the AlreadyClosedException reported in #9814), controlling activation on one node in the distributed deployment using environment variables.
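
For readers who want to build something similar, the following is a rough sketch of environment-variable-controlled fault injection. It is not the instrumented build mentioned above: the class `FaultInjector`, the variable name `SIMULATE_STORAGE_FAILURE`, and the use of `IllegalStateException` as the simulated failure are all assumptions made for illustration. The environment lookup is passed in as a function so the behaviour can be exercised without changing the real process environment.

```java
import java.util.function.Function;

/** Hypothetical fault injector; not part of OrientDB. */
class FaultInjector {
    // Assumed variable name, chosen for this sketch only.
    static final String FLAG = "SIMULATE_STORAGE_FAILURE";

    private final Function<String, String> env;

    FaultInjector(Function<String, String> env) {
        this.env = env;
    }

    /** Reads the flag from the real process environment. */
    static FaultInjector fromProcessEnvironment() {
        return new FaultInjector(System::getenv);
    }

    /** True only when the flag is set to "true" (case-insensitive). */
    boolean isActive() {
        return "true".equalsIgnoreCase(env.apply(FLAG));
    }

    /** Throws a simulated failure when injection is active; otherwise does nothing. */
    void maybeFail(String operation) {
        if (isActive()) {
            throw new IllegalStateException("simulated failure during " + operation);
        }
    }
}
```

A call such as `FaultInjector.fromProcessEnvironment().maybeFail("index engine update")` would be placed at the point in the storage code where the failure should be simulated, with the variable set on only the node that should fail.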

Expected behavior

There are a variety of ways the primary symptoms could be fixed, but we wanted to discuss with the maintainers before implementing one or the other.

We could simply check if the storage is in an internal error state at the start of any installDatabase call. This would fix the immediate problem, but wouldn't improve the overall situation, and might miss other cases where unexpected exceptions can be thrown during an installDatabase call. Similarly, we could handle OStorageException and resume message handling (similar results to above), or simply resume message handling in a finally block in installDatabase (more thorough, but it's unclear from the code if that's safe/sensible).
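
As an illustration of the last option (resuming in a finally block), here is a minimal sketch. The classes `ResumableGate`, `FailingStorage` and `GuardedInstaller` are hypothetical stand-ins rather than OrientDB code, and `IllegalStateException` stands in for OStorageException. Whether unconditionally resuming is actually safe in the real installDatabase is the open question raised above; the sketch only shows the control flow.

```java
/** Hypothetical stand-in for the message processor; not an OrientDB class. */
class ResumableGate {
    private boolean suspended;

    synchronized void suspend() { suspended = true; }
    synchronized void resume() { suspended = false; }
    synchronized boolean isSuspended() { return suspended; }
}

/** Hypothetical stand-in for a storage whose error-state check may throw. */
class FailingStorage {
    private final boolean inInternalError;

    FailingStorage(boolean inInternalError) {
        this.inInternalError = inInternalError;
    }

    void checkErrorState() {
        if (inInternalError) {
            throw new IllegalStateException("storage is in INTERNAL_ERROR state");
        }
    }
}

/** Sketch of an install routine that always resumes message processing. */
class GuardedInstaller {
    static void installDatabase(ResumableGate gate, FailingStorage storage) {
        gate.suspend();
        try {
            // Any exception thrown here still propagates to the caller...
            storage.checkErrorState();
            // (the real install work would happen here)
        } finally {
            // ...but message processing is resumed on every exit path.
            gate.resume();
        }
    }
}
```

With this shape the caller still sees the exception, but `gate.isSuspended()` is false afterwards, so later messages from the sending node would not be blocked by a leftover suspension.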

A broader question is whether the storage moving to an INTERNAL_ERROR state should take the entire node offline in a distributed database.

tglman commented 2 years ago

Hi,

A storage going into INTERNAL_ERROR when running in a distributed setup should trigger a full sync, which is more or less what's happening. And yes, the error should be ignored/skipped during the install phase: the storage should be closed, removed, and replaced by another one coming from another node.

Regards

tglman commented 2 years ago

Hi,

This should be fixed and released in both 3.1.x series and 3.2.x series.

Regards
