Closed by timw 2 years ago
Hi,
A storage going into `INTERNAL_ERROR` while running in a distributed setup should trigger a full sync, which is more or less what's happening. And yes, the error should be ignored/skipped during the install phase: the storage should be closed, removed, and replaced by another one coming from another node.
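For concreteness, this is roughly the intended shape of that install-phase handling (every name below is an illustrative stand-in, not an OrientDB API):

```java
// Rough sketch of the intended install-phase handling described above;
// all helper names are illustrative stand-ins, not OrientDB APIs.
abstract class InstallPhaseSketch {

  void installDatabaseFromCluster(String databaseName) {
    if (localStorageIsInInternalError(databaseName)) {
      // The INTERNAL_ERROR is ignored/skipped during the install phase:
      // the broken storage is closed and removed ...
      closeAndRemoveLocalStorage(databaseName);
    }
    // ... and replaced by a full copy coming from another node (full sync).
    requestFullSyncFromAnotherNode(databaseName);
  }

  abstract boolean localStorageIsInInternalError(String databaseName);
  abstract void closeAndRemoveLocalStorage(String databaseName);
  abstract void requestFullSyncFromAnotherNode(String databaseName);
}
```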
Regards
Hi,
This should be fixed and released in both the 3.1.x and 3.2.x series.
Regards
OrientDB Version: 3.1.17
When the storage in a distributed database is placed into an internal error state, any subsequent mutation request will cause the message processing in `ODistributedDatabaseImpl` to hang (i.e. be indefinitely suspended) for messages from the sending node. We have run into at least 2 scenarios that trigger this behaviour:

- the storage being in `INTERNAL_ERROR` on startup
- an unexpected exception during a storage operation placing the storage into the `INTERNAL_ERROR` state

Actual behavior
The steps required to hang the messaging are:

1. A mutation fails in `ODatabaseDocumentDistributed#commit2pc` with the storage in the `INTERNAL_ERROR` state.
2. An `installDatabase` invocation is scheduled.
3. `ODistributedAbstractPlugin#installDatabase` is invoked, and suspends message processing by calling `ODistributedDatabaseImpl#suspend`.
4. Because the storage is in the `INTERNAL_ERROR` state (`OAbstractPaginatedStorage#checkErrorState`), an `OStorageException` is thrown.
5. `installDatabase` fails to handle the exception, and fails to resume message processing.

These steps occur on the first failure, but also on subsequent failures (which throw an exception wrapping the original failure). In my testing, the symptoms of the hang are visible after the second mutation attempt (after the first one failed), but I haven't debugged to see why they don't occur after the first.
On the sending side, the message (and all subsequent messages) fails with timeout errors (e.g. gossip timeouts), and the failed node must be restarted to resolve the issue.
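The steps above reduce to a suspend call with no guaranteed counterpart once the install throws. A minimal sketch of that shape, assuming simplified stand-ins for everything except the identifiers already named (the suspend/resume pair stands in for `ODistributedDatabaseImpl#suspend` and its counterpart, `checkErrorState` for `OAbstractPaginatedStorage#checkErrorState`):

```java
// Simplified shape of the hang, not the actual OrientDB code.
final class InstallHangSketch {
  // Stand-in for the message processing that ODistributedDatabaseImpl controls.
  interface MessageProcessor { void suspend(); void resume(); }

  private final MessageProcessor distributed;
  private volatile boolean storageInInternalError = true;   // simulates INTERNAL_ERROR

  InstallHangSketch(MessageProcessor distributed) { this.distributed = distributed; }

  boolean installDatabase(String databaseName) {
    distributed.suspend();   // message processing from the sending node paused here
    checkErrorState();       // throws while the storage is in INTERNAL_ERROR ...
    // ... the full sync / copy from the remote node would happen here ...
    distributed.resume();    // ... so this line is never reached and messages
    return true;             // stay suspended until the node is restarted
  }

  private void checkErrorState() {
    if (storageInInternalError)
      throw new IllegalStateException("storage is in INTERNAL_ERROR");
  }
}
```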
Steps to reproduce
I can reproduce this 100% with an instrumented build that simulates an unexpected exception during a storage operation (e.g. the `AlreadyClosedException` reported in #9814), controlling activation on one node in the distributed deployment using environment variables.
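For reference, the instrumentation amounts to an environment-variable-gated fault injected into a storage operation on one node; the class and variable names below are invented for illustration and are not part of OrientDB or of the actual instrumented build:

```java
// Hypothetical fault-injection hook, gated by an environment variable so that
// only the instrumented node in the cluster simulates the failure.
final class StorageFaultInjector {
  private static final boolean ENABLED =
      Boolean.parseBoolean(System.getenv("SIMULATE_STORAGE_FAILURE"));

  static void maybeFail(String operation) {
    if (ENABLED) {
      // Simulates the unexpected exception (e.g. the AlreadyClosedException from
      // #9814) that drives the storage into the INTERNAL_ERROR state.
      throw new IllegalStateException("Simulated failure during " + operation);
    }
  }
}
```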
Expected behavior
There are a variety of ways the primary symptoms could be fixed, but we wanted to discuss with the maintainers before implementing one or the other.

We could simply check whether the storage is in an internal error state at the start of any `installDatabase` call. This would fix the immediate problem, but wouldn't improve the overall situation, and might miss other cases where unexpected exceptions can be thrown during an `installDatabase` call. Similarly, we could handle `OStorageException` and resume message handling (with similar results to the above), or simply resume message handling in a `finally` block in `installDatabase` (more thorough, but it's unclear from the code whether that's safe/sensible); a sketch of that last option follows at the end of this report.

A broader question is whether the storage moving to an `INTERNAL_ERROR` state should take the entire node offline in a distributed database.
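As promised above, here is roughly the shape of the `finally`-based option, mirroring the earlier sketch but with the suspend/resume pair wrapped in `try`/`finally`. It is only a sketch, not a tested patch: the install body and the lookup are placeholders, and `resume()` is the assumed counterpart of `ODistributedDatabaseImpl#suspend` (the real method name may differ).

```java
// Sketch of the finally-based option: always resume message processing, even when
// the install fails (e.g. checkErrorState throwing OStorageException while the
// storage is in INTERNAL_ERROR). Everything except the suspend/try/finally/resume
// shape is a placeholder.
abstract class InstallDatabaseWithFinally {

  // Stand-in for ODistributedDatabaseImpl.
  interface DistributedDatabase { void suspend(); void resume(); }

  boolean installDatabase(String databaseName) {
    DistributedDatabase distributed = getDistributedDatabase(databaseName);
    distributed.suspend();
    try {
      // existing install logic, which may throw (e.g. OStorageException)
      return doInstall(databaseName);
    } finally {
      distributed.resume();   // always resume, even on failure
    }
  }

  abstract DistributedDatabase getDistributedDatabase(String databaseName);
  abstract boolean doInstall(String databaseName);
}
```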