09/03 - Release 53 - ISS-recoverable-1k-5m - Node3 is supposed to recover from an ISS but it didn't

mxtartaglia-sl commented 2 months ago

Node3 is supposed to recover from an ISS but it didn't.

Summary

Run: GCP-Daily-ISS-4N Test: ISS-recoverable-1k-5m Duration: 00:08:46.690 Configured Test Duration: 00:05:00.0 CI Environment: Github Actions Results Folder: http://35.247.76.217:8095/swirlds-automation/release/0.53/4N/ISS/20240903-104948-GCP-Daily-ISS-4N/ISS-recoverable-1k-5m Git: commit 1397e8a0e4a6a97089a50fec5d4eb7be362f0cf2 (HEAD -> release/0.53, tag: v0.53.5, origin/release/0.53) - Thu Aug 29 12:46:57 2024 -0400

---- ErrorValidator FAILED validation ----
(Node 3) Multiple (2) ISS's observed

---- PlatformStatusValidator FAILED validation ----
Node 3: Transitioning to CATASTROPHIC_FAILURE from OBSERVING is not a valid status for step 5 [StatusProgressionStep{targetStatus=CHECKING, optionalInterimStatuses=[RECONNECT_COMPLETE, BEHIND], requiredInterimStatuses=[OBSERVING], observedStatuses=[CATASTROPHIC_FAILURE, OBSERVING]}]

Report

Panel	Test Name	Owner	Age	Status	History	Summary	Logs	Metrics	Data	Notes
ISS	ISS-recoverable-1k-5m	platform	8.3h	FAIL	P P P P P P	summary	logs	metrics data	notes

mxtartaglia-sl commented 2 months ago

I found information that suggest this issue already occurred and it was deemed not a problem.

in this conversation: https://swirldslabs.slack.com/archives/C03KD8K1XAS/p1696501073167409 @lpetrovic05 crates this issue related to the same failure: https://github.com/hashgraph/hedera-services/issues/9040
In this conversation: https://swirldslabs.slack.com/archives/C06RA5SS9V3/p1713218839740149 Cody says: 2 ISSes when we only expected 1 is a flaw in the design of the test and not a signal of a real platform error The failure to sign the state is because we got unlucky, and a state snapshot landed close to the time when we needed to reconnect. and suggests no action
in this conversation: https://swirldslabs.slack.com/archives/C06RA5SS9V3/p1713453277513519 @edward-swirldslabs comments that the issue was encounter in the past and that the issue is assumed benign.
~~In this conversation: https://swirldslabs.slack.com/archives/C03KD8K1XAS/p1709136442982589 The latest ISS detector rewiring is mentioned.~~ NOT RELATED

It is not clear to me if the Iss detector is enabled in mainet.

I requested Austin assistance and to confirm if it is a not issue.

mxtartaglia-sl commented 2 months ago

@litt3 points out that this failure could actually be a problem

ISS detection is definitely enabled on mainnet.

The goal of the test is to see if a node that has a self ISS can recover from it. The test should be analyzed to understand how it is trying to accomplish that goal.

There is evidence that this failure has occurred in past tests for a long time. So, it reduces the urgency. But with that said, More analysis is needed to explain why it happened in this particular run.

Cody's explanation that the test is expecting a reconnect doesn't seem to explain this failure, since I looked at other recent test runs, and these also have no reconnect, but they passed.

lpetrovic05 commented 2 months ago

Background

Here is how the current ISS recovery works: When we encounter an ISS that is recoverable, we write down the round number in the scratchpad and end the process. After startup up again, we read the scratchpad and determine that we have ISSed before, we choose to ignore state signatures from PCES because we expect they will have a bad signature. Once we start gossiping, we validate signatures regularly.

The issue (I think)

The issue seem to be that we create a signature pre-restart that gets gossiped, but does not make into the PCES. So after restart, we receive this event from gossip and we don't ignore it, even though we should.

Solution

Implement the proposal to write self-events to PCES before gossiping them
Find another way of differentiating pre-restart and post-restart signatures

hashgraph / hedera-services