Open mxtartaglia-sl opened 2 months ago
I found information suggesting that this issue has occurred before and was deemed not a problem.
In this conversation: https://swirldslabs.slack.com/archives/C03KD8K1XAS/p1696501073167409 @lpetrovic05 created this issue related to the same failure: https://github.com/hashgraph/hedera-services/issues/9040
In this conversation: https://swirldslabs.slack.com/archives/C06RA5SS9V3/p1713218839740149
Cody says: 2 ISSes when we only expected 1 is a flaw in the design of the test and not a signal of a real platform error.
The failure to sign the state is because we got unlucky, and a state snapshot landed close to the time when we needed to reconnect.
He suggests no action.
In this conversation: https://swirldslabs.slack.com/archives/C06RA5SS9V3/p1713453277513519 @edward-swirldslabs comments that the issue was encountered in the past and is assumed to be benign.
~~In this conversation: https://swirldslabs.slack.com/archives/C03KD8K1XAS/p1709136442982589 The latest ISS detector rewiring is mentioned.~~ NOT RELATED
It is not clear to me whether the ISS detector is enabled on mainnet.
I asked Austin for assistance and to confirm whether this is a non-issue.
@litt3 points out that this failure could actually be a problem
ISS detection is definitely enabled on mainnet.
The goal of the test is to see if a node that has a self ISS can recover from it. The test should be analyzed to understand how it is trying to accomplish that goal.
There is evidence that this failure has occurred in past tests for a long time, which reduces the urgency. That said, more analysis is needed to explain why it happened in this particular run.
Cody's explanation that the test is expecting a reconnect doesn't seem to explain this failure: other recent test runs that I looked at also have no reconnect, yet they passed.
Here is how the current ISS recovery works: When we encounter an ISS that is recoverable, we write down the round number in the scratchpad and end the process. After starting up again, we read the scratchpad and determine that we have ISSed before, so we choose to ignore state signatures from the PCES because we expect them to be bad. Once we start gossiping, we validate signatures normally.
The issue seems to be that we create a signature pre-restart that gets gossiped but does not make it into the PCES. So after restart, we receive this event from gossip and we don't ignore it, even though we should.
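To make the suspected gap concrete, here is a toy model of the recovery rule described above. It is only a sketch and not the platform code: `StateSignature`, `RestartedNode`, `handle`, and the round numbers are hypothetical names and values invented for illustration. The only rule it models is "after a recorded ISS, ignore state signatures replayed from the PCES and validate everything that arrives via gossip".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class StateSignature:
    # Hypothetical stand-in for a state signature carried by an event.
    round_number: int
    source: str               # "pces" (replayed from disk) or "gossip"
    signed_pre_restart: bool  # True if produced by the process that ISSed


@dataclass
class RestartedNode:
    # Round read from the scratchpad at startup; None means no prior ISS.
    scratchpad_iss_round: Optional[int]
    validated: List[StateSignature] = field(default_factory=list)

    def handle(self, sig: StateSignature) -> bool:
        """Return True if the signature is passed on to validation."""
        # Modeled rule: after an ISS, skip signatures replayed from the PCES.
        if self.scratchpad_iss_round is not None and sig.source == "pces":
            return False
        # Everything else (i.e. gossip) is validated, regardless of when
        # the signature was actually created.
        self.validated.append(sig)
        return True


node = RestartedNode(scratchpad_iss_round=100)

from_pces = StateSignature(100, source="pces", signed_pre_restart=True)
# Created before the restart, gossiped, but never written to the PCES.
from_gossip = StateSignature(101, source="gossip", signed_pre_restart=True)
fresh = StateSignature(150, source="gossip", signed_pre_restart=False)

results = [node.handle(s) for s in (from_pces, from_gossip, fresh)]
leaked = [s for s in node.validated if s.signed_pre_restart]
print(results)
print(leaked)
```

In this model the filter keys on where a signature arrives from rather than on when it was created, so the pre-restart signature that comes back via gossip ends up in `leaked`. Whether the real fix should key on the round recorded in the scratchpad or on something else is left open here.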
Node3 is supposed to recover from an ISS but it didn't.
Summary
Run: GCP-Daily-ISS-4N
Test: ISS-recoverable-1k-5m
Duration: 00:08:46.690
Configured Test Duration: 00:05:00.0
CI Environment: GitHub Actions
Results Folder: http://35.247.76.217:8095/swirlds-automation/release/0.53/4N/ISS/20240903-104948-GCP-Daily-ISS-4N/ISS-recoverable-1k-5m
Git: commit 1397e8a0e4a6a97089a50fec5d4eb7be362f0cf2 (HEAD -> release/0.53, tag: v0.53.5, origin/release/0.53) - Thu Aug 29 12:46:57 2024 -0400