hashgraph / hedera-services

Crypto, token, consensus, file, and smart contract services for the Hedera public ledger
Apache License 2.0
316 stars 138 forks source link

09/03 - Release 53 - ISS-recoverable-1k-5m - Node3 is supposed to recover from an ISS but it didn't #15354

Open mxtartaglia-sl opened 2 months ago

mxtartaglia-sl commented 2 months ago

Node3 is supposed to recover from an ISS but it didn't.

Summary

Run: GCP-Daily-ISS-4N Test: ISS-recoverable-1k-5m Duration: 00:08:46.690 Configured Test Duration: 00:05:00.0 CI Environment: Github Actions Results Folder: http://35.247.76.217:8095/swirlds-automation/release/0.53/4N/ISS/20240903-104948-GCP-Daily-ISS-4N/ISS-recoverable-1k-5m Git: commit 1397e8a0e4a6a97089a50fec5d4eb7be362f0cf2 (HEAD -> release/0.53, tag: v0.53.5, origin/release/0.53) - Thu Aug 29 12:46:57 2024 -0400

---- ErrorValidator FAILED validation ----
(Node 3) Multiple (2) ISS's observed
---- PlatformStatusValidator FAILED validation ----
Node 3: Transitioning to CATASTROPHIC_FAILURE from OBSERVING is not a valid status for step 5 [StatusProgressionStep{targetStatus=CHECKING, optionalInterimStatuses=[RECONNECT_COMPLETE, BEHIND], requiredInterimStatuses=[OBSERVING], observedStatuses=[CATASTROPHIC_FAILURE, OBSERVING]}]

Report

Panel Test Name Owner Age Status History Summary Logs Metrics Data Notes
ISS ISS-recoverable-1k-5m platform 8.3h FAIL PPPPPP summary logs metrics data notes
mxtartaglia-sl commented 2 months ago

I found information that suggest this issue already occurred and it was deemed not a problem.

It is not clear to me if the Iss detector is enabled in mainet.

I requested Austin assistance and to confirm if it is a not issue.

mxtartaglia-sl commented 2 months ago

@litt3 points out that this failure could actually be a problem

ISS detection is definitely enabled on mainnet.

The goal of the test is to see if a node that has a self ISS can recover from it. The test should be analyzed to understand how it is trying to accomplish that goal.

There is evidence that this failure has occurred in past tests for a long time. So, it reduces the urgency. But with that said, More analysis is needed to explain why it happened in this particular run.

Cody's explanation that the test is expecting a reconnect doesn't seem to explain this failure, since I looked at other recent test runs, and these also have no reconnect, but they passed. 
lpetrovic05 commented 2 months ago

Background

Here is how the current ISS recovery works: When we encounter an ISS that is recoverable, we write down the round number in the scratchpad and end the process. After startup up again, we read the scratchpad and determine that we have ISSed before, we choose to ignore state signatures from PCES because we expect they will have a bad signature. Once we start gossiping, we validate signatures regularly.

The issue (I think)

The issue seem to be that we create a signature pre-restart that gets gossiped, but does not make into the PCES. So after restart, we receive this event from gossip and we don't ignore it, even though we should.

Solution

  1. Implement the proposal to write self-events to PCES before gossiping them
  2. Find another way of differentiating pre-restart and post-restart signatures