hashgraph / hedera-services

Crypto, token, consensus, file, and smart contract services for the Hedera public ledger
Apache License 2.0

Nodes failed to freeze in JRS test `Crypto-Update-Setting-Config-1.5k-25m` #10846

Closed: litt3 closed 9 months ago

litt3 commented 10 months ago

data folder

Validator failure:

---- UpdateReconnectTestStepValidator FAILED validation ----
Error at step: Waiting for 2nd restart finished

litt3 commented 9 months ago

Duplicate of https://github.com/hashgraph/hedera-services/issues/9317

edward-swirldslabs commented 9 months ago

I don't think this one is properly a dupe. The ticket it is marked as a duplicate of cannot be investigated, since the links to its JRS test results no longer work.

edward-swirldslabs commented 9 months ago

Same test, different but possibly related error message:

---- UpdateReconnectTestStepValidator FAILED validation ----
Error at step: Waiting for restart finished of 1st update

http://35.247.76.217:8095/swirlds-automation/release/0.46/4N_2C/Update/20240125-104159-GCP-Daily-Crypto-Update-4N-2C/Crypto-Update-Setting-Config-1.5k-25m/summary.txt

edward-swirldslabs commented 9 months ago

The Problem: The test script expects to see a single ACTIVE status, which it uses as the trigger for starting the client, and it reports an error if ACTIVE is seen more than once.

Probable Cause: PCES replay causes self events to reach consensus, which moves the platform status to ACTIVE while the node is still establishing gossip. It then takes longer than 10s for newly self-created events to reach consensus, so the status falls back to CHECKING and switches back to ACTIVE only after the gossiped self events reach consensus.
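The sequence described above can be illustrated with a toy model. This is only a sketch, not the platform's actual status state machine: the function names, the timestamps, and the simplified rule "fall back to CHECKING when no self event has reached consensus for 10 seconds" are assumptions based on the description in this comment.

```python
from typing import List, Tuple

# Assumption taken from the comment above: the node leaves ACTIVE when none
# of its own events has reached consensus for this many seconds.
ACTIVE_TIMEOUT_S = 10.0


def status_timeline(self_consensus_times: List[float],
                    end_time: float) -> List[Tuple[float, str]]:
    """Return (time, status) transitions for a toy CHECKING/ACTIVE model.

    self_consensus_times: sorted times (seconds since start) at which a
    self event reached consensus, whether replayed from PCES or gossiped.
    """
    transitions = [(0.0, "CHECKING")]
    status = "CHECKING"
    last_consensus = None
    for t in self_consensus_times:
        # Did the timeout expire before this event reached consensus?
        if status == "ACTIVE" and t - last_consensus > ACTIVE_TIMEOUT_S:
            transitions.append((last_consensus + ACTIVE_TIMEOUT_S, "CHECKING"))
            status = "CHECKING"
        if status == "CHECKING":
            transitions.append((t, "ACTIVE"))
            status = "ACTIVE"
        last_consensus = t
    # Handle a timeout that expires after the last observed event.
    if status == "ACTIVE" and end_time - last_consensus > ACTIVE_TIMEOUT_S:
        transitions.append((last_consensus + ACTIVE_TIMEOUT_S, "CHECKING"))
    return transitions


def count_active_entries(transitions: List[Tuple[float, str]]) -> int:
    """Count how many times the timeline enters ACTIVE."""
    return sum(1 for _, status in transitions if status == "ACTIVE")


# Hypothetical timestamps: PCES replay yields self events at 1s and 2s;
# gossiped self events only start reaching consensus at 15s.
timeline = status_timeline([1.0, 2.0, 15.0, 18.0, 21.0], end_time=25.0)
for time_s, status in timeline:
    print(f"{time_s:5.1f}s  {status}")
print("ACTIVE entries:", count_active_entries(timeline))
```

With these made-up timestamps the model enters ACTIVE twice, which is the pattern a validator expecting exactly one ACTIVE would flag.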

The develop branch already has a commit merged that flushes transaction handling before switching out of PCES replay. This is thought to mitigate what we are seeing here.

There is no fix for release branches that have already been cut.

If this issue persists, the recommended course of action is to change the test executor/validator so that it is not sensitive to a node falling back into CHECKING at startup. Proceed with sending transactions from the client; if they are not handled within the expected time frame, that is the failure condition for these tests, not the platform status.
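A rough sketch of what such a tolerant check could look like is below. It is not JRS code: the `Observation` record, the `validate_node_start` function, the record kinds, and the 60-second deadline are all hypothetical, and using the first ACTIVE as the client start trigger is an assumption carried over from the current test script.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    time_s: float    # seconds since node start
    kind: str        # "STATUS" or "TXN_HANDLED" (hypothetical record kinds)
    value: str = ""  # platform status name for STATUS records


def validate_node_start(observations: List[Observation],
                        handle_deadline_s: float) -> Optional[str]:
    """Return None on success, or an error message on failure.

    The client is assumed to start at the first ACTIVE status. Later
    CHECKING/ACTIVE flapping is ignored; the only failure condition is
    that no transaction is handled within the deadline.
    """
    first_active = next(
        (o.time_s for o in observations
         if o.kind == "STATUS" and o.value == "ACTIVE"),
        None,
    )
    if first_active is None:
        return "node never reached ACTIVE, so the client was never started"
    deadline = first_active + handle_deadline_s
    for o in observations:
        if o.kind == "TXN_HANDLED" and first_active <= o.time_s <= deadline:
            return None
    return f"no transaction handled within {handle_deadline_s}s of client start"


# Hypothetical log: the status flaps, but a transaction is still handled.
flapping = [
    Observation(1.0, "STATUS", "ACTIVE"),
    Observation(12.0, "STATUS", "CHECKING"),
    Observation(15.0, "STATUS", "ACTIVE"),
    Observation(20.0, "TXN_HANDLED"),
]
# Hypothetical log: the node falls back to CHECKING and never recovers.
stalled = [
    Observation(1.0, "STATUS", "ACTIVE"),
    Observation(12.0, "STATUS", "CHECKING"),
]

print(validate_node_start(flapping, handle_deadline_s=60.0))
print(validate_node_start(stalled, handle_deadline_s=60.0))
```

The first log passes even though ACTIVE appears twice, while the second fails because no transaction was handled in time.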