[BUG] Wrong date on one node creates an error expired event which can't be deleted

Describe the bug: We had an incident with one node of our Service Fabric cluster: the server's system date was changed to a date in the future. We corrected the date, but Service Fabric still holds an OK health event stamped with the future date. That event is now reported as expired and causes an Error on the partition that runs the system service fabric:/System/FailoverManagerService.

On 14/08/2024, the date on one of the nodes was changed to 29/09/2024 and stayed wrong for about 5 hours before it was corrected. The node is a regular node (not a seed node).

Error :

PS D:\> Get-ServiceFabricPartitionHealth

cmdlet Get-ServiceFabricPartitionHealth at command pipeline position 1
Supply values for the following parameters:

PartitionId           : 00000000-0000-0000-0000-000000000001
AggregatedHealthState : Error
UnhealthyEvaluations  :
                        The OK reported by 'System.FMM' for property 'State' is expired. The report was applied at 2024-08-14 01:00:27.218 with TTL 15:00.000.
                        Partition is healthy.

ReplicaHealthStates   :
                        ReplicaId             : 132601204022355426
                        AggregatedHealthState : Ok

                        ReplicaId             : 132601204008971912
                        AggregatedHealthState : Ok

                        ReplicaId             : 132601204022355427
                        AggregatedHealthState : Ok

HealthEvents          :
                        SourceId              : System.FMM
                        Property              : State
                        HealthState           : Ok
                        SequenceNumber        : 133720681240404006
                        SentAt                : 29/09/2024 07:22:04
                        ReceivedAt            : 14/08/2024 01:00:27
                        TTL                   : 00:15:00
                        RemoveWhenExpired     : False
                        IsExpired             : True
                        HealthReportID        : FMM_7.0_1009
                        Transitions           : Warning->Ok = 12/07/2024 22:50:20, LastError = 01/01/0001 00:00:00

HealthStatistics      :
                        Replica               : 3 Ok, 0 Warning, 0 Error
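
One detail in the output above is worth noting: the SequenceNumber of the stuck event looks like a Windows FILETIME value (100-nanosecond ticks since 1601-01-01 UTC). The Python sketch below decodes it under that assumption. That the sequence number is derived from the node clock is inferred from the value itself, not confirmed by Service Fabric documentation, and the helper name `filetime_to_datetime` is only illustrative.

```python
from datetime import datetime, timedelta

# Start of the Windows FILETIME epoch (UTC).
FILETIME_EPOCH = datetime(1601, 1, 1)

def filetime_to_datetime(ticks: int) -> datetime:
    """Convert 100-nanosecond ticks since 1601-01-01 into a naive UTC datetime."""
    # Integer division avoids float rounding; sub-microsecond precision is dropped.
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)

# SequenceNumber of the System.FMM 'State' event from the output above.
sequence_number = 133720681240404006
decoded = filetime_to_datetime(sequence_number)
print(decoded)  # 2024-09-29 07:22:04.040400
```

The decoded value matches the SentAt field (29/09/2024 07:22:04), that is, the wrong clock on the node. In PowerShell, the same conversion can be done with `[DateTime]::FromFileTimeUtc(133720681240404006)`.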

Area/Component: Partition that runs the system service fabric:/System/FailoverManagerService

To Reproduce: Steps to reproduce the behavior:

Set the system date of one node far into the future.
Wait until the cluster receives health events from this node stamped with the future date.
Correct the date on this node.
The partition then reports the error: The OK reported by 'System.FMM' for property 'State' is expired...

Expected behavior: Correcting the date on the node should generate new health events that clear the error.

Observed behavior: The cluster health state stays in Error. This blocks Service Fabric package updates.

Service Fabric Runtime Version: 9.1.1390.9590

Environment:

Standalone
OS: Windows Server 2016
Version 9.1.1390.9590

If this is a regression, which version did it regress from?

Additional context: We tried restarting the VMs. We also tried sending a partition health report:

Send-ServiceFabricPartitionHealthReport -PartitionId 00000000-0000-0000-0000-000000000001 -SourceId "System.FMM" -HealthProperty "State" -HealthState Ok -TimeToLiveSec 30 -RemoveWhenExpired

and running a repair:

Repair-ServiceFabricPartition -PartitionId 00000000-0000-0000-0000-000000000001

Neither of these had any effect. We have found no way to reset the event status.
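
A possible explanation, offered as an assumption rather than a confirmed diagnosis: if the health store discards any report whose sequence number is not greater than the last one applied for the same entity, source and property, and if System.FMM derives its sequence numbers from the node clock, then reports sent after the clock was corrected would be treated as stale until real time passes 29/09/2024. The Python sketch below illustrates that comparison. `to_filetime` and `is_stale` are hypothetical helpers, not Service Fabric APIs, and the two sample timestamps are made up for illustration.

```python
from datetime import datetime

# Start of the Windows FILETIME epoch (UTC).
FILETIME_EPOCH = datetime(1601, 1, 1)

def to_filetime(moment: datetime) -> int:
    """Return 100-nanosecond ticks since 1601-01-01 for a naive UTC datetime."""
    delta = moment - FILETIME_EPOCH
    return (delta.days * 86_400 + delta.seconds) * 10_000_000 + delta.microseconds * 10

def is_stale(new_sequence: int, stored_sequence: int) -> bool:
    """Assumed rule: a report is dropped unless its sequence number is strictly higher."""
    return new_sequence <= stored_sequence

# SequenceNumber of the stuck System.FMM event from the health output.
stored = 133720681240404006

# A report stamped shortly after the clock was corrected (hypothetical time).
after_fix = to_filetime(datetime(2024, 8, 14, 2, 0, 0))
# A report stamped once real time has passed the bogus date (hypothetical time).
after_catch_up = to_filetime(datetime(2024, 9, 30, 0, 0, 0))

print(is_stale(after_fix, stored))       # True
print(is_stale(after_catch_up, stored))  # False
```

If this assumption holds, it would be consistent with the event staying expired even though the node keeps running with the correct date.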

Assignees: /cc @microsoft/service-fabric-triage

microsoft/service-fabric, issue #1508