Describe the bug
We had an incident with one node in our Service Fabric cluster: the server's system date was changed to a date in the future. We corrected the date, but Service Fabric still holds an OK health event dated in the future, which causes an error on the partition that runs the system service fabric:/System/FailoverManagerService.
On 14/08/2024, one of the nodes had its date changed to 29/09/2024 for about 5 hours before the date was corrected. The node is a regular node (not a seed node).
Error :
PS D:\> Get-ServiceFabricPartitionHealth
cmdlet Get-ServiceFabricPartitionHealth at command pipeline position 1
Supply values for the following parameters:
PartitionId : 00000000-0000-0000-0000-000000000001
AggregatedHealthState : Error
UnhealthyEvaluations :
The OK reported by 'System.FMM' for property 'State' is expired. The report was applied at 2024-08-14 01:00:27.218 with TTL 15:00.000.
Partition is healthy.
ReplicaHealthStates :
ReplicaId : 132601204022355426
AggregatedHealthState : Ok
ReplicaId : 132601204008971912
AggregatedHealthState : Ok
ReplicaId : 132601204022355427
AggregatedHealthState : Ok
HealthEvents :
SourceId : System.FMM
Property : State
HealthState : Ok
SequenceNumber : 133720681240404006
SentAt : 29/09/2024 07:22:04
ReceivedAt : 14/08/2024 01:00:27
TTL : 00:15:00
RemoveWhenExpired : False
IsExpired : True
HealthReportID : FMM_7.0_1009
Transitions : Warning->Ok = 12/07/2024 22:50:20, LastError = 01/01/0001 00:00:00
HealthStatistics :
Replica : 3 Ok, 0 Warning, 0 Error
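The timestamps in the health event above can be cross-checked with a short script. This is a minimal sketch, not part of the original report: it assumes the SequenceNumber is a Windows FILETIME value (100-nanosecond ticks since 1601-01-01 UTC) and that SentAt and ReceivedAt are shown in the same time zone. The helper name `filetime_to_datetime` is hypothetical.

```python
from datetime import datetime, timedelta

# Values copied from the HealthEvents output above.
SEQUENCE_NUMBER = 133720681240404006
SENT_AT = datetime(2024, 9, 29, 7, 22, 4)      # stamped by the node with the wrong clock
RECEIVED_AT = datetime(2024, 8, 14, 1, 0, 27)  # stamped when the report was applied
TTL = timedelta(minutes=15)

# Assumption: the sequence number counts 100 ns ticks since 1601-01-01 UTC.
FILETIME_EPOCH = datetime(1601, 1, 1)


def filetime_to_datetime(ticks: int) -> datetime:
    """Convert FILETIME ticks to a naive UTC datetime (microsecond precision)."""
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


decoded = filetime_to_datetime(SEQUENCE_NUMBER)
skew = SENT_AT - RECEIVED_AT
expires_at = RECEIVED_AT + TTL

print("sequence number as time:", decoded)
print("SentAt - ReceivedAt:    ", skew)
print("report expired at:      ", expires_at)
```

Under that assumption, the sequence number decodes to the same second as SentAt (29/09/2024 07:22:04), which suggests it was generated from the skewed clock. If reports for this property are ordered by sequence number, reports generated after the clock was corrected would carry lower numbers than the stored one, which might explain why the event is not replaced; nothing in this report confirms that.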
Area/Component:
The partition that runs the system service fabric:/System/FailoverManagerService
To Reproduce
Steps to reproduce the behavior:
Set the date of one node far into the future.
The cluster receives health events from this node with dates in the future.
Correct the date on this node.
The partition reports the error: The OK reported by 'System.FMM' for property 'State' is expired...
Expected behavior
Correcting the date on the node should generate new events that clear the error.
Observed behavior:
The cluster health state is Error. This blocks Service Fabric package updates.
Screenshots
Service Fabric Runtime Version:
9.1.1390.9590
Environment:
Standalone
OS: Windows Server 2016
Version 9.1.1390.9590
If this is a regression, which version did it regress from?
Additional context
We tried restarting the VMs.
We also tried sending a partition health report: Send-ServiceFabricPartitionHealthReport -PartitionId 00000000-0000-0000-0000-000000000001 -SourceId "System.FMM" -HealthProperty "State" -HealthState Ok -TimeToLiveSec 30 -RemoveWhenExpired
We also tried a repair: Repair-ServiceFabricPartition -PartitionId 00000000-0000-0000-0000-000000000001
We have found no way to reset the event status.
Assignees: /cc @microsoft/service-fabric-triage