microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License

[BUG] Wrong date on one node creates an error expired event which can't be deleted #1508

Open shessane opened 3 months ago

shessane commented 3 months ago

Describe the bug We had an incident with one node in our Service Fabric cluster. The server's system date was changed to a date in the future. We corrected the date, but Service Fabric still holds an OK event timestamped in the future, which causes an error on the partition that runs the system service fabric:/System/FailoverManagerService.

On 14/08/2024, the date on one of the nodes was changed to 29/09/2024 for about 5 hours before it was corrected. The node is a regular node (not a seed node).

Error :

PS D:\> Get-ServiceFabricPartitionHealth

cmdlet Get-ServiceFabricPartitionHealth at command pipeline position 1
Supply values for the following parameters:

PartitionId           : 00000000-0000-0000-0000-000000000001
AggregatedHealthState : Error
UnhealthyEvaluations  :
                        The OK reported by 'System.FMM' for property 'State' is expired. The report was applied at 2024-08-14 01:00:27.218 with TTL 15:00.000.
                        Partition is healthy.

ReplicaHealthStates   :
                        ReplicaId             : 132601204022355426
                        AggregatedHealthState : Ok

                        ReplicaId             : 132601204008971912
                        AggregatedHealthState : Ok

                        ReplicaId             : 132601204022355427
                        AggregatedHealthState : Ok

HealthEvents          :
                        SourceId              : System.FMM
                        Property              : State
                        HealthState           : Ok
                        SequenceNumber        : 133720681240404006
                        SentAt                : 29/09/2024 07:22:04
                        ReceivedAt            : 14/08/2024 01:00:27
                        TTL                   : 00:15:00
                        RemoveWhenExpired     : False
                        IsExpired             : True
                        HealthReportID        : FMM_7.0_1009
                        Transitions           : Warning->Ok = 12/07/2024 22:50:20, LastError = 01/01/0001 00:00:00

HealthStatistics      :
                        Replica               : 3 Ok, 0 Warning, 0 Error
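The timestamps in the HealthEvents block can be cross-checked with a little arithmetic. The sketch below is not part of the original report. It assumes that SequenceNumber is a Windows FILETIME-style tick count (100-nanosecond units since 1601-01-01), which the thread does not state, and it treats SentAt and ReceivedAt as being in the same time zone. The helper name ticks_to_datetime is hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: SequenceNumber is a FILETIME-style count of 100-nanosecond
# ticks since 1601-01-01. The issue itself does not say this.
FILETIME_EPOCH = datetime(1601, 1, 1)


def ticks_to_datetime(ticks: int) -> datetime:
    """Convert 100 ns ticks since 1601-01-01 to a naive datetime."""
    # Integer division by 10 turns 100 ns ticks into whole microseconds.
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


# Values copied from the HealthEvents output in the report.
sequence_number = 133720681240404006
sent_at = datetime(2024, 9, 29, 7, 22, 4)
received_at = datetime(2024, 8, 14, 1, 0, 27)

decoded = ticks_to_datetime(sequence_number)
skew = sent_at - received_at

print(decoded)  # 2024-09-29 07:22:04.040400
print(skew)     # 46 days, 6:21:37
```

Under that assumption the sequence number decodes to the same instant as SentAt, so it appears to have been generated from the node's wrong clock. If the health store orders reports by sequence number, reports sent after the clock was corrected would carry lower numbers than this one; that is a hypothesis, not something confirmed in this thread.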

Area/Component: Partition that runs the system service fabric:/System/FailoverManagerService

To Reproduce Steps to reproduce the behavior:

  1. Set the date of one node far into the future.
  2. The cluster should receive events from this node with dates in the future.
  3. Fix the date on this node.
  4. The partition will report the error: The OK reported by 'System.FMM' for property 'State' is expired...

Expected behavior Fixing the date on the node should generate new events that clear the error event.

Observed behavior: The cluster status is Error. This blocks Service Fabric package updates.
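The Error state on an OK report follows from three fields in the health output: ReceivedAt, TTL, and RemoveWhenExpired. An expired report with RemoveWhenExpired set to False stays in the health store and is evaluated as an error. The sketch below is a simplified model of that rule, not Service Fabric code. The function evaluate_event and the chosen "now" value are hypothetical.

```python
from datetime import datetime, timedelta


def evaluate_event(health_state, received_at, ttl, remove_when_expired, now):
    """Simplified, hypothetical model of evaluating one health event."""
    expired = now > received_at + ttl
    if not expired:
        return health_state
    if remove_when_expired:
        return None  # an expired report would be dropped from the store
    return "Error"   # an expired report that is kept counts as an error


# Values taken from the health output: applied at 2024-08-14 01:00:27.218,
# TTL 15 minutes, RemoveWhenExpired False.
received_at = datetime(2024, 8, 14, 1, 0, 27, 218000)
ttl = timedelta(minutes=15)
expires_at = received_at + ttl
print(expires_at)  # 2024-08-14 01:15:27.218000

# Hypothetical evaluation time, a few hours after the report was applied.
now = datetime(2024, 8, 14, 6, 0, 0)
state = evaluate_event("Ok", received_at, ttl, False, now)
print(state)  # Error
```

In this model the partition only returns to Ok once a fresh report for the same source and property is accepted, which is the step that does not seem to happen here.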


Service Fabric Runtime Version: 9.1.1390.9590

Environment:

If this is a regression, which version did it regress from?

Additional context We tried restarting the VMs. We also tried sending a partition health report:

Send-ServiceFabricPartitionHealthReport -PartitionId 00000000-0000-0000-0000-000000000001 -SourceId "System.FMM" -HealthProperty "State" -HealthState Ok -TimeToLiveSec 30 -RemoveWhenExpired

We also tried a repair:

Repair-ServiceFabricPartition -PartitionId 00000000-0000-0000-0000-000000000001

We have found no way to reset the event status.


Assignees: /cc @microsoft/service-fabric-triage

dribblor commented 1 month ago

Same here. It is present on our 10.0.1949.9590 single node cluster and is still there after updating it to 10.1.2338.9590.
