microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/

ETL Failure With Performance Counters #663

Open · ChesneyMark opened this issue 5 years ago

ChesneyMark commented 5 years ago

Below is an exception we intermittently see in the Microsoft diagnostics. When it occurs, the affected node (or nodes, as it can happen on several at once) becomes unstable, the cluster goes out of balance, and eventually reliable storage fails. The only way we can restore the cluster is to manually stop the performance counter on the affected node and then start it again, which appears to delete the underlying ETL file and create a new one. This is a standalone cluster running version 6.3.162.9494 of Service Fabric. Restarting the server or restarting the affected node via the Service Fabric dashboard does not resolve the issue.

Failed to read some or all of the events from ETL file C:\ProgramData\SF\Log\OperationalTraces\operational_traces_6.3.162.9494_131868298096256810_1.etl.
System.ComponentModel.Win32Exception (0x80004005): The handle is invalid
   at Tools.EtlReader.TraceFileEventReader.ReadEvents(DateTime startTime, DateTime endTime)
   at System.Fabric.Dca.Utility.PerformWithRetries[T](Action`1 worker, T context, RetriableOperationExceptionHandler exceptionHandler, Int32 initialRetryIntervalMs, Int32 maxRetryCount, Int32 maxRetryIntervalMs)
   at FabricDCA.EtlProcessor.ProcessActiveEtlFile(FileInfo etlFile, DateTime lastEndTime, DateTime& newEndTime, CancellationToken cancellationToken)
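For reference, this is roughly how we script the workaround described above. It assumes the performance counter is driven by a named data collector set; "PerfCounterSet" below is a placeholder (not the real name on our nodes) and the exact logman flags may differ in your environment:

# Sketch of the manual workaround: stop the (hypothetical) performance counter
# collector set so Performance Monitor releases its handle on the ETL file,
# then start it again so a fresh ETL file is created.
# "PerfCounterSet" is a placeholder name; adjust to your own collector set.
import subprocess
import time
from pathlib import Path

COLLECTOR_SET = "PerfCounterSet"          # placeholder, not the real name
STALE_ETL = Path(r"C:\ProgramData\SF\Log\OperationalTraces"
                 r"\operational_traces_6.3.162.9494_131868298096256810_1.etl")

def run(cmd):
    """Run a command and print its outcome without raising on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(" ".join(cmd), "->", result.returncode,
          result.stdout.strip() or result.stderr.strip())
    return result.returncode == 0

# Stop the collector so the handle on the ETL file is released.
run(["logman", "stop", COLLECTOR_SET])
time.sleep(5)  # give the session a moment to flush and close the file

# At this point the stale ETL file should be deletable or already replaced.
print("stale ETL still present:", STALE_ETL.exists())

# Start the collector again; a new ETL file should be created.
run(["logman", "start", COLLECTOR_SET])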

ChesneyMark commented 5 years ago

We have also hit a similar issue today on two of our nodes. In addition to the failure mentioned in the original post, query_traces have now failed as well, with the same results.

Failed to read some or all of the events from ETL file C:\ProgramData\SF\Log\QueryTraces\query_traces_6.3.162.9494_131868300071125898_13.etl.
System.ComponentModel.Win32Exception (0x80004005): The handle is invalid
   at Tools.EtlReader.TraceFileEventReader.ReadEvents(DateTime startTime, DateTime endTime)
   at System.Fabric.Dca.Utility.PerformWithRetries[T](Action`1 worker, T context, RetriableOperationExceptionHandler exceptionHandler, Int32 initialRetryIntervalMs, Int32 maxRetryCount, Int32 maxRetryIntervalMs)
   at FabricDCA.EtlProcessor.ProcessActiveEtlFile(FileInfo etlFile, DateTime lastEndTime, DateTime& newEndTime, CancellationToken cancellationToken)
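For context on the PerformWithRetries frame in the traces above: the DCA wraps the ETL read in a retry helper, and the error is only surfaced once the retries are exhausted. The sketch below is a rough Python analogue inferred purely from the parameter names in the stack trace; the actual Service Fabric implementation may behave differently.

# Illustrative analogue of the PerformWithRetries frame shown above.
# Inferred from the parameter names only; not the real Service Fabric code.
import time

def perform_with_retries(worker, context, exception_handler,
                         initial_retry_interval_ms=100,
                         max_retry_count=3,
                         max_retry_interval_ms=5000):
    """Call worker(context), retrying with capped exponential backoff."""
    interval_ms = initial_retry_interval_ms
    for attempt in range(max_retry_count + 1):
        try:
            return worker(context)
        except Exception as exc:
            # The handler decides whether the exception is retriable.
            if attempt == max_retry_count or not exception_handler(exc):
                raise  # surfaced as "Failed to read some or all of the events..."
            time.sleep(interval_ms / 1000.0)
            interval_ms = min(interval_ms * 2, max_retry_interval_ms)

# A worker that keeps failing with an "invalid handle" style error will
# exhaust the retries and re-raise, which is what the DCA then logs for the
# active ETL file.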

masnider commented 5 years ago

Most of the time when we see this, it is correlated with a resource/capacity crunch on the machines. Can you look at the CPU and especially the memory consumption on the machines and check that there is sufficient headroom? When resources are very limited, all sorts of seemingly random things will fail.
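One quick way to spot-check this on each node is to sample CPU and memory headroom. A minimal sketch using psutil (the thresholds are arbitrary examples, not Service Fabric recommendations):

# Spot-check CPU and memory headroom on a node.
# Requires: pip install psutil. Thresholds are illustrative only.
import psutil

CPU_WARN_PERCENT = 90      # arbitrary example threshold
MEM_WARN_PERCENT = 90      # arbitrary example threshold

cpu = psutil.cpu_percent(interval=5)        # average CPU over a 5 s window
mem = psutil.virtual_memory()               # system-wide memory statistics

print(f"CPU: {cpu:.1f}%  Memory: {mem.percent:.1f}% used "
      f"({mem.available / 2**30:.1f} GiB available)")

if cpu > CPU_WARN_PERCENT or mem.percent > MEM_WARN_PERCENT:
    print("Node is close to its CPU/memory capacity; this can cause "
          "intermittent failures like the ETL read errors above.")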

ChesneyMark commented 4 years ago

Sorry for the very long delay on this. The issue appears to be a race condition in the ETL archiving process: it tries to move the file while the file is still in use, but it is not smart enough to realise the move failed. Performance Monitor then still holds a lock on the file until either we stop the counter in Performance Monitor, which allows the file to be deleted, or we restart the node. There is no issue with either CPU or memory; this happens on all of our clusters and is still present in the v7.x runtimes.
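To illustrate the race, the sketch below shows one way an archiver could detect that the move actually failed instead of silently carrying on. This is a generic example, not how the Service Fabric DCA is actually implemented.

# Generic illustration of the race described above: attempt to archive an ETL
# file and detect that the move failed because another process (here,
# Performance Monitor) still holds a handle on it. Not the actual DCA code.
import shutil
import time

def try_archive(src, dst, attempts=5, delay_s=30):
    """Move src to dst, reporting failure instead of assuming success."""
    for attempt in range(1, attempts + 1):
        try:
            shutil.move(src, dst)
            print(f"archived {src} -> {dst}")
            return True
        except PermissionError:
            # On Windows this is what a file locked by another process looks like.
            print(f"attempt {attempt}: {src} is still locked, retrying in {delay_s}s")
            time.sleep(delay_s)
    print(f"giving up: {src} is still in use; archive it after the counter is stopped")
    return False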

However, we no longer believe this is what makes the node unstable; it just costs us the diagnostics that would let us pin down the real problem, which is as follows. We have found a race condition in reliable storage snapshots: when a snapshot fails on a timeout (we think caused by one of the SF background processes scanning and locking the reliable storage), every snapshot from that point on fails, and our microservices consume more and more memory until the microservice itself deadlocks. We are trying to reproduce this in a test lab with a load simulator, but have had no luck so far. What we do know is that it does not occur with actors with three replicas, and does not occur with stateful services with a single replica, but it does happen randomly with stateful services that have two or more replicas. It occurs with both three-node and five-node cluster configurations, and again is still present in the v7 runtime.
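Until we can reproduce it in the lab, we run a simple watchdog that flags the sustained memory growth we see after the snapshot failures start. A minimal sketch of that idea follows; the process name, sample interval, and threshold are placeholders rather than values from our clusters.

# Watchdog sketch: flag sustained memory growth in a service process, which is
# the symptom we see after the snapshot failures begin. The process name and
# thresholds are placeholders, not values from our clusters.
import time
import psutil

PROCESS_NAME = "MyStatefulService.exe"    # placeholder service host name
SAMPLE_INTERVAL_S = 60
GROWTH_SAMPLES = 10                       # alert after this many rising samples

def find_process(name):
    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] == name:
            return proc
    return None

def watch():
    proc = find_process(PROCESS_NAME)
    if proc is None:
        print(f"{PROCESS_NAME} not found")
        return
    rising, last = 0, 0
    while True:
        rss = proc.memory_info().rss
        rising = rising + 1 if rss > last else 0
        last = rss
        if rising >= GROWTH_SAMPLES:
            print(f"{PROCESS_NAME} memory has grown for {GROWTH_SAMPLES} "
                  f"consecutive samples ({rss / 2**20:.0f} MiB); "
                  "possible snapshot-failure condition")
        time.sleep(SAMPLE_INTERVAL_S)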