microsoft / service-fabric-issues

This repo is for the reporting of issues found with Azure Service Fabric.

System.FormatException in System.Fabric.Chaos.DataStructures.ChaosSchedule.Read #1003

Closed crowbar27 closed 6 years ago

crowbar27 commented 6 years ago

After reinstalling my Service Fabric cluster for the fourth time, it worked over the whole weekend and this morning. However, since noon, FabricFAS.exe has been crashing every four minutes or so, leaving the following event log entries:

Event Source: Microsoft-Service Fabric Event ID: 62976 Details:

Exception 'System.FormatException: The DateTime represented by the string is out of range.
   at System.DateTimeParse.Parse(String s, DateTimeFormatInfo dtfi, DateTimeStyles styles)
   at System.Fabric.Chaos.DataStructures.ChaosSchedule.Read(BinaryReader br)
   at System.Fabric.Chaos.DataStructures.ChaosScheduleDescription.Read(BinaryReader br)
   at System.Fabric.ByteSerializable.FromBytes(Byte[] data)' occurred while de-serializing 'System.Fabric.Chaos.DataStructures.ChaosScheduleDescription'.

Followed by

Event Source: .NET Runtime Event ID: 1025 Details:

Application: FabricFAS.exe
Framework Version: v4.0.30319
Description: The application requested process termination through System.Environment.FailFast(string message).
Message: RunAsync failed due to an unhandled exception causing the host process to crash: System.FormatException: The DateTime represented by the string is out of range.
   at System.DateTimeParse.Parse(String s, DateTimeFormatInfo dtfi, DateTimeStyles styles)
   at System.Fabric.Chaos.DataStructures.ChaosSchedule.Read(BinaryReader br)
   at System.Fabric.Chaos.DataStructures.ChaosScheduleDescription.Read(BinaryReader br)
   at System.Fabric.ByteSerializable.FromBytes(Byte[] data)
   at System.Fabric.FaultAnalysis.Service.Chaos.ChaosScheduler.<TryRecoveryFromSchedule>d__34.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Fabric.FaultAnalysis.Service.Chaos.ChaosScheduler.<RestartRecoveryAsync>d__33.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Fabric.FaultAnalysis.Service.Chaos.ChaosScheduler.<InitializeAsync>d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Fabric.FaultAnalysis.Service.FaultAnalysisService.<InitializeAsync>d__21.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Fabric.FaultAnalysis.Service.FaultAnalysisService.<RunAsync>d__18.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ServiceFabric.Services.Runtime.StatefulServiceReplicaAdapter.<ExecuteRunAsync>d__22.MoveNext()
Stack:
   at System.Environment.FailFast(System.String)
   at System.Threading.Tasks.Task.Execute()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)
   at System.Threading.Tasks.Task.ExecuteEntry(Boolean)
   at System.Threading.ThreadPoolWorkQueue.Dispatch()

Followed by

Event source: Application Error Event ID: 1000 Details:

Faulting application name: FabricFAS.exe, version: 6.2.262.9494, time stamp: 0x5ad0782a
Faulting module name: mscorlib.ni.dll, version: 4.7.2117.0, time stamp: 0x59cf513d
Exception code: 0x80131623
Fault offset: 0x000000000052d436
Faulting process id: 0xb6c
Faulting application start time: 0x01d3db0a6818ef71
Faulting application path: C:\ProgramData\SF\vesta1\Fabric\work\Applications\__FabricSystem_App4294967295\FAS.Code.Current\FabricFAS.exe
Faulting module path: C:\windows\assembly\NativeImages_v4.0.30319_64\mscorlib\6b278bb41b219b5d3ea584606329e448\mscorlib.ni.dll
Report Id: 8efd95ce-973e-4a0f-a5db-eda837506f9a
Faulting package full name: 
Faulting package-relative application ID: 

Any ideas what causes this issue? I am not aware of using any of the Chaos cmdlets ...

nembo81 commented 6 years ago

Hi, we have the exact same issue. We upgraded our 8-node cluster to the latest version (6.2.262.9494) last week and noticed the issue after rebooting a few nodes for Windows updates. I already tried restarting all nodes and executing a few PowerShell cmdlets such as "Repair-ServiceFabricPartition -System", without luck. Is it a bug?

crowbar27 commented 6 years ago

I have also tried Repair-ServiceFabricPartition several times, but it does not seem to help. I believe (though I am not completely sure) that the "Ready" partition sometimes moves from one node to another, but the other ones are always "Down".

nembo81 commented 6 years ago

We tried it too, but nothing happened. The service stops on every node in an infinite loop.

stofte commented 6 years ago

I'm seeing the same thing. The cluster was fine for weeks, only having the traditional permanently "sick" DnsService (which seems to have no impact unless you use the obscure DNS feature of SF, and who would?). I upgraded to the latest version yesterday (I know, insane optimism), and now I'm seeing a sick FaultAnalysisService, which I assume is related to this "ChaosScheduler".

@Microsoft Please schedule less chaos on my dev machine.

anmolah commented 6 years ago

@motanv please take a look

likevi-MSFT commented 6 years ago

I'm looking into this issue right now.

crowbar27 commented 6 years ago

@nembo81 I noticed the Italian (?) response header at the bottom of your last post. Do you have an Italian date format configured on your cluster? Mine is German although the OS language is English, and I think we use the same formatting for dates.

likevi-MSFT commented 6 years ago

@crowbar27 @nembo81 @stofte Could you provide me some more information about the cluster that was experiencing this problem?

  1. Where was the cluster located? (local one box, Azure, standalone)
  2. What version did you upgrade from?
  3. Can you provide me access to the environment where this occurred?
  4. What is the datetime format of your local environment? @nembo81, @crowbar27

nembo81 commented 6 years ago

Hi, you are right, I have an English Win2016 with Italian culture configured. I suspected the same because of that .NET error. I never changed it, nor have I tried to change it now... did you?

likevi-MSFT commented 6 years ago

I've reproduced the error on my machine and I'm working on a fix now.

likevi-MSFT commented 6 years ago

@nembo81 @stofte @crowbar27 What kind of clusters are you using? (production, test, development) We are trying to understand the impact of this bug.

To anyone else facing this issue, please let me know as well.

likevi-MSFT commented 6 years ago

If you need a mitigation for this issue, please send me a private direct message.

nembo81 commented 6 years ago

@likevi-MSFT I have an 8-node dev/test cluster:

  1. On-premises
  2. I upgraded from version 6.1.480.9494
  3. If you still need access, we can arrange it.
  4. Italian format: short date dd/MM/yyyy, long date dddd d MMMM yyyy.

likevi-MSFT commented 6 years ago

@nembo81 Thank you. We have a fix for this issue in the next release.

crowbar27 commented 6 years ago

@likevi-MSFT

I have a three-node cluster as installed by HPC Pack 2016 (although I had to re-install SF manually from the standalone package).

  1. On-prem
  2. Fresh install from http://go.microsoft.com/fwlink/?LinkId=730690
  3. Probably yes, but it would require some effort.
  4. English UI, but German home location and format. The short date format is dd.MM.yyyy, long date format dddd, d. MMMM yyyy, short time HH:mm, long time HH:mm:ss.

likevi-MSFT commented 6 years ago

@crowbar27 Thank you. We have fixed the problem for the next release. It was a locale issue involving date formats.

All clusters on machines with locales whose date formats place the month after the day are impacted.

If you need a mitigation for this issue, message me directly and I will walk you through the steps.
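The failure mode described above can be sketched in a few lines. (Python is used here purely for illustration; Service Fabric itself is .NET, where the culture-invariant equivalent would be serializing with the round-trip `"o"` format and `CultureInfo.InvariantCulture`. The exact format strings below are assumptions for the sketch, not the ones FabricFAS uses.)

```python
from datetime import datetime

# A writer running under a day-first locale (e.g. it-IT or de-DE)
# serializes 25 April 2018 as "25/04/2018":
serialized = datetime(2018, 4, 25).strftime("%d/%m/%Y")

# A reader that assumes a month-first (en-US style) format cannot
# interpret 25 as a month, so parsing raises an error -- the analogue
# of the System.FormatException in ChaosSchedule.Read:
try:
    datetime.strptime(serialized, "%m/%d/%Y")
    parsed_ok = True
except ValueError:
    parsed_ok = False

print(parsed_ok)  # False: the round trip breaks across locales

# Serializing in a culture-invariant format (e.g. ISO 8601)
# round-trips regardless of the machine's locale:
iso = datetime(2018, 4, 25).isoformat()
assert datetime.fromisoformat(iso) == datetime(2018, 4, 25)
```

This is also why the bug only bites on day-first locales: on month-first machines the writer's and reader's assumptions happen to agree.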

crowbar27 commented 6 years ago

@likevi-MSFT That depends on whether the steps involve anything other than changing the server's locale settings (I know how to do that), and on the result of another enquiry I have open. The HPC Pack 2016 application we have installed on the SF cluster is currently faulty, and I am waiting for instructions on that. If it requires reinstalling the application into SF, I would need to fix this issue first, because the HPC Pack installer does not work if SF is not 100% healthy.

likevi-MSFT commented 6 years ago

@crowbar27 Send me an email (it's in my bio) and I'll reply back with mitigation steps. I have not tested simply changing the locale of the host machine; I'll give that a try and let you know. The current mitigation steps involve setting the schedule to have dates that can be interpreted as either dd-MM or MM-dd.

Edit: Setting the format to English (United States) under Clock, Language, and Region -> Change date, time, or number formats should work. A machine reboot may be needed.

nembo81 commented 6 years ago

@likevi-MSFT I just tried setting the format to English (United States) in Clock, Language, and Region -> Change date, time, or number formats, rebooted all the nodes, and executed "Repair-ServiceFabricPartition -System", but the problem still remains. Do you have any other ideas? PS: As I wrote you by email, it is not a prod env.

radderz commented 6 years ago

@likevi-MSFT I have a prod cluster where this is occurring; it is maxing out the logging drive but still working. It is adding 250 MB files every minute, so there is not much I can do to mitigate this one. We use NZT; how would we apply a mitigation to reduce the error dumps until the new version?

likevi-MSFT commented 6 years ago

@radderz Send me an email at likevi [at] microsoft.com and I'll reply with the steps.

likevi-MSFT commented 6 years ago

@nembo81 Can you send me an email at likevi [at] microsoft.com? I'll reply with alternative mitigation steps. Could you include your GitHub handle in your email as well?

likevi-MSFT commented 6 years ago

This issue has been fixed in the latest version of 6.2. Updating the cluster will prevent this issue from occurring.