Closed crowbar27 closed 6 years ago
Hi, we have the identical issue. We upgraded our 8-node cluster to the latest version (6.2.262.9494) last week and noticed the issue after rebooting a few nodes for Windows updates. I have already tried restarting all nodes and executing a few PowerShell cmdlets such as "Repair-ServiceFabricPartition -System", without luck. Is it a bug?
I have also tried Repair-ServiceFabricPartition several times, but it does not seem to help. I believe (but I am not completely sure) that the "Ready" partition sometimes moves from one node to another, but the other ones are always "down".
We tried it too, but nothing happened. The service stops on every node in an infinite loop.
Il mar 24 apr 2018, 17:41 Christoph notifications@github.com ha scritto:
I'm seeing the same thing. Cluster was fine for weeks, only having the traditional permanently "sick" DnsService (which seems to have no impact unless you use the obscure dns feature of SF and who would). Upgraded to latest version yesterday (I know, insane optimism), and now I'm seeing a sick FaultAnalysisService, which I assume is related to this "ChaosScheduler".
@Microsoft Please schedule less chaos on my dev machine.
@motanv please take a look
I'm looking into this issue right now.
@nembo81 I noticed the Italian (?) response header at the bottom of your last post. Do you have an Italian date format configured on your cluster? Mine is German although the OS language is English, and I think we use the same formatting for dates.
@crowbar27 @nembo81 @stofte Could you provide me some more information about the cluster that was experiencing this problem?
Hi, you are right, I have an English Win2016 install with Italian culture configured. I suspected the same because of that .NET error. I never changed it, nor have I tried to change it now... did you?
I've reproduced the error on my machine and I'm working on a fix now.
@nembo81 @stofte @crowbar27 What kind of clusters are you using? (production, test, development) We are trying to understand the impact of this bug.
To anyone else facing this issue, please let me know as well.
If you need a mitigation for this issue, please send me a private direct message.
@likevi-MSFT I have an 8-node dev/test cluster: 1) on-premises; 2) upgraded from version 6.1.480.9494; 3) if you still need access, we can arrange it; 4) Italian format, short date dd/MM/yyyy, long date dddd d MMMM yyyy.
@nembo81 Thank you. We have a fix for this issue in the next release.
@likevi-MSFT I have a three-node cluster as installed by HPC Pack 2016 (although I had to re-install SF manually from the standalone package). Short date format dd.MM.yyyy, long date format dddd, d. MMMM yyyy, short time HH:mm, long time HH:mm:ss.
@crowbar27 Thank you. We have fixed the problem for the next release. It was a locale issue involving date formats.
All clusters on machines with locales whose date formats put the month after the day are impacted.
If you need a mitigation for this issue, message me directly and I will walk you through the steps.
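For readers unfamiliar with this class of bug, here is a minimal Python sketch of how a day/month order mismatch can break date handling. It is not the Service Fabric code, and the direction of the mismatch (written month-first, read day-first) is an assumption made for illustration. The names `round_trip`, `US_FORMAT`, and `DAY_FIRST_FORMAT` are hypothetical.

```python
from datetime import datetime

# Hypothetical formats: one month-first, one day-first (as in it-IT or de-DE).
US_FORMAT = "%m/%d/%Y %H:%M:%S"
DAY_FIRST_FORMAT = "%d/%m/%Y %H:%M:%S"


def round_trip(moment, write_format, read_format):
    """Serialize with one format, parse with another; None if parsing fails."""
    text = moment.strftime(write_format)
    try:
        return datetime.strptime(text, read_format)
    except ValueError:
        return None


late_april = datetime(2018, 4, 24, 17, 41, 0)
early_april = datetime(2018, 4, 5, 9, 0, 0)

# Day 24 cannot be read as a month, so parsing fails outright.
print(round_trip(late_april, US_FORMAT, DAY_FIRST_FORMAT))
# Day 5 is a valid month, so parsing succeeds but yields 4 May, not 5 April.
print(round_trip(early_april, US_FORMAT, DAY_FIRST_FORMAT))
# Matching formats round-trip cleanly.
print(round_trip(late_april, US_FORMAT, US_FORMAT))
```

The second case is the more dangerous one in general, since the wrong date is accepted without any error.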
@likevi-MSFT That depends on whether the steps are other than changing the server's locale settings (I know how to do that) and on the result of another enquiry I have. The HPC Pack 2016 application we have installed in the SF is currently faulty and I am waiting for some instructions on that. If it requires reinstalling the application into SF, I would need to fix the issue, because the HPC Pack installer does not work if the SF is not 100% healthy.
@crowbar27 Send me an email (it's in my bio) and I'll reply with mitigation steps. I have not tested simply changing the locale of the host machine. I'll give that a try and let you know. The current mitigation steps involve setting the schedule to have dates that can be interpreted as either dd-mm or mm-dd.
Edit: Setting the format to English (United States) in Clock, Language, and Region -> Change date, time, or number formats should work. A machine reboot may be needed.
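The actual mitigation steps were shared privately and are not in this thread. As a rough illustration of what "dates that can be interpreted as either dd-mm or mm-dd" means, the following Python sketch checks whether a date still parses when day and month are swapped. The function names are hypothetical and nothing here touches the Service Fabric API.

```python
from datetime import date


def parses_under_both_orders(d: date) -> bool:
    """True if the day number is also a valid month number (1..12)."""
    return d.day <= 12


def same_under_both_orders(d: date) -> bool:
    """True if swapping day and month gives the very same date."""
    return d.day == d.month


print(parses_under_both_orders(date(2018, 4, 24)))  # day 24 is not a month
print(parses_under_both_orders(date(2018, 4, 5)))   # parses, but as 4 May
print(same_under_both_orders(date(2018, 5, 5)))     # identical either way
```

A date that merely parses under both orders can still be read as a different day, so only dates where day equals month are fully order-independent.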
@likevi-MSFT I just tried "Setting the format to English (United States) in Clock, Language, and Region -> Change date, time, or number formats should work. A machine reboot may be needed.", rebooted all the nodes and executed "Repair-ServiceFabricPartition -System", but the problem remains. Do you have any ideas? PS: As I wrote to you by email, this is not a prod environment.
@likevi-MSFT I have a prod cluster where this occurs. It is still working, but it is filling up the logging drive: it adds 250 MB files every minute, so there is not much I can do to mitigate this one. We use NZT. How would we apply a mitigation to reduce the error dumps until the new version?
@radderz send me an email. likevi [at] microsoft.com and I'll reply with the steps.
@nembo81 Can you send me an email? likevi [at] microsoft.com. I'll reply with alternative mitigation steps. Could you include your GitHub handle in your email as well?
This issue has been fixed in the latest version of 6.2. Updating the cluster will prevent this issue from occurring.
After reinstalling my service fabric cluster for the fourth time, it worked over the whole weekend and this morning. However, since noon FabricFAS.exe crashes every four minutes or so, leaving the following event log entries:
Event Source: Microsoft-Service Fabric Event ID: 62976 Details:
Followed by
Event Source: .NET Runtime Event ID: 1025 Details:
Followed by
Event source: Application Error Event ID: 1000 Details:
Any ideas what causes this issue? I am not aware of using any of the Chaos cmdlets ...