microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License

No Activity from Start-ServiceFabricClusterConfigurationUpgrade #812

Open AimanJaouhar opened 6 years ago

AimanJaouhar commented 6 years ago

Hi,

When I run Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath .\ClusterConfig1.0.2.json on a standalone 5-node Windows Server 2012 R2 cluster (version 6.1.456.9494), the command returns quickly but no upgrade activity is performed.
The aim was to increase the KtlLogger's SharedLogSizeInMB parameter from 2048 to 4096.

A previous upgrade using a similar cluster config succeeded; that one decreased the same parameter from 8192 to 2048.
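One way to confirm exactly what differs between the two configs before submitting an upgrade is to diff their fabricSettings sections. The following is a minimal sketch, assuming the standalone cluster config layout in which settings live under properties -> fabricSettings as sections with name/value parameters. The function names and the inline sample configs are hypothetical, and the sketch says nothing about why the cluster ignored the upgrade.

```python
import json

def flatten_fabric_settings(config):
    """Map (section name, parameter name) -> value for one cluster config."""
    flat = {}
    for section in config.get("properties", {}).get("fabricSettings", []):
        for param in section.get("parameters", []):
            flat[(section["name"], param["name"])] = param["value"]
    return flat

def diff_fabric_settings(old_config, new_config):
    """Return {(section, parameter): (old, new)} for every setting that differs."""
    old = flatten_fabric_settings(old_config)
    new = flatten_fabric_settings(new_config)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(old.keys() | new.keys())
        if old.get(key) != new.get(key)
    }

# Inline stand-ins for the two files; real configs would be loaded with json.load().
OLD_JSON = """{"properties": {"fabricSettings": [
  {"name": "KtlLogger",
   "parameters": [{"name": "SharedLogSizeInMB", "value": "2048"}]}]}}"""
NEW_JSON = """{"properties": {"fabricSettings": [
  {"name": "KtlLogger",
   "parameters": [{"name": "SharedLogSizeInMB", "value": "4096"}]}]}}"""

changes = diff_fabric_settings(json.loads(OLD_JSON), json.loads(NEW_JSON))
for (section, name), (before, after) in changes.items():
    print(f"{section}/{name}: {before} -> {after}")
```

With the sample data this prints KtlLogger/SharedLogSizeInMB: 2048 -> 4096. An empty result would mean the new file carries no settings change at all.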

Is there a way to get more details on this issue? There are no error messages, logs, etc.

Thank you.

rakshitatandon commented 6 years ago

Can you share your old and new cluster configs as well as the traces from FabricLogRoot? You can send them to ratando@microsoft.com. We are working on improving the diagnostics experience.

qmarc commented 6 years ago

There appear to be a few outstanding issues here. They could be a combination of my own errors and Microsoft documentation that poorly distinguishes which functionality works on-premises versus in the Azure cloud, or the functionality may simply not exist for on-premises cluster deployments. The online docs do not explain how to define some elements. One example is the KtlLogger SharedLogId: is the GUID randomly generated? Is it meant to be associated with something known internally to Service Fabric? What is the GUID based on?

        {
          "name": "KtlLogger",
          "parameters": [
            {
                "name": "SharedLogSizeInMB",
                "value": "8192"
            },
            {
                "name":"SharedLogId",
                "value":"{dc55066f-e180-4806-9840-3415ef71ffeb}"
            },
            {
                "name":"SharedLogPath",
                "value":"h:\\replicatorshared.log"
            }
          ]
        }

I used that together with SharedLogPath, which the online doc describes as a recommended step for better efficiency, as quoted here:

The SharedLogId and SharedLogPath settings are always used together to make a service use a separate shared log from the default shared log for the node. For best efficiency, as many services as possible should specify the same shared log

Putting this into practice as shown above, however, throws exceptions in the cluster services:

Re-throwing Unexpected Exception in EnsureSharedLogContainerAsync: Failed to create h:\replicatorshared.log logID: dc55066f-e180-4806-9840-3415ef71ffeb. Type: System.IO.FileNotFoundException Message: The system cannot find the file specified. (Exception from HRESULT: 0x80070002) HResult: -2147024894 Stack:    at System.Fabric.Data.Log.Interop.NativeLog.IKPhysicalLogManager.EndCreateLogContainer(IFabricAsyncOperationContext Context, IKPhysicalLogContainer& Result)
   at System.Fabric.Data.Log.Interop.PhysicalLogManager.<CreateContainerAsync>b__8_1(IFabricAsyncOperationContext Context)
   at System.Fabric.Interop.AsyncCallOutAdapter2`1.Finish(IFabricAsyncOperationContext context, Boolean expectedCompletedSynchronously)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Fabric.Data.Log.LogManager.<OnCreatePhysicalLogAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ServiceFabric.Replicator.KtlLogManager.<EnsureSharedLogContainerAsync>d__22.MoveNext()

and

Exception in OpenAsync.  Type: System.IO.FileNotFoundException Message: The system cannot find the file specified. (Exception from HRESULT: 0x80070002) HResult: 0x80070002. Stack Trace:    at System.Fabric.Data.Log.Interop.NativeLog.IKPhysicalLogManager.EndCreateLogContainer(IFabricAsyncOperationContext Context, IKPhysicalLogContainer& Result)
   at System.Fabric.Data.Log.Interop.PhysicalLogManager.<CreateContainerAsync>b__8_1(IFabricAsyncOperationContext Context)
   at System.Fabric.Interop.AsyncCallOutAdapter2`1.Finish(IFabricAsyncOperationContext context, Boolean expectedCompletedSynchronously)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Fabric.Data.Log.LogManager.<OnCreatePhysicalLogAsync>d__16.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ServiceFabric.Replicator.KtlLogManager.<EnsureSharedLogContainerAsync>d__22.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ServiceFabric.Replicator.KtlLogManager.<InitializeAsync>d__17.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ServiceFabric.Replicator.LoggingReplicator.<OpenAsync>d__141.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ServiceFabric.Replicator.LoggingReplicator.<OpenAsync>d__140.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ServiceFabric.Replicator.DynamicStateManager.<OpenAsync>d__116.MoveNext().

By the way, the H: drive for us is a globally mapped SMB drive that points to a CSV. The same drive works perfectly for the Diagnostics connection string!
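The rule in the quoted documentation (SharedLogId and SharedLogPath are always used together) can be checked mechanically before a config is submitted. The following is a minimal sketch under that assumption; check_shared_log_settings is a hypothetical helper, not part of any Service Fabric tooling. It only verifies that the two settings appear as a pair and that the ID is a well-formed GUID. It does not answer what the GUID should be based on, and it cannot tell whether the path is reachable from the account the cluster services run under.

```python
import uuid

def check_shared_log_settings(ktl_logger_section):
    """Return a list of problems found in a KtlLogger settings section."""
    params = {p["name"]: p["value"] for p in ktl_logger_section.get("parameters", [])}
    problems = []
    has_id = "SharedLogId" in params
    has_path = "SharedLogPath" in params
    if has_id != has_path:
        # Per the quoted documentation, the two settings only make sense as a pair.
        problems.append("SharedLogId and SharedLogPath must be specified together")
    if has_id:
        try:
            # uuid.UUID accepts the braced form used in the config above.
            uuid.UUID(params["SharedLogId"])
        except ValueError:
            problems.append("SharedLogId is not a well-formed GUID")
    return problems

# The KtlLogger section from this comment, as a Python dict.
ktl_logger = {
    "name": "KtlLogger",
    "parameters": [
        {"name": "SharedLogSizeInMB", "value": "8192"},
        {"name": "SharedLogId", "value": "{dc55066f-e180-4806-9840-3415ef71ffeb}"},
        {"name": "SharedLogPath", "value": "h:\\replicatorshared.log"},
    ],
}

problems = check_shared_log_settings(ktl_logger)
print(problems or "no structural problems found")
```

The section shown above passes both checks, which suggests the FileNotFoundException is not caused by the shape of the settings themselves.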

Common names are another issue. Again, best practices for a production environment suggest using common names rather than thumbprints in the cluster configuration file so that it is easier to roll over certificates. However, I cannot get it to work. This applies to the cluster, server, and client certificates. Here is a snippet from my config file showing how I have defined the client certificates (clearly, defining both thumbprint and common name should not be expected):

"ClientCertificateThumbprints": [
    {
        "CertificateThumbprint": "6599ff2988384bf4ae204020e31b715d",
        "IsAdmin": true
    },
    {
        "CertificateThumbprint": "09ab452ad69a4ff38f412f686b3b376c",
        "IsAdmin": false
    }
],
"ClientCertificateCommonNames": [
  {
      "CertificateCommonName": "CN=admin@ourdomain.com.au.local",
      "CertificateIssuerThumbprint": "02f5e385f3a14239aef0825c0a99f8f1",
      "IsAdmin": true
  },
  {
      "CertificateCommonName": "CN=readonly@ourdomain.com.au.local",
      "CertificateIssuerThumbprint": "02f5e385f3a14239aef0825c0a99f8f1",
      "IsAdmin": false
  }
]

Fabric settings -> Setup -> FabricLogRoot: why does this have to write to a local disk only? All these logs cause space issues on the individual nodes. Is there any reason it cannot use a shared resource, as Diagnostics -> Connectionstring can?

And like the person who started this thread, when I update anything at all in my cluster configuration, nothing happens. I took the original config, added the extra properties, incremented clusterConfigurationVersion to 1.0.2, called Start-ServiceFabricClusterConfigurationUpgrade, and then checked the progress of the upgrade with Get-ServiceFabricClusterConfigurationUpgradeStatus. The result looks like:

[image: screenshot of the Get-ServiceFabricClusterConfigurationUpgradeStatus output]

Edit: It appears that Start-ServiceFabricClusterConfigurationUpgrade (or a cluster upgrade in general) cannot be run remotely using PowerShell. I had to add the admin client certificate on a node and run the command from there in a Remote Desktop session; then the upgrade status changed.
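For the workflow described in this comment (take the original config, add properties, increment clusterConfigurationVersion), a quick pre-flight check that the version really was bumped can rule out one possible cause of a no-op. The following is a minimal sketch, assuming versions are dotted integers such as 1.0.2. Whether Service Fabric orders versions this way is an assumption, the helper names are hypothetical, and the old version 1.0.1 is made up for illustration.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.0.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

def is_version_bumped(old_config, new_config):
    """True if the new config's clusterConfigurationVersion is strictly greater."""
    old = parse_version(old_config["clusterConfigurationVersion"])
    new = parse_version(new_config["clusterConfigurationVersion"])
    # Tuples compare component by component, so 1.0.10 ranks above 1.0.9,
    # which a plain string comparison would get wrong.
    return new > old

# Hypothetical old version; only the 1.0.2 target comes from the comment above.
old_cfg = {"clusterConfigurationVersion": "1.0.1"}
new_cfg = {"clusterConfigurationVersion": "1.0.2"}

bumped = is_version_bumped(old_cfg, new_cfg)
print("version bumped" if bumped else "version not bumped")
```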