microsoft / service-fabric-issues

This repo is for the reporting of issues found with Azure Service Fabric.
168 stars 21 forks source link

Unhelpful exception in Start-ServiceFabricApplicationUpgrade #1306

Closed JefSchraag closed 6 years ago

JefSchraag commented 6 years ago

While trying to upgrade my application using the Start-ServiceFabricApplicationUpgrade, I get this exception:

Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/IbeMondayBlue -ApplicationTypeVersion 1841.0 -UnmonitoredManual -ApplicationParameter $params

Start-ServiceFabricApplicationUpgrade : aka.ms/upgrade-defaultservices
At line:1 char:5
+     Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/Ib ...
+     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (Microsoft.Servi...usterConnection:ClusterConnection) [Start-ServiceFabricApplicationUpgrade], FabricException
    + FullyQualifiedErrorId : UpgradeApplicationErrorId,Microsoft.ServiceFabric.Powershell.StartApplicationUpgrade

The URL aka.ms/upgrade-defaultservices takes me to a general article mentioning which service parameter changes are supported.

The problem is that I have not changed any default service in this upgrade. Also, the application parameters I am passing as the -ApplicationParameter argument, are exactly the same as the current parameters (checked via Get-ServiceFabricApplication -ApplicationName fabric:/IbeMondayBlue)

JefSchraag commented 6 years ago

I just tried to reproduce the problem in another application instance. I expected this sequence of commands to result in the same exception.

First

New-ServiceFabricApplication -ApplicationName fabric:/IbeMondayBlueAlt -ApplicationTypeName IbeType -ApplicationTypeVersion 1840.0 -ApplicationParameter $params

Then

Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/IbeMondayBlueAlt -ApplicationTypeVersion 1841.0 -UnmonitoredManual -ApplicationParameter $params

...
TargetApplicationTypeVersion  : 1841.0
UpgradeKind                   : Rolling
ForceRestart                  : False

In this case, the upgrade started as expected. Leaving me confused.

motanv commented 6 years ago

Hello @JefSchraag, Just to make sure, You issued the following two cmdlets and got the exception: (1) New-ServiceFabricApplication -ApplicationName fabric:/IbeMondayBlue -ApplicationTypeName IbeType -ApplicationTypeVersion 1840.0 -ApplicationParameter $params

(2) Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/IbeMondayBlue -ApplicationTypeVersion 1841.0 -UnmonitoredManual -ApplicationParameter $params

Then the only change you made is to the instance name of the application and was able to get the upgrade started with the following two cmdlets

(1) New-ServiceFabricApplication -ApplicationName fabric:/IbeMondayBlueAlt -ApplicationTypeName IbeType -ApplicationTypeVersion 1840.0 -ApplicationParameter $params

(2) Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/IbeMondayBlueAlt -ApplicationTypeVersion 1841.0 -UnmonitoredManual -ApplicationParameter $params

Is that correct, specifically you did not change anything except for the instance name?

Also could you please share the 1840.0 and 1841.0 application manifests and the ApplicationParameters ($params) that you are passing in?

Can you reproduce the exception with the instance name fabric:/IbeMondayBlue, if yes, could you use -Verbose switch in the Start-ServiceFabricApplicationUpgrade cmdlet so that we may get more details on the exception?

Thanks, Tanvir

JefSchraag commented 6 years ago

Hi @motanv,

The fabric:/IbeMondayBlue application was instantiated many versions ago. It was not newly created at version 1840.0, it was upgraded from another version to get to 1840.0.

And indeed, the fabric:/IbeMondayBlueAlt version was newly instantiated at version 1840.0.

These are the application manifest versions and parameters file.

And this is the output of the Start-ServiceFabricApplicationUpgrade command with the -Verbose flag:

Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/IbeMondayBlue -ApplicationTypeVersion 1841.0 -ApplicationParameter $params -UnmonitoredManual -Verbose
VERBOSE: System.Fabric.FabricException: aka.ms/upgrade-defaultservices ---> System.Runtime.InteropServices.COMException: Exception from HRESULT: 0x80071BDF
   at System.Fabric.Interop.NativeClient.IFabricApplicationManagementClient10.EndUpgradeApplication(IFabricAsyncOperationContext context)
   at System.Fabric.Interop.Utility.<>c__DisplayClass22_0.<WrapNativeAsyncInvoke>b__0(IFabricAsyncOperationContext context)
   at System.Fabric.Interop.AsyncCallOutAdapter2`1.Finish(IFabricAsyncOperationContext context, Boolean expectedCompletedSynchronously)
   --- End of inner exception stack trace ---
VERBOSE: System.Management.Automation.PipelineStoppedException: The pipeline has been stopped.
   at System.Management.Automation.MshCommandRuntime.ThrowTerminatingError(ErrorRecord errorRecord)
   at System.Management.Automation.Cmdlet.ThrowTerminatingError(ErrorRecord errorRecord)
   at Microsoft.ServiceFabric.Powershell.ApplicationCmdletBase.<>c__DisplayClass10_0.<UpgradeApplication>b__0(Exception ae)
   at System.AggregateException.Handle(Func`2 predicate)
   at Microsoft.ServiceFabric.Powershell.ApplicationCmdletBase.UpgradeApplication(ApplicationUpgradeDescription upgradeDescription)
   at Microsoft.ServiceFabric.Powershell.StartApplicationUpgrade.ProcessRecord()
Start-ServiceFabricApplicationUpgrade : aka.ms/upgrade-defaultservices
At line:1 char:5
+     Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/Ib ...
+     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (Microsoft.Servi...usterConnection:ClusterConnection) [Start-ServiceFabricApplicationUpgrade], FabricException
    + FullyQualifiedErrorId : UpgradeApplicationErrorId,Microsoft.ServiceFabric.Powershell.StartApplicationUpgrade

Thanks for looking into this,

Jef Schraag

motanv commented 6 years ago

@JefSchraag, thanks for sharing the application manifests. Between the two versions of the application, was there any change in any of the service manifests belonging to one of the default services listed in the application manifest?

motanv commented 6 years ago

@JefSchraag, the reason for asking about the service manifests is that recently there was an issue where customer did not think he changed the default service description because he did not touch the application manifest, but he did modify the placement constraints in the service manifest (for the default service). If you did not change the service manifests either, then I would need the Service Fabric traces to root cause the issue and provide a mitigation.

JefSchraag commented 6 years ago

Hi @motanv , the only service manifest change between 1840 and 1841 is a version number change in the version number on the <ServiceManifest> element and its child <CodePackage>.

A summary of what is going on.

case old new outcome
old instance was upgrade from an even older instance 1840.0 1841.0 Start-ServiceFabricApplicationUpgrade fails without clear reason
old instance created from scratch 1840.0 1841.0 Start-ServiceFabricApplicationUpgrade runs as expected
motanv commented 6 years ago

Hello @JefSchraag , Could you please share the sf logs with us ? If this is a Windows one-box cluster, you may find them in “Traces” folder like C:\SfDevCluster\Log\Trace and they look like “fabric_traces_6.3.187.9494_131848675483647056_0.etl”

C:\Users\motanv>reg query "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Service Fabric"

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Service Fabric FabricRoot REG_EXPAND_SZ C:\Program Files\Microsoft Service Fabric FabricBinRoot REG_EXPAND_SZ C:\Program Files\Microsoft Service Fabric\bin FabricCodePath REG_EXPAND_SZ C:\Program Files\Microsoft Service Fabric\bin\Fabric\Fabric.Code FabricDataRoot REG_EXPAND_SZ C:\SfDevCluster\Data FabricLogRoot REG_EXPAND_SZ C:\SfDevCluster\Log FabricVersion REG_EXPAND_SZ 6.3.187.9494 CompatibilityJsonPath REG_EXPAND_SZ C:\Program Files\Microsoft Service Fabric\bin\fabric\fabric.code\SdKRuntimeCompatibility.json UseFabricInstallerSvc REG_DWORD 0x1 EnableCircularTraceSession REG_DWORD 0x1 NodeLastBootUpTime REG_SZ 10/19/2018 5:27:54 PM FabricDnsServerIPAddress REG_SZ 10.31.60.134

What is the Service Fabric runtime version that you are using?

Thanks, Tanvir

JefSchraag commented 6 years ago

Hi Tanvir,

The logs you requested are available here: https://1drv.ms/u/s!Am3hNHZWyp0auvVjCdK1vpNkX1WJEQ

I took the logs from the machine in the cluster that acted as the gateway (GatewayInformation data from Connect-ServiceFabricCluster)

I issued the Start-ServiceFabricApplicationUpgrade command at:

Jef

motanv commented 6 years ago

Thanks @JefSchraag, looks like you are on SF runtime 6.3 CU1 (6.3.176) I am actually more interested in the traces from the node where the system service ClusterManager (CM) primary lives, the traces you shared does not looks like from the node where CM primary was.

If this is a cluster you created from portal.azure.com (or you have deployed via ARM template) you could get the Cluster Resouce ID and region of the cluster from the portal.azure.com and share with me so that I can create a SAS (Shared Access Signature) key.

JefSchraag commented 6 years ago

Hi Tanvir,

The logs from the machine hosting the primary CM are available here: https://1drv.ms/u/s!Am3hNHZWyp0auvV6b2sEZe48o33y9A

I noticed this warning in those logs:


<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-ServiceFabric" Guid="{cbd93bc2-71e5-4566-b3a7-595d8eeca6e8}" />
    <EventID>29441</EventID>
    <Version>1</Version>
    <Level>3</Level>
    <Task>115</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000001</Keywords>
    <TimeCreated SystemTime="2018-11-02T07:17:57.811704500Z" />
    <EventRecordID>170323</EventRecordID>
    <Correlation />
    <Execution ProcessID="60704" ThreadID="34108" ProcessorID="0" KernelTime="6581" UserTime="90975" />
    <Channel>
    </Channel>
    <Computer>NTA-WS-15.netmatch.local</Computer>
    <Security />
  </System>
  <EventData>
    <Data Name="id">
    </Data>
    <Data Name="type">ApplicationUpgradeContext</Data>
    <Data Name="text">[(00000000-0000-0000-0000-000000002000:131611764421385883)+7ca7339e-6cd2-4010-a78f-190e6e5ff734:0] modified default service: fabric:/IbeMondayBlue/CatalogResolver ():IbeType_App390:CatalogResolverPkg:CatalogResolverType@fabric:/IbeMondayBlue [SharedProcess] (1.0:1.0:0) [1]/3/3 SP 0:0    ()  0 () 1 () false : (1, -1) (c7fcc222-90c7-43d6-8df0-d42e07aca8a8) - [0,0] () -&gt; fabric:/IbeMondayBlue/CatalogResolver ():IbeType_App390:CatalogResolverPkg:CatalogResolverType@fabric:/IbeMondayBlue [SharedProcess] (1.0:1.0:0) [1]/3/3 SP 0:0    ()  0 () 0 () false : (1, -1) (4bbc3a9b-71e4-4be5-9823-1ffc39fb476c) - [0,0] ()</Data>
  </EventData>
</Event>

Looking at the log text, it looks like there is something misaligned for the IbeMondayBlue/CatalogResolver service.

fabric:/IbeMondayBlue/CatalogResolver ():IbeType_App390:CatalogResolverPkg:CatalogResolverType@fabric:/IbeMondayBlue [SharedProcess] (1.0:1.0:0) [1]/3/3 SP 0:0    ()  0 () 1 () false : (1, -1) (c7fcc222-90c7-43d6-8df0-d42e07aca8a8) - [0,0] () -> 
fabric:/IbeMondayBlue/CatalogResolver ():IbeType_App390:CatalogResolverPkg:CatalogResolverType@fabric:/IbeMondayBlue [SharedProcess] (1.0:1.0:0) [1]/3/3 SP 0:0    ()  0 () 0 () false : (1, -1) (4bbc3a9b-71e4-4be5-9823-1ffc39fb476c) - [0,0] ()

I don't know the meaning of the different values.

fabric:/IbeMondayBlue/CatalogResolver ():IbeType_App390:CatalogResolverPkg:CatalogResolverType@fabric:/IbeMondayBlue [SharedProcess] (1.0:1.0:0) [1]/3/3 SP 0:0 () 0 () 0 () false : (1, -1) (4bbc3a9b-71e4-4be5-9823-1ffc39fb476c) - [0,0] ()

motanv commented 6 years ago

Thanks Jef, yes this is the trace I was looking for, it appears that DefaultMoveCost (https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-resource-manager-movement-cost) has changed (from 1 to 0) for the service fabric:/IbeMondayBlue/CatalogResolver, could you check if you are changing (or removing) it either in application/service manifest or programmatically?

JefSchraag commented 6 years ago

Hi Tanvir,

I checked and we have not changed any DefaultMoveCost setting. I would not even know how to do that in the application manifests default services.

Can it be that this setting was automatically applied by a system service?

motanv commented 6 years ago

Thanks Jef, could you also share the traces from the FM primary node?

motanv commented 6 years ago

Also, if you do not mind the decrease in DefaultMoveCost, you can set EnableDefaultServicesUpgrade to true (https://aka.ms/upgrade-defaultservices) in the cluster, afterwards the app upgrade should not fail. We still need to find why the DefaultMoveCost changed though, so would need traces from the FM primary node.

motanv commented 6 years ago

Hello @JefSchraag, since we have not heard from you for some time, I am closing this issue. If you still face the issue, please open a new issue.

JefSchraag commented 6 years ago

Hi @motanv,

I am still trying to find some time to make the upgrade work.

However, one of the issues that made this problem larger than needed, is that the exception message in the Start-ServiceFabricApplicationUpgrade command was not helpful at all. This has cost us a lot of time. It would be very helpful if the exception message at least made a clear statement that the upgrade is failing due to default service settings and perhaps refer to the logs for more details.

Thanks,

Jef Schraag

oanapl commented 6 years ago

@JefSchraag , we have improved the error message (we had a bug and the message was truncated due to an unsupported character). The change will be in our 6.4 release.