microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License

Placement Properties aren't evaluated correctly from Placement Constraints when name contains '-' dash #814

Open OlegKarasik opened 6 years ago

OlegKarasik commented 6 years ago

Consider the following definition of placement properties in ClusterConfig.json and ServiceManifest.xml.

ClusterConfig.json

"NodeTypes": [
  {
    "name": "NodeType1",
    "clientConnectionEndpointPort": "19000",
    "clusterConnectionEndpointPort": "19001",
    "leaseDriverEndpointPort": "19002",
    "serviceConnectionEndpointPort": "19003",
    "httpGatewayEndpointPort": "19080",
    "reverseProxyEndpointPort": "19081",
    "applicationPorts": {
      "startPort": "20001",
      "endPort": "20030"
    },
    "placementProperties": {
      "is-custom-property": true
    }
  }
]

ServiceManifest.xml

<ServiceManifest Name="ServicePkg" Version="1.0.0">
  <ServiceTypes>
    <StatelessServiceType ServiceTypeName="ServiceType">
      <PlacementConstraints>is-custom-property == True</PlacementConstraints>
    </StatelessServiceType>
  </ServiceTypes>
</ServiceManifest>

When the cluster configuration and the service are deployed to Service Fabric, no errors or warnings are reported, but no replicas are created for the service. If you rename is-custom-property to IsCustomProperty, everything works.

P.S. This was found on Azure Standalone Cluster for Windows Server version 6.1.467.9494

masnider commented 6 years ago

Hm. I just tried to repro this on my local cluster and cannot. The service is placed successfully and reports no errors. That said, I did this by updating the constraints of a service that I'd already created, not via the manifests.

Can you share the output of Get-ServiceFabricServiceDescription for the service that you created that isn't getting placed?

Also, how are you creating these services normally?

We don't normally recommend putting the placement constraints in the manifest because that couples the service -type- to a specific cluster (or clusters where the properties match). Usually it's better to parameterize them as a part of your default services descriptions or, better yet, to just create services imperatively.

OlegKarasik commented 6 years ago

Sorry for the late response. Today I reproduced the issue again. Here are more details:

Also, how are you creating these services normally?

I create services with placement constraints through the ApplicationManifest.xml configuration. It is worth mentioning that I do this on an empty cluster (no applications or services are deployed).

We don't normally recommend putting the placement constraints in the manifest because that couples the service -type- to a specific cluster (or clusters where the properties match). Usually it's better to parameterize them as a part of your default services descriptions or, better yet, to just create services imperatively.

I agree. Unfortunately, this is currently required (I will probably try to change it in the future, but for now this is how I have to do it).

Here is the snippet from ApplicationManifest.xml


<DefaultServices>
  <Service Name="(name)" ServicePackageActivationMode="ExclusiveProcess">
    <StatelessService ServiceTypeName="(type)" 
                      InstanceCount="[InstanceCount]">
      <SingletonPartition />
      <PlacementConstraints>Is-Gateway == true</PlacementConstraints>
    </StatelessService>
  </Service>
</DefaultServices>

Here is the snippet from ClusterConfig.json


"nodeTypes": [
  {
    "name": "NodeType0",
    "clientConnectionEndpointPort": "19000",
    "clusterConnectionEndpointPort": "19002",
    "leaseDriverEndpointPort": "19001",
    "serviceConnectionEndpointPort": "19006",
    "httpGatewayEndpointPort": "19080",
    "reverseProxyEndpointPort": "19081",
    "applicationPorts": {
      "startPort": "30001",
      "endPort": "31000"
    },
    "placementProperties": {
      "Is-Gateway": true
    },
    "isPrimary": true
   }
]

Here is an error message from the Service Fabric Explorer


Error event: SourceId='System.FM', Property='State'.
Partition is below target replica or instance count.
(service URI) -1 1 0a0ff8e2-d643-4cca-bfd4-f236b7197dea
  (Showing 0 out of 0 replicas. Total available replicas: 0)

For more information see: http://aka.ms/sfhealth

Here is the output of Get-ServiceFabricServiceDescription


ApplicationName              : (application name)
ServiceName                  : (service name)
ServiceTypeName              : (service type name)
ServiceKind                  : Stateless
InstanceCount                : -1
PartitionScheme              : Singleton
PlacementConstraints         : Is-Gateway == true
DefaultMoveCost              : Zero
LoadMetrics                  : {}
CorrelatedServices           : None
PlacementPolicies            : None
ServicePackageActivationMode : ExclusiveProcess
ServiceDnsName               :

masnider commented 6 years ago

I've been able to repro this. Mighty weird and an unfortunate little bug.

@OlegKarasik thanks for reporting. What you can probably do in the meantime is either change the parameter name or just switch to using the built-in NodeType constraint (since by default properties will always align to NodeType boundaries). We understand that NodeType can be less expressive (you really do want to say "Is-Gateway == true" or "HasSSD == true" instead of "NodeType == NodeType1").

Meanwhile we'll see what it would take to fix this and get back to you.

masnider commented 6 years ago

Actually - before we go further, I just tried something and it worked and now I think we may have a different bug. Or a combination of things.

Try the following out for me: Create the service with the busted constraint and get it to where it's sitting unplaced.

Then issue the following: Update-ServiceFabricService -PlacementConstraints "Is-Gateway == True" -Stateless -ServiceName "$yourServiceNameHere"

Note the capital T in True.

When I do this, it all snaps together.

The bug doesn't appear to be the dash. It's a combination of two things:

  1. During the manifest generation, the "true" you put in the template gets turned into "True" in the actual cluster configuration and
  2. Our boolean check for this is case sensitive, which makes me think it's doing a string comparison rather than correctly interpreting these things as bools.
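
The suspected mismatch can be illustrated with a small Python sketch. This is not Service Fabric's evaluator, whose implementation is not shown in this thread; `node_properties`, `matches_as_string`, `parse_bool`, and `matches_as_bool` are hypothetical names, and the stored value "True" reflects the assumption in point 1 that the generated cluster configuration ends up with a capitalized literal.

```python
# Hypothetical illustration only; not Service Fabric code.
# Assumption: manifest generation stored the JSON boolean true as the text "True".
node_properties = {"Is-Gateway": "True"}


def matches_as_string(properties, name, literal):
    """Case-sensitive text comparison, the behavior suspected in point 2."""
    return properties.get(name) == literal


def parse_bool(text):
    """Interpret 'true'/'false' in any letter case as a boolean."""
    lowered = text.strip().lower()
    if lowered not in ("true", "false"):
        raise ValueError(f"not a boolean literal: {text!r}")
    return lowered == "true"


def matches_as_bool(properties, name, literal):
    """Comparison that treats both sides as booleans."""
    value = properties.get(name)
    if value is None:
        return False
    return parse_bool(value) == parse_bool(literal)


print(matches_as_string(node_properties, "Is-Gateway", "true"))  # False
print(matches_as_string(node_properties, "Is-Gateway", "True"))  # True
print(matches_as_bool(node_properties, "Is-Gateway", "true"))    # True
```

Under these assumptions, only the text comparison is sensitive to the capital T, which would match what I'm seeing after the update.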

Still, a couple of weird behaviors are combining to make it look like the problem is the dash. Can you try flipping the value to something like "foo" and seeing if things still don't work? My bet is that it will get translated correctly and the constraints will work.

OlegKarasik commented 6 years ago

I have created two nodes with dashed properties - bool and string:

"nodeTypes": [
    {
      "name": "NodeType0",
      "placementProperties": {
        "is-bool-property": true
      }
    },
    {
      "name": "NodeType1",
      "isPrimary": true,
      "placementProperties": {
        "is-string-property": "value"
      }
    }
]

Then I defined two services with requirements for these properties:

<Service Name="StatelessService" 
         ServicePackageActivationMode="ExclusiveProcess">
  <StatelessService ServiceTypeName="StatelessServiceType" 
                    InstanceCount="[StatelessService_InstanceCount]">
    <SingletonPartition />
    <PlacementConstraints>is-bool-property == true</PlacementConstraints>
  </StatelessService>
</Service>
<Service Name="StatefulService" 
         ServicePackageActivationMode="ExclusiveProcess">
  <StatefulService ServiceTypeName="StatefulServiceType" 
                   TargetReplicaSetSize="[StatefulService_TargetReplicaSetSize]" 
                   MinReplicaSetSize="[StatefulService_MinReplicaSetSize]">
    <UniformInt64Partition PartitionCount="[StatefulService_PartitionCount]"
                           LowKey="-9223372036854775808" 
                           HighKey="9223372036854775807" />
    <PlacementConstraints>is-string-property == value</PlacementConstraints>
  </StatefulService>
</Service>

After deployment, the stateless service had the same error as described in the bug, but the stateful service was fine and healthy.

From the Service Fabric Explorer Stateless Service has the following placement constraint configuration:

Placement Constraints is-bool-property == true

I've executed:

Update-ServiceFabricService `
  -PlacementConstraints "is-bool-property == True" `
  -Stateless `
  -ServiceName "fabric:/App/StatelessService"

and this call modified Stateless Service placement constraint to

Placement Constraints is-bool-property == True

Right after that Service Fabric created service replica!

This was kind of suspicious (too easy), so I did the reverse operation (note the small 't'):

Update-ServiceFabricService `
  -PlacementConstraints "is-bool-property == true" `
  -Stateless `
  -ServiceName "fabric:/App/StatelessService"

In the Service Fabric Explorer placement constraints were changed to

Placement Constraints is-bool-property == True

but nothing else changed. No service replica was removed. The service was healthy and ready to work.

That is why I decided to try a crazy thing. I redeployed my app (and got the error state again) and then updated the stateless service with the same value that was specified in ApplicationManifest.xml:

Update-ServiceFabricService `
  -PlacementConstraints "is-bool-property == true" `
  -Stateless `
  -ServiceName "fabric:/App/StatelessService"

but nothing happened :(

So if the behavior described above is expected (I am talking about changing the constraint back to a small 't'), then your assumption is probably the right one.
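
For anyone wanting to avoid typing the lowercase literal, here is a minimal Python sketch that rewrites boolean literals in a constraint string to the capitalized form that worked with Update-ServiceFabricService above. `normalize_bool_literals` is a hypothetical helper, not part of any Service Fabric tooling, and it assumes constraints only use `==` and `!=` against `true`/`false`. Whether a capitalized literal placed directly in ApplicationManifest.xml avoids the problem is not confirmed in this thread.

```python
import re

# Hypothetical helper; not part of any Service Fabric tooling.
# Matches a comparison operator followed by true/false in any letter case.
_BOOL_LITERAL = re.compile(r"(==|!=)\s*(true|false)\b", re.IGNORECASE)


def normalize_bool_literals(constraint):
    """Rewrite boolean literals as 'True'/'False'; leave other text untouched."""
    def fix(match):
        operator, literal = match.group(1), match.group(2)
        return f"{operator} {literal.capitalize()}"
    return _BOOL_LITERAL.sub(fix, constraint)


print(normalize_bool_literals("is-bool-property == true"))
print(normalize_bool_literals("is-string-property == value"))
```

String-valued constraints such as the one on the stateful service pass through unchanged, since only the literals true and false are rewritten.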