microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License
Microservices, Kerberos security and AD forest permissions #748

Open qmarc opened 6 years ago

qmarc commented 6 years ago

We have an interesting scenario that forces us to place the machine (clustered node) account at the root level of the forest, with read permissions applied to the current object and all descendants. We initially attempted to apply all permissions at the OU level only, i.e. OU=ASF,OU=DEV,OU=DCNAME,OU=GEOLOC,OU=All Computers,DC=companyname,DC=com,DC=au,DC=local; however, this did not work.

The ASF cluster build we tested with is 6.2.274.9494 and, as you can see from the cluster config snippet below, we are using certificate security for node-to-node and client-to-node scenarios:

"security": {
    "ClusterCredentialType": "X509",
    "ServerCredentialType": "X509",
    "CertificateInformation": {
        "ClusterCertificate": {
          "Thumbprint": "{{ primary_server_certificate_thumbprint }}",
          "X509StoreName": "My"
        },
        "ServerCertificate": {
          "Thumbprint": "{{ primary_server_certificate_thumbprint }}",
          "X509StoreName": "My"
        },
        "ReverseProxyCertificate": {
            "Thumbprint": "{{ primary_server_certificate_thumbprint }}",
            "X509StoreName": "My"
        },
        "ClientCertificateThumbprints": [
            {
                "CertificateThumbprint": "{{ client_admin_certificate_thumbprint }}",
                "IsAdmin": true
            },
            {
                "CertificateThumbprint": "{{ client_readonly_certificate_thumbprint }}",
                "IsAdmin": false
            }
        ]
    }
}

This means ASF is running under the context of the local machine account Network Service. If we now push a microservice into the cluster that runs as an AD service account, and it attempts to communicate with SQL instances or anything else secured by the domain using Kerberos, the service keeps trying to start but fails with the error:

Error getting user account information for addomain\svc-acc-name: status=5, error=E_ACCESSDENIED

If you then examine the event logs of the node where the microservice is attempting to start, you will see ApplicationPrincipalAbortableError errors:

18/05/2018 12:19:10 PM Activate: Activate:ServiceFabricImpersonationType_App43:1.0, ErrorCode=ApplicationPrincipalAbortableError, RetryCount=14
18/05/2018 12:19:10 PM End(OpenVersionedApplication): Id=ServiceFabricImpersonationType_App43, Version=1.0, ErrorCode=ApplicationPrincipalAbortableError
18/05/2018 12:19:10 PM End(SetupApplicationEnvironment): Id=ServiceFabricImpersonationType_App43, ErrorCode=ApplicationPrincipalAbortableError
18/05/2018 12:19:10 PM ServiceFabricImpersonationType_App43: End ConfigureSecurityPrincipals: error ApplicationPrincipalAbortableError
18/05/2018 12:19:10 PM c048041e20e34b691b8de8c23c58969: App ServiceFabricImpersonationType_App43: SetupSecurityPrincipals failed with ApplicationPrincipalAbortableError, abort application principals
18/05/2018 12:19:10 PM AppId:ServiceFabricImpersonationType_App43 in NodeId:c048041e20e34b691b8de8c23c58969: ApplicationPrincipals::Open() fails. Error is: E_ACCESSDENIED
18/05/2018 12:19:10 PM Error getting user account information for addomain\svc-acc-name: status=5, error=E_ACCESSDENIED
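
For triage, a log excerpt like the one above can be summarized with a small script. This is a minimal sketch, not a Service Fabric tool: the `summarize` helper and its regular expressions are assumptions based only on the line shapes shown in this post, and the sample contains two lines copied from it. For context, status=5 is the Win32 code for access denied, which matches the E_ACCESSDENIED in the message.

```python
import re
from collections import Counter

# Two lines copied from the event log excerpt in this post.
SAMPLE = (
    "18/05/2018 12:19:10 PM Activate: Activate:ServiceFabricImpersonationType_App43:1.0, "
    "ErrorCode=ApplicationPrincipalAbortableError, RetryCount=14\n"
    "18/05/2018 12:19:10 PM Error getting user account information for "
    "addomain\\svc-acc-name: status=5, error=E_ACCESSDENIED\n"
)

# Patterns are assumptions derived from the log lines above, not a documented format.
ERROR_CODE = re.compile(r"ErrorCode=(\w+)")
RETRY_COUNT = re.compile(r"RetryCount=(\d+)")
ACCOUNT_ERROR = re.compile(
    r"Error getting user account information for (?P<account>[^:]+): "
    r"status=(?P<status>\d+), error=(?P<error>\w+)"
)


def summarize(log_text):
    """Return (error-code counts, highest retry count, failed account lookups)."""
    codes = Counter()
    max_retry = 0
    accounts = []
    for line in log_text.splitlines():
        match = ERROR_CODE.search(line)
        if match:
            codes[match.group(1)] += 1
        match = RETRY_COUNT.search(line)
        if match:
            max_retry = max(max_retry, int(match.group(1)))
        match = ACCOUNT_ERROR.search(line)
        if match:
            accounts.append(
                (match.group("account"), int(match.group("status")), match.group("error"))
            )
    return codes, max_retry, accounts


if __name__ == "__main__":
    codes, max_retry, accounts = summarize(SAMPLE)
    print(dict(codes), max_retry, accounts)
```

The failed account lookups are the useful part here: they name the exact principal the node could not read from the directory.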

Moving on from this, we decided to experiment with using a gMSA so that the ASF cluster would be more tightly integrated with our domain, using Kerberos security for all node and client communication. We hoped this tight integration would alleviate the issue and allow us to remove the machine accounts from the root of the forest; however, this has not proved fruitful at this stage, and more testing needs to be done. For reference, this is what our cluster config looks like with the gMSA change:

"security": {
    "ClusterCredentialType": "Windows",
    "ServerCredentialType": "Windows",
    "WindowsIdentities": {
        "ClustergMSAIdentity": "{{ env_domain }}\\{{ cluster_gmsa_identity }}",
        "ClusterSPN": "{{ cluster_gmsa_spn }}",
        "ClientIdentities": [
            {
                "Identity": "{{ env_domain_short }}\\ServiceFabricAdmins",
                "IsAdmin": true
            },
            {
                "Identity": "{{ env_domain_short }}\\ServiceFabricReadOnly",
                "IsAdmin": false
            }
        ]
    },

Any help to resolve this problem would be greatly appreciated.

This issue is a duplicate of the original Stack Overflow question.

qmarc commented 6 years ago

After some further experimentation, it appears that you can move the machine account (reading from 'This object and all descendants') to a branch deeper in the AD forest, closer to where your user account is.
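
To reason about which branches are "closer to where your user account is", it can help to list the containers above the account, from narrowest to widest. The sketch below does this by plain string handling on distinguished names. The service account DN is a hypothetical location (the post does not say where the account lives), the helper names are made up for illustration, and the parsing assumes no escaped commas in component values. The `covers` check also assumes permission inheritance is not blocked anywhere along the path.

```python
# OU from the original post, where the OU-level permissions were first applied.
NODE_OU = (
    "OU=ASF,OU=DEV,OU=DCNAME,OU=GEOLOC,OU=All Computers,"
    "DC=companyname,DC=com,DC=au,DC=local"
)
# Hypothetical location of the service account; not stated in the post.
SERVICE_ACCOUNT_DN = (
    "CN=svc-acc-name,OU=Service Accounts,DC=companyname,DC=com,DC=au,DC=local"
)


def split_dn(dn):
    """Split a DN into components (simplified: no escaped commas)."""
    return [part.strip() for part in dn.split(",") if part.strip()]


def candidate_scopes(dn):
    """List the object and each parent container, stopping at the domain root."""
    parts = split_dn(dn)
    scopes = []
    for index, part in enumerate(parts):
        scopes.append(",".join(parts[index:]))
        if part.upper().startswith("DC="):
            break  # the remaining DC= components name the same domain
    return scopes


def covers(scope_dn, target_dn):
    """True if target_dn is scope_dn itself or sits somewhere beneath it."""
    scope = [p.lower() for p in split_dn(scope_dn)]
    target = [p.lower() for p in split_dn(target_dn)]
    return len(scope) <= len(target) and target[len(target) - len(scope):] == scope


if __name__ == "__main__":
    for scope in candidate_scopes(SERVICE_ACCOUNT_DN):
        print(scope)
    print(covers(NODE_OU, SERVICE_ACCOUNT_DN))
```

Under these assumptions, a grant on the nodes' own OU would not reach an account stored in a different branch, which may be why the OU-level attempt described earlier did not work.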

Further to this, I discovered that if you run your services (rather than the cluster) using a gMSA, the machine accounts can be removed from AD completely!
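
For anyone wanting to try the service-level gMSA approach, the sketch below generates the two application manifest fragments that, as far as I know, Service Fabric's RunAs support expects: a `User` with `AccountType="ManagedServiceAccount"` under `Principals`, and a `RunAsPolicy` that references it. The account name `addomain\svc-gmsa$`, the `ServiceGmsa` reference name and the `Code` package name are all hypothetical. The fragments omit the manifest's XML namespace and would need to be merged into a real ApplicationManifest.xml, so check them against the official schema before use.

```python
import xml.etree.ElementTree as ET


def gmsa_run_as_fragments(user_ref, gmsa_account, code_package="Code"):
    """Build <Principals> and <Policies> fragments for running a service as a gMSA.

    All argument values are illustrative; a gMSA takes no password attribute.
    """
    principals = ET.Element("Principals")
    users = ET.SubElement(principals, "Users")
    ET.SubElement(
        users,
        "User",
        {
            "Name": user_ref,
            "AccountType": "ManagedServiceAccount",
            "AccountName": gmsa_account,
        },
    )

    # The policy ties one code package of the service to the user defined above.
    policies = ET.Element("Policies")
    ET.SubElement(
        policies,
        "RunAsPolicy",
        {"CodePackageRef": code_package, "UserRef": user_ref},
    )
    return (
        ET.tostring(principals, encoding="unicode"),
        ET.tostring(policies, encoding="unicode"),
    )


if __name__ == "__main__":
    # Hypothetical gMSA name; gMSA account names end with a dollar sign.
    principals_xml, policies_xml = gmsa_run_as_fragments(
        "ServiceGmsa", "addomain\\svc-gmsa$"
    )
    print(principals_xml)
    print(policies_xml)
```

In a real manifest, the `Policies` fragment belongs inside the relevant `ServiceManifestImport`, while `Principals` sits at the application level.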