microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License

ServiceFabric Managed Identity returns 500 errors when cluster AAD Auth is enabled #416

Open johncrim opened 5 years ago

johncrim commented 5 years ago

SF Linux cluster running Ubuntu 16.04. (https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-tutorial-create-vnet-and-linux-cluster)

Follow both of these instructions:

  1. Enable cluster AAD auth: https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-creation-setup-aad
  2. Enable SF service managed identity: https://docs.microsoft.com/en-us/azure/service-fabric/how-to-managed-identity-service-fabric-app-code

Then write some service code that sends a request to the MSI_ENDPOINT, just like the doc here: https://docs.microsoft.com/en-us/azure/service-fabric/how-to-managed-identity-service-fabric-app-code#accessing-key-vault-from-a-service-fabric-application-using-managed-identity

This URL is returning 500 errors every time:

http://(host IP):2377/metadata/identity/oauth2/token?api-version=2019-07-01-preview&resource=https://keyvault.azure.com/

If I rebuild the cluster but skip the "Enable cluster AAD auth" step, the same metadata/identity requests return managed identity tokens, as expected.
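For reference, the failing request can be sketched as below. This is a minimal illustration that only assembles the URL quoted above; the helper name `build_token_url` and the example host `10.0.0.4` are hypothetical, and any headers the endpoint requires are deliberately left out because they are not shown in this report.

```python
from urllib.parse import urlencode

# Path and api-version are copied from the URL quoted in this report.
TOKEN_PATH = "/metadata/identity/oauth2/token"
API_VERSION = "2019-07-01-preview"


def build_token_url(endpoint_base: str, resource: str) -> str:
    """Assemble the managed identity token request URL.

    endpoint_base is a hypothetical input such as "http://10.0.0.4:2377";
    a real service would read it from whatever the runtime exposes.
    """
    query = urlencode(
        {"api-version": API_VERSION, "resource": resource},
        safe=":/",  # keep the resource URI unescaped, as in the report
    )
    return f"{endpoint_base.rstrip('/')}{TOKEN_PATH}?{query}"


if __name__ == "__main__":
    print(build_token_url("http://10.0.0.4:2377", "https://keyvault.azure.com/"))
```

With AAD auth enabled on the cluster, a GET against a URL of this shape is what returned the 500 responses.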

This is one of several problems with fundamental functionality that I've encountered because we're using AAD authentication (which is recommended in the doc above). One other (which may be related) is: https://github.com/microsoft/service-fabric/issues/399 (I reported this issue to support in June, after putting up with it for a couple months).

It seems that one should NOT enable AAD authentication for Azure Linux clusters.

dragav commented 4 years ago

@johncrim I think there may be some confusion regarding AAD-enabled clusters and SF managed identities - put simply, the two are completely unrelated features. While they can be enabled independently (and should work side-by-side), they can't be combined - that is, you won't be able to use a service's SF MI to authenticate to an AAD-enabled cluster. (I couldn't tell from your report whether that is something you intend to do, or just an observation of incompatibility.)

Enabling AAD-based authentication refers to clients connecting to the cluster's management endpoints using access tokens, and requires the manual creation and configuration of native AAD applications in the customer's tenant. For Linux clusters, this is strictly a cluster creation time directive - it cannot be enabled on pre-existing clusters. Enabling this feature has no effect on whether the cluster supports Managed Identities or not.

Enabling SF Managed Identity in a cluster is described here (for existing clusters; new cluster deployment steps are here.)

If, indeed, you have created a new Linux cluster enabled for AAD-based authentication and enabled for MI, and you are observing the InternalServerError exceptions noted above, please share the cluster resource id and region, as well as an approximate timestamp. (If you'd prefer not to disclose this information publicly, please feel free to email me directly - dragosav at microsoft.) We'll attempt to repro this on our end as well.

Thanks.

johncrim commented 4 years ago

Hi @dragav - thanks for the reply.

I'm well aware that AAD-enabled clusters and SF managed identity are completely distinct, and shouldn't be related. This bug is that they don't work side-by-side. I spent a few days trying to figure out why the managed identity endpoint was returning 500 errors (with no useful error messages in logs), and since I'd had other odd problems (with no data to debug them) related to having AAD enabled (one example is #399), I decided to try switching to certificate auth for cluster client communications, and the managed identity endpoint started working. This was the only thing that changed in our setup.

I followed all the instructions for setting up AAD, and had it working for a while before adding SF managed identity. I followed all the docs closely while setting up Managed Identity, and couldn't get it to work, until I removed the AAD auth.

The cluster I was using is no longer using AAD auth. It was a significant investment to migrate our ARM templates, deploy code, etc. to use certificates instead of AAD (because of this bug), so it's not a trivial change for us to switch it back for your testing. (It was also a significant investment to get AAD cluster auth working - the docs suck, which might be why it doesn't seem to be well-tested.) We certainly would like to switch back to AAD auth; securing users by identities is better than a couple of certificates that are shared with users and deploy code. However, after all the time we've already wasted on this, we can't just switch back to AAD for your investigation (sorry).

Also, if you do investigate this, I would really appreciate it if you could look at #402. Both of these bugs require using ARM to deploy SF apps to a Linux cluster, so if you go to the effort of setting that up, bug #402 should be easy to investigate as well.

johncrim commented 4 years ago

This is still an issue in ServiceFabric 7.1 on Ubuntu 16. We've updated our SF code to match the instructions here: https://docs.microsoft.com/en-us/azure/service-fabric/how-to-managed-identity-service-fabric-app-code

We tried turning AAD auth on again, and it broke the ManagedIdentityTokenService.

Digging in deeper, the 500 errors returned from the token service (only when AAD is enabled) correspond to this error in the ServiceFabric trace log:

2020-7-9 20:51:53.278,Informational,66271,66256,HttpGateway.HttpCertificateAuthHandler,"Client certificate missing for uri: http://localhost/Nodes/vm0/$/HostedActiveCodePackage/6cf7d9d8-cd99-4cea-a85b-c9818023ba60?api-version=3.0, error: FABRIC_E_INVALID_CREDENTIALS."
2020-7-9 20:51:53.278,Informational,66271,66256,HttpGateway.HttpGatewayRequestHandler,"CheckAccess failed for URL: http://localhost/Nodes/vm0/$/HostedActiveCodePackage/6cf7d9d8-cd99-4cea-a85b-c9818023ba60?api-version=3.0, from , operation: POST, responding with ErrorCode: 403 authheader: :, ClientRequestId: 485a42c6-c386-ae46-9ce6-7a2bda13fda5"
2020-7-9 20:51:53.278,Informational,66271,66256,HttpGateway.HttpGatewayRequestHandler,"Responding with header: 403, description: Client certificate required for ClientRequestId 485a42c6-c386-ae46-9ce6-7a2bda13fda5."
2020-7-9 20:51:53.278,Warning,96435,76812,ManagedIdentityTokenService.CommunicationClientBase,"Request failed: Method: POST, RequestUri: 'https://10.11.5.4:19080/Nodes/vm0/$/HostedActiveCodePackage/6cf7d9d8-cd99-4cea-a85b-c9818023ba60?api-version=3.0', Version: 1.1, Content: System.Net.Http.StringContent, Headers:
    {
      Content-Type: text/plain; charset=utf-8
      Content-Length: 4
    }
    Response:
    StatusCode=Forbidden Reason=Client certificate required"
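To pick lines like these out of a larger trace file, a small filter along the following lines may help. It is only a sketch: the field layout (timestamp, level, two numeric ids, source component, quoted message) is inferred from the excerpt above rather than from any documented format, and the names `TraceRecord`, `parse_trace_line` and `is_auth_failure` are made up for illustration.

```python
import csv
from typing import NamedTuple


class TraceRecord(NamedTuple):
    timestamp: str
    level: str
    source: str
    message: str


def parse_trace_line(line: str) -> TraceRecord:
    # Layout inferred from the excerpt: timestamp, level, two numeric ids
    # (their meaning is not stated, so they are ignored), source, message.
    fields = next(csv.reader([line]))
    timestamp, level, _id1, _id2, source = fields[:5]
    # The message is quoted and may contain commas; csv keeps it as one field.
    message = ",".join(fields[5:])
    return TraceRecord(timestamp, level, source, message)


def is_auth_failure(record: TraceRecord) -> bool:
    # Marker strings copied from the messages quoted above.
    markers = (
        "FABRIC_E_INVALID_CREDENTIALS",
        "ErrorCode: 403",
        "Client certificate required",
    )
    return any(marker in record.message for marker in markers)
```

This handles single-line records only; the last record above spans several lines and would need the whole file passed through `csv.reader` instead.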

I decompiled the ManagedIdentityTokenService, and there's no code related to bearer tokens (only client certificates). So, it continues to use certificate authentication whether AAD is enabled for the cluster or not.

It also seems like a bug in the ServiceFabric API that it reports "Client certificate required" when it won't use the client certificate (which is what is happening here - I'm pretty sure the client certificate is being passed in, but the API is not using it). The error FABRIC_E_INVALID_CREDENTIALS seems correct, but the parts about "Client certificate missing" seem incorrect.

This error goes away when AAD auth is turned off for the cluster.

I'm very frustrated that ManagedIdentityTokenService wasn't tested with AAD auth enabled. We've wasted a lot of time setting up AAD auth and managed identity, and they don't work together.

dragav commented 4 years ago

AAD-based auth is in addition to, not instead of, certificate-based auth in a cluster. Furthermore, AAD-based auth is meant for external clients connecting to the cluster management endpoint - SF components will not substitute AAD identities for internal communications.

In the traces/messages you identified above, MITS calls the local node's Fabric http gateway to discover/retrieve the identity description of the SF-hosted application which requested an access token. That call (between two SF components) is expected to authenticate with the cluster certificate, and evidently MITS does not have access to it, or failed to load it. The gateway, in turn, checks for the existence of a bearer token in the headers of the incoming request, and if that's missing (which is, again, expected in this case) falls back to certificate-based authentication.

It's strange that you only see this in AAD-enabled clusters; as explained previously, the fact that the cluster is enabled for AAD-based auth is orthogonal to the MITS-Hosting communication. Is it possible that the certificate management (i.e., provisioning certs) differs between your AAD- and non-AAD-enabled clusters?

I understand your frustration; we'll expand our testing coverage of MI to include this scenario. Thanks for sharing the details of your investigation.

johnc-ftl commented 4 years ago

Thank you very much for the reply and insights, @dragav .

What you're saying makes sense, and I would expect that certificate authentication would continue to work whether AAD is enabled or not (it certainly works for setting up the cluster) - when AAD is enabled the cluster is created fine, we don't see errors until we deploy a service that uses MSI.

I thought the same - based on this error message, perhaps MITS is not loading the cluster certificate or it's not available or not being used for another reason. While there's no logging that MITS has loaded or is using the cluster certificate, I created a simple command-line test app that both reports the user's local certs and sends a POST request to an API using these certs. I reviewed the decompiled MITS code to confirm that this test app is consistent with what MITS is doing, and I am seeing the exact same stack trace when AAD is enabled and when AAD is disabled.

SFTest.zip

I've attached the test app code to this comment. This test app succeeds with the HostedActiveCodePackage API when AAD is not enabled, and it fails when AAD is enabled with the exact same HTTP response code and reason as that reported in the SF logs. In both cases I can confirm that the same certificate is present in the sfuser CurrentUser/My store. Since this test app works with the HostedActiveCodePackage API when AAD is not enabled, and the cert is present in both cases, I'm pretty confident that the API is what is rejecting or not using the cert, instead of it not being passed in the AAD-enabled case.
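For readers without the attachment, the kind of check described above can be approximated as follows. This is not the code in SFTest.zip; it is a rough sketch that assumes the client certificate and key are available as PEM files, and the function names are hypothetical. The path shape and `api-version=3.0` are copied from the trace log earlier in the thread; the request body is left as a parameter because its content is not shown there.

```python
import http.client
import ssl
from typing import Tuple


def hosted_code_package_path(node_name: str, code_package_id: str) -> str:
    # Path shape and api-version copied from the trace log in this thread.
    return (
        f"/Nodes/{node_name}/$/HostedActiveCodePackage/"
        f"{code_package_id}?api-version=3.0"
    )


def post_with_client_cert(
    host: str,
    port: int,
    path: str,
    body: bytes,
    certfile: str,
    keyfile: str,
) -> Tuple[int, str]:
    """POST to the gateway, presenting a client certificate.

    Assumption: certfile/keyfile are PEM paths. Server certificate
    verification uses Python's defaults here, which may need adjusting
    for a cluster certificate that is not in the system trust store.
    """
    context = ssl.create_default_context()
    context.load_cert_chain(certfile=certfile, keyfile=keyfile)
    conn = http.client.HTTPSConnection(host, port, context=context, timeout=10)
    try:
        conn.request(
            "POST",
            path,
            body=body,
            headers={"Content-Type": "text/plain; charset=utf-8"},
        )
        response = conn.getresponse()
        return response.status, response.reason
    finally:
        conn.close()
```

Per the observations above, the status and reason returned by a call like this would be expected to match the "403 Client certificate required" seen in the SF logs when AAD is enabled.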

dragav commented 4 years ago

@johnc-ftl many thanks for your patience and collaboration on this issue, as well as sharing the repro app - we'll put it to good use. In the meantime we started deploying the MI sample app to our AAD-enabled Linux clusters, and are tracking this work with priority. We'll come back with updates early next week.

Thanks again for the patience and the assistance.

fuocor commented 2 years ago

'Next week' was 18 months ago. Has anything moved on this?

johncrim commented 2 years ago

Not that I'm aware of - we're still using certificate authentication on our clusters, due to this.

dragav commented 2 years ago

The SF http gateway on Windows is built on http.sys, which supports client certificate (re)negotiation: the server hello message does not require a client cert, then the authn handler checks for an auth header (bearer token) before lastly prompting the client for a cert.

On Linux this mixed-mode authn is not supported, or we haven't found a way to achieve the same behavior in OpenSSL. As a consequence, enabling AAD/token-based auth breaks certain scenarios, including the MITS flow. "Fixing" this is a breaking change, as it would require separating the ports for each type of authentication. Given the relatively infrequent reporting of this issue (John's case notwithstanding), we've punted on this work so far.

However, other Windows scenarios (including support for TLS1.3) require the client cert negotiation to be enabled by default (which would cause a similar break); SF on Linux is also picking up interest, and so we will need to (re)prioritize this work. I can't comment on an ETA here/at this time.
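The difference described in this comment can be illustrated with a toy model. It is not Service Fabric code: the `Request` type, the `can_prompt_for_cert_late` flag and the returned strings are invented for illustration, and the Linux branch encodes an assumption (that without late client-cert negotiation the gateway never gets to see the caller's certificate once token auth is enabled) that fits the 403 in the trace log but is not spelled out in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    bearer_token: Optional[str] = None  # external client holding an AAD token
    client_cert: Optional[str] = None   # certificate the caller could present


def authenticate(req: Request, can_prompt_for_cert_late: bool) -> str:
    """Toy model of the token-first, certificate-last order.

    can_prompt_for_cert_late=True models the http.sys behavior: the
    handshake does not require a cert, and the server can still ask for
    one after looking at the headers. False models the assumed Linux
    behavior with token auth enabled: no late prompt is possible.
    """
    if req.bearer_token is not None:
        return "200 (token)"
    if can_prompt_for_cert_late and req.client_cert is not None:
        return "200 (certificate)"
    return "403 Client certificate required"


if __name__ == "__main__":
    internal_call = Request(client_cert="cluster-cert")  # e.g. MITS -> gateway
    print(authenticate(internal_call, can_prompt_for_cert_late=True))
    print(authenticate(internal_call, can_prompt_for_cert_late=False))
```

In this model an internal caller that only has a certificate succeeds in the first case and is rejected in the second, which is the shape of the failure reported for MITS.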

fuocor commented 2 years ago

Thanks for the reply. This is a priority item as I require both ASAP.

johncrim commented 2 years ago

Thanks for the info @dragav - that's helpful.

MITS is pretty opaque to most users (just a local network endpoint) so I don't understand the backward compatibility issue. But I understand that this is not a priority if the Linux + AAD + managed identity use-case isn't used more widely. It's unfortunate that this sort of thing can't be fixed in more of an open-source contribution manner - it seems like an Azure internals issue.

dragav commented 2 years ago

@johncrim the SF MI flow is as follows: app -> MITS -> SF runtime via SF Gateway to confirm the app's identity - all on the local node. The issue described above occurs in this last leg (the http gateway on Linux doesn't handle mixed-mode auth), and so MITS just fails the request before even attempting to call the STS/backend. But the issue goes beyond MI - you'd encounter similar problems in the Backup/Restore or EventStore services, respectively.
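The flow in the comment above can be sketched as a short pipeline. This is a simplified model, not MITS code: `mits_handle_token_request` and its two callables are hypothetical stand-ins for the gateway identity lookup and the STS/backend call, and the 500 status mirrors what the application observed in this issue.

```python
from typing import Callable, Tuple


def mits_handle_token_request(
    confirm_identity: Callable[[], int],
    call_sts: Callable[[], str],
) -> Tuple[int, str]:
    # Leg 2: MITS asks the SF runtime, via the local gateway, to confirm
    # the identity of the calling app. The callable returns an HTTP status.
    gateway_status = confirm_identity()
    if gateway_status != 200:
        # The request fails here, before the STS/backend is ever contacted;
        # the app sees a 500 from the token endpoint.
        return 500, "identity lookup failed"
    # Leg 3: reached only when the gateway call succeeded.
    return 200, call_sts()


if __name__ == "__main__":
    # Gateway rejects the internal call (as on Linux with AAD auth enabled).
    print(mits_handle_token_request(lambda: 403, lambda: "token"))
    # Gateway accepts the internal call.
    print(mits_handle_token_request(lambda: 200, lambda: "token"))
```

The same gateway leg sits in front of other internal callers, which is why similar failures would be expected in the Backup/Restore and EventStore services.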

johncrim commented 2 years ago

@dragav - ok, that makes more sense. I do think there should be a warning in the docs about this - setting up AAD auth for Service Fabric is a ton of work and definitely not a polished process, and there should be something saying "don't bother with this if you're on Linux and want to use managed identity on one or more services".

I think the warning should be on this page: https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-creation-setup-aad

Right next to the warning about "Linux AAD-enabled clusters cannot be viewed in Azure Portal" - which is another issue that I submitted (initially to support).