Open jberezanski-mdg opened 4 years ago
@dragav FYI.
@jberezanski-mdg thanks for the detailed feedback, much appreciated. A few clarifications to your points:
Here is one cluster which was created about three weeks ago using SF 7.1.409.9590 and now exhibits the certificate validation issue:
Resource ID: /subscriptions/cf3bca15-63ae-49f2-927f-d4ebce661331/resourcegroups/ecgtest-azn-sf-rg/providers/Microsoft.ServiceFabric/clusters/ecgtest-azn-sf
Region: West Europe
Cluster creation timestamp: Wed Jul 08 2020 15:14:19 GMT+0200
Last failed attempt to change cluster settings: Fri Jul 24 2020 20:49:52 GMT+0200
We will soon be applying the workaround of setting EnforcePrevalidationOnSecurityChanges = false, while at the same time working on the proper solution: extending our deployment pipeline to ensure all required root and intermediate certificates are installed on cluster nodes (yes, I realize the chain needs to be buildable only by the Cluster Manager primary replica, but the implementation effort of installing the certificates on all nodes is the same and the resulting state is more reliable).
About documentation - yes, the "misc cert settings" section does imply chain building is required. However, the section "Thumbprint-based certificate validation declarations" earlier on that same page claims that "Any trust errors encountered during chain building or validation will be suppressed for thumbprint-based declarations" and "partial chain are considered non-fatal errors", so I was really surprised that a chain building failure resulted in a failed upgrade. I would also expect to see this change of behavior described in the 7.1 release notes.
It would also be useful to include installation of CA certificates in the sample cluster ARM templates.
First of all, I do apologize for asking you to disclose the details of your cluster/deployment; I hope my request did not cause discomfort/uneasiness. Please feel free to reach out to me directly, if you have concerns regarding conducting this dialog on a quasi-public forum - I'm dragosav at microsoft dot com.
On to your observations: there is a nuance here that may require clarification (and perhaps a document update). The pre-validator runs on a single node, without the benefit or the context of a true remote peer. The nodes of the cluster, on the other hand, run by default in a 'remote peer trust' mode, whereby a remote node that presents the same key as the local node is considered trusted, since both possess the same private key.
So yes, the documentation is correct on both points, though it doesn't clarify why.
Here are some traces from your test cluster (anonymized) demonstrating that authentication succeeded despite the failure in building the chain. First, client-server authentication (within the cluster):
2020-7-24 20:46:10.997 Transport.SecurityContextSsl 5688 6772 VerifyCertificate: remoteAuthenticatedAsPeer = false, incoming: tp='17..47', against: certThumbprintsToMatch='17..47,85..b3', x509NamesToMatch='('westeurope.servicefabric.azure.com',issuer=)', certChainFlags=0x40000000, maskedFlags=0x40000000 shouldIgnoreCrlOffline=true shouldAcceptExpiredCert=false
2020-7-24 20:46:11.007 Transport.SecurityContextSsl 5688 6772 usage: 1.3.6.1.5.5.7.3.2, cert chain trust status: info = 0, error = 1010040, chainSize = 1, SelfSigned = false
2020-7-24 20:46:11.007 Common.CryptoUtility 5688 6772 partial chain? certChain->rgpChain[0]->cElement = 1, certChain->rgpChain[0]->rgpElement[0]->TrustStatus.dwInfoStatus = 1
2020-7-24 20:46:11.007 Transport.SecurityContextSsl 5688 6772 incoming cert: thumbprint = 17..47, subject='....CN=ecgtest...', issuer='C=P.. O=Me.. CN=BL.. issuerCertThumbprint=, NotBefore=2019-06-07 18:34:23.000, NotAfter=2040-01-07 18:34:22.000
2020-7-24 20:46:11.007 Transport.SecurityContextSsl 5688 6772 CertVerifyCertificateChainPolicy(1) failed with policy status: 0x800b0109
2020-7-24 20:46:11.007 Transport.SecurityContextSsl 5688 6772 incoming cert thumbprint 17..47 matched, CertChainShouldBeVerified=false, shouldAcceptExpiredPinnedCert=false, certChainErrorStatus=0x800b0109
2020-7-24 20:46:11.008 HttpGateway.HttpGatewayRequestHandler 5688 6772 Dispatching URL https://10.100.19.5:19080/Nodes/_primary_1/$/HostedActiveCodePackage/eb..de2?api-version=3.0, from 10.100.19.5, operation: POST, Anonymous: false, Role: Admin, Body: null, ClientRequestId: 345....
And peer-to-peer/Federation-level authentication:
2020-7-24 20:50:04.496 Transport.SecurityContextSsl@20cce58a110 3644 10256 TryAuthenticateRemoteAsPeer: remote public key matches local: (alg = 1.2.840.113549.1.1.1, param = ptr=0x20cce431670, size=2, key = ptr=0x20c22ef6720, size=20, bytes=3082010a0282010100a8e23f1b783b6c2b006222), RoleMask = All
2020-7-24 20:50:04.496 Transport.SecurityContextSsl@20cce58a110 3644 10256 VerifyCertificate: remoteAuthenticatedAsPeer = true, incoming: tp='17...47', against: certThumbprintsToMatch='17...47', x509NamesToMatch='', certChainFlags=0x40000000, maskedFlags=0x40000000 shouldIgnoreCrlOffline=true shouldAcceptExpiredCert=false
2020-7-24 20:50:04.497 Transport.SecurityContextSsl@20cce58a110 3644 10256 usage: 1.3.6.1.5.5.7.3.2, cert chain trust status: info = 0, error = 1010040, chainSize = 1, SelfSigned = false
2020-7-24 20:50:04.497 Common.CryptoUtility 3644 10256 partial chain? certChain->rgpChain[0]->cElement = 1, certChain->rgpChain[0]->rgpElement[0]->TrustStatus.dwInfoStatus = 1
2020-7-24 20:50:04.497 Transport.SecurityContextSsl@20cce58a110 3644 10256 incoming cert: thumbprint = 17...47, subject='...CN=ecgtest...', issuer='...CN=BL...', issuerCertThumbprint=, NotBefore=2019-06-07 18:34:23.000, NotAfter=2040-01-07 18:34:22.000
2020-7-24 20:50:04.497 Transport.SecurityContextSsl@20cce58a110 3644 10256 CertVerifyCertificateChainPolicy(1) failed with policy status: 0x800b0109
2020-7-24 20:50:04.497 Transport.SecurityContextSsl@20cce58a110 3644 10256 VerifyCertificate: complete as remote is authenticated as peer and the chain has no fatal error
2020-7-24 20:50:04.497 Transport.SecurityContextSsl@20cce58a110 3644 10256 cert auth completed: IsClientRoleInEffect = false, RoleMask = All
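For readers following the traces, the numeric values can be decoded against the Windows CryptoAPI constants. The sketch below assumes that "error = 1010040" is the hexadecimal dwErrorStatus of CERT_TRUST_STATUS (the trace does not say so explicitly); the bit values are the ones defined in wincrypt.h, and only a subset is listed. The function name is made up for illustration. For reference, the policy status 0x800b0109 is CERT_E_UNTRUSTEDROOT and policy (1) is CERT_CHAIN_POLICY_BASE.

```python
from typing import List

# Subset of CERT_TRUST_STATUS.dwErrorStatus bits from wincrypt.h.
CERT_TRUST_ERROR_FLAGS = {
    0x00000001: "CERT_TRUST_IS_NOT_TIME_VALID",
    0x00000004: "CERT_TRUST_IS_REVOKED",
    0x00000008: "CERT_TRUST_IS_NOT_SIGNATURE_VALID",
    0x00000010: "CERT_TRUST_IS_NOT_VALID_FOR_USAGE",
    0x00000020: "CERT_TRUST_IS_UNTRUSTED_ROOT",
    0x00000040: "CERT_TRUST_REVOCATION_STATUS_UNKNOWN",
    0x00010000: "CERT_TRUST_IS_PARTIAL_CHAIN",
    0x01000000: "CERT_TRUST_IS_OFFLINE_REVOCATION",
}


def decode_trust_error(value: int) -> List[str]:
    """Return the names of the known error bits set in value, lowest bit first."""
    names = []
    remaining = value
    for bit, name in sorted(CERT_TRUST_ERROR_FLAGS.items()):
        if value & bit:
            names.append(name)
            remaining &= ~bit
    if remaining:
        # Bits not covered by the subset above are reported rather than dropped.
        names.append("unlisted bits 0x%08x" % remaining)
    return names


# Assumption: the trace prints the status in hex without a 0x prefix.
trace_error = int("1010040", 16)
for flag in decode_trust_error(trace_error):
    print(flag)
```

Under that assumption the value decodes to a partial chain plus two revocation-related bits (status unknown, offline), which is consistent with the "partial chain?" line and with shouldIgnoreCrlOffline=true in the same trace.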
Since the pre-validator works in isolation, it can't ignore the failures in chain building - which is why the 'CrlCheckingFlags' needs to gravitate more towards status masking (output of validation) rather than chain building (as input flags). In a sense, the pre-validator is stronger than the cluster's actual validation, and so we do not recommend it with non-standard PKIs.
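To make the difference concrete, here is a toy model of the two behaviors described above. It is not Service Fabric code: the class, the function names, and the single "fatal_error" flag are all hypothetical simplifications based only on what the traces and this comment state (the runtime accepts a pinned thumbprint or a peer with a matching key despite a partial chain; the pre-validator, running in isolation, does not).

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class ChainOutcome:
    """Hypothetical summary of one certificate validation attempt."""
    thumbprint: str
    partial_chain: bool  # issuer chain could not be built on this node
    fatal_error: bool    # placeholder for errors that are never suppressed


def runtime_accepts(outcome: ChainOutcome, pinned: Set[str], remote_is_peer: bool) -> bool:
    """Toy model of in-cluster validation as seen in the traces."""
    if outcome.fatal_error:
        return False
    if remote_is_peer:
        # "remote is authenticated as peer and the chain has no fatal error"
        return True
    # Thumbprint-based declaration: a match is enough, chain errors are masked.
    return outcome.thumbprint in pinned


def prevalidator_accepts(outcome: ChainOutcome, pinned: Set[str]) -> bool:
    """Toy model of the pre-validator: no remote peer, so chain failures count."""
    return (
        outcome.thumbprint in pinned
        and not outcome.partial_chain
        and not outcome.fatal_error
    )


# Placeholder thumbprint; the real one is anonymized in the traces.
leaf = ChainOutcome(thumbprint="AAAA", partial_chain=True, fatal_error=False)
pinned = {"AAAA"}
print("runtime (client-server):", runtime_accepts(leaf, pinned, remote_is_peer=False))
print("runtime (peer-to-peer): ", runtime_accepts(leaf, pinned, remote_is_peer=True))
print("pre-validator:          ", prevalidator_accepts(leaf, pinned))
```

In this model the same certificate passes both runtime paths and fails pre-validation, which is the asymmetry reported in this issue.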
I'm not sure what you meant about including the CA installation in the ARM templates; there is no mechanism in Azure for installing chain elements - they must be either part of the OS image, or explicitly installed by the cluster owner (via extensions such as Custom Script or Desired State Configuration).
Thank you for taking time to inspect that cluster and providing very valuable insight. This is a test cluster, fully isolated from our production environments, so I'm OK with discussing its configuration details.
About the sample ARM templates - when I wrote the above, I assumed that such a mechanism was in fact already available in Azure. Only later, after digging through the docs and examining the WAAppAgent and the Key Vault Extension internals, did I realize that there is no support for such functionality and that we will need to engineer our own solution, most likely using DSC.
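As a rough illustration of what such a self-engineered step could run on each node (for example from a Custom Script extension), the sketch below builds certutil commands that add root certificates to the machine "Root" store and intermediates to the "CA" store. It assumes Windows nodes where certutil.exe is on the PATH and a Python interpreter is available, which is not a given on Service Fabric VMs; the file names are hypothetical, and a DSC or PowerShell implementation would be the more natural choice in practice.

```python
import subprocess
from typing import List, Sequence


def build_certutil_commands(root_files: Sequence[str],
                            intermediate_files: Sequence[str]) -> List[List[str]]:
    """Return one certutil invocation per certificate file.

    "Root" is the Trusted Root Certification Authorities store and "CA" is
    the Intermediate Certification Authorities store; -f overwrites an
    existing entry.
    """
    commands = []
    for path in root_files:
        commands.append(["certutil", "-addstore", "-f", "Root", path])
    for path in intermediate_files:
        commands.append(["certutil", "-addstore", "-f", "CA", path])
    return commands


def install(commands: Sequence[Sequence[str]]) -> None:
    """Run the commands; requires an elevated prompt on the node."""
    for command in commands:
        subprocess.run(list(command), check=True)


if __name__ == "__main__":
    # Hypothetical file names for an internal two-tier PKI.
    planned = build_certutil_commands(["internal-root.cer"], ["internal-issuing.cer"])
    for command in planned:
        print(" ".join(command))
    # install(planned)  # uncomment on a node to actually modify the stores
```

Installing on every node, rather than only where the Cluster Manager primary happens to run, matches the approach described earlier in this thread.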
Are there any plans to strengthen the cluster's validation rules to the level of the pre-validator in the future? In other words, how long can we safely keep our clusters running with EnforcePrevalidationOnSecurityChanges = false?
@dragav to comment on any known plans in this space at this time so we can close this thread out.
In general, if we make any breaking changes we do our best to make sure they are communicated well in advance.
This is the one email I've kept as unread since July. We have a tracking bug internally, and I need to make a fix. This will be included with the next release.
Service Fabric Runtime Version: 7.1.417.9590
Environment: Azure
Description and observed behavior:
SF 7.1 introduced security settings validation as part of preparing for a cluster configuration change. During this validation, SF (specifically, the System.Fabric.Management.WindowsFabricValidator.SecuritySettingsValidator class) builds a trust chain for all certificates used by the cluster. When the chain cannot be built, the validation fails and cluster config change is rejected, resulting in this error message:
We encountered this problem in clusters we routinely set up and tear down during development. Our clusters use certificates issued from an internal CA and the certificates are specified in SF configuration by thumbprint. Until SF 7.1, there was no need to install the internal CA certificates on cluster nodes and SF worked fine with only the leaf certificates present.
Now we are observing the following behavior:
The discrepancy in behavior is the result of the Security/EnforcePrevalidationOnSecurityChanges parameter, which used to be false but is now set to true on new SF clusters in Azure. Fortunately, this parameter can be set to false on a cluster already exhibiting this problem, but the parameter itself is not easy to find.
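For anyone looking for where this parameter goes: in an ARM template it lives in the cluster resource's fabricSettings array, under the "Security" section, with the value given as a string. The sketch below is a small helper (the function name is made up) that merges the parameter into an existing fabricSettings list and prints the resulting JSON fragment; check the result against your own template before applying it.

```python
import json
from typing import Any, Dict, List

PARAM = "EnforcePrevalidationOnSecurityChanges"


def set_prevalidation(fabric_settings: List[Dict[str, Any]],
                      enabled: bool) -> List[Dict[str, Any]]:
    """Return a copy of fabric_settings with Security/PARAM set to enabled."""
    settings = json.loads(json.dumps(fabric_settings))  # cheap deep copy
    value = "true" if enabled else "false"

    # Find the Security section, or create it if the template has none.
    section = next((s for s in settings if s.get("name") == "Security"), None)
    if section is None:
        section = {"name": "Security", "parameters": []}
        settings.append(section)

    # Update the parameter in place, or add it.
    parameters = section.setdefault("parameters", [])
    for parameter in parameters:
        if parameter.get("name") == PARAM:
            parameter["value"] = value
            break
    else:
        parameters.append({"name": PARAM, "value": value})
    return settings


# Example: a template that already has one unrelated Security parameter.
existing = [
    {"name": "Security",
     "parameters": [{"name": "ClusterProtectionLevel", "value": "EncryptAndSign"}]}
]
print(json.dumps(set_prevalidation(existing, enabled=False), indent=2))
```

The helper leaves other sections and parameters untouched, so it can be applied to the fabricSettings of an existing cluster definition without losing settings.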
The requirement to provide a full CA chain on cluster nodes for validation of cluster certificates is certainly a reasonable one, but there are two issues with its current implementation: 1) the requirement is not documented clearly anywhere in SF docs or SF 7.1 release notes, 2) the validation is not performed during cluster creation (when it would immediately come to user attention), rather, it only starts happening later, when the user attempts to upgrade or reconfigure the cluster.
Expected Behavior:
OS(Windows/Linux): Windows
Assignees: @microsoft/service-fabric-triage