Closed: zerowebcorp closed this issue 2 months ago.
Additionally, upon testing with the previous chart version, the replication works. This confirms that the issue is with the 4.2.1 chart.
# works, openldap 2.6.3
helm upgrade --install openldap helm-openldap/openldap-stack-ha -f "4.1.2.yaml" --version 4.1.2
# fails, openldap 2.6.6
helm upgrade --install openldap helm-openldap/openldap-stack-ha -f "4.2.1.yaml" --version 4.2.1
Here are the full overrides in the two YAML files.
4.1.2.yaml
global:
  ldapDomain: "example.com"
  existingSecret: "dit-openldap-password"
replicaCount: 4
image:
  repository: bitnami/openldap
  tag: 2.6.3
logLevel: info
service:
  ldapPortNodePort: 32010
  sslLdapPortNodePort: 32011
  type: NodePort
  sessionAffinity: ClientIP
replication:
  enabled: true
persistence:
  enabled: true
  existingClaim: openldap-dit-claim
  accessModes:
    - ReadWriteOnce
  size: 8Gi
  storageClass: "local-claim"
affinity:
  podAntiAffinity:
    # Add a hard requirement for each PD pod to be deployed to a different node
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/component
              operator: In
              values:
                - openldap
        topologyKey: "kubernetes.io/hostname"
    # Add a soft requirement for each PD pod to be deployed to a different AZ
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - openldap
          topologyKey: "topology.kubernetes.io/region"
nodeSelector:
  node.kubernetes.io/microk8s-worker: "microk8s-worker"
initContainers:
  - name: volume-permissions
    image: busybox
    command: [ 'sh', '-c', 'chmod -R g+rwX /bitnami' ]
    volumeMounts:
      - mountPath: /bitnami
        name: data
ltb-passwd:
  enabled: false
phpldapadmin:
  enabled: false
4.2.1.yaml
global:
  ldapDomain: "example.com"
  existingSecret: "dit-openldap-password"
replicaCount: 4
image:
  repository: bitnami/openldap
  tag: 2.6.6
logLevel: info
service:
  ldapPortNodePort: 32010
  sslLdapPortNodePort: 32011
  type: NodePort
  sessionAffinity: ClientIP
replication:
  enabled: true
persistence:
  enabled: true
  existingClaim: openldap-dit-claim
  accessModes:
    - ReadWriteOnce
  size: 8Gi
  storageClass: "local-claim"
affinity:
  podAntiAffinity:
    # Add a hard requirement for each PD pod to be deployed to a different node
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/component
              operator: In
              values:
                - openldap
        topologyKey: "kubernetes.io/hostname"
    # Add a soft requirement for each PD pod to be deployed to a different AZ
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - openldap
          topologyKey: "topology.kubernetes.io/region"
nodeSelector:
  node.kubernetes.io/microk8s-worker: "microk8s-worker"
initContainers:
  - name: volume-permissions
    image: busybox
    command: [ 'sh', '-c', 'chmod -R g+rwX /bitnami' ]
    volumeMounts:
      - mountPath: /bitnami
        name: data
ltb-passwd:
  enabled: false
phpldapadmin:
  enabled: false
The only difference between the two YAML files is the OpenLDAP image version (2.6.3 vs 2.6.6).
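A quick diff of the two override files confirms this (file names as above):
# only the image tag line is expected to differ
diff 4.1.2.yaml 4.2.1.yaml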
Hi @zerowebcorp, can you please check with v4.2.2?
No, replication is still not working.
Which image are you using?
@jp-gouin In case it's related: I'm not seeing the change from https://github.com/jp-gouin/containers/commit/322298151c1940484eb4de45d44dc27df82f415f in jpgouin/openldap:2.6.6-fix, which 4.2.2 uses. I also don't see it in the bitnami image, even though it was seemingly merged in.
@parak Indeed, it looks like it's not related. Currently bitnami/openldap:2.6.6 has a change that breaks the chart; I'm investigating to identify it and find a fix. That is why I reverted the image to jpgouin/openldap:2.6.6-fix, which is working (it is used in the CI).
I run chart version 4.2.2 with image jpgouin/openldap:2.6.6-fix and it works fine.
Do you by any chance use the default in the chart values:
initTLSSecret:
  tls_enabled: false
and let the init container create the TLS certs? That configuration is only suitable for a single node; multiple nodes need the same CA to establish TLS trust. Meaning you should create a CA, a TLS key, and a TLS cert, and store those in a secret for all the nodes to use.
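For illustration, a minimal sketch of that setup could look like the following (all names are placeholders, the exact secret keys expected by the chart should be checked against the chart documentation, and -addext / -copy_extensions need reasonably recent OpenSSL versions):
# 1. Create a CA shared by all replicas
openssl req -x509 -newkey rsa:4096 -nodes -days 3650 \
  -keyout ca.key -out ca.crt -subj "/CN=openldap-ca"

# 2. Create a server key and CSR with the replica FQDNs as SANs
openssl req -newkey rsa:4096 -nodes -keyout tls.key -out tls.csr \
  -subj "/CN=openldap" \
  -addext "subjectAltName=DNS:openldap.example.com,DNS:*.openldap-headless.default.svc.cluster.local"

# 3. Sign the CSR with the CA, keeping the SANs (OpenSSL 3.x; otherwise pass them again via -extfile)
openssl x509 -req -in tls.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -days 3650 -copy_extensions copyall -out tls.crt

# 4. Store everything in one secret and reference it via initTLSSecret.secret
kubectl create secret generic openldap-tls --type=kubernetes.io/tls \
  --from-file=tls.crt --from-file=tls.key --from-file=ca.crt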
I am seeing the same replication issue regardless of the tls_enabled setting and regardless of the image used. It's a fresh, first-time install, and the certs were generated using https://www.openldap.org/faq/data/cache/185.html
Replication fails to work with the following config for me. If I search the respective replicas for members of a group, the second and third instances show none while the first one does. This is from a fresh install; the values are below, and a sketch of the per-replica check follows them.
resources:
  limits:
    cpu: "128m"
    memory: "64Mi"
global:
  ldapDomain: dc=spgrn,dc=com
  existingSecret: ldap-admin
replicaCount: 3
env:
  LDAP_SKIP_DEFAULT_TREE: "yes"
ltb-passwd:
  enabled: false
persistence:
  enabled: true
  storageClass: ceph-filesystem
initTLSSecret:
  tls_enabled: true
  secret: ldap-tls-secret
replication:
  enabled: true
  # Enter the name of your cluster, defaults to "cluster.local"
  clusterName: "cluster.local"
  retry: 60
  timeout: 1
  interval: 00:00:00:10
  starttls: "critical"
  tls_reqcert: "never"
customSchemaFiles:
  # enable memberOf ldap search functionality, users automagically track groups they belong to
  00-memberof.ldif: |-
    # Load memberof module
    dn: cn=module,cn=config
    cn: module
    objectClass: olcModuleList
    olcModuleLoad: memberof
    olcModulePath: /opt/bitnami/openldap/lib/openldap

    dn: olcOverlay=memberof,olcDatabase={2}mdb,cn=config
    changetype: add
    objectClass: olcOverlayConfig
    objectClass: olcMemberOf
    olcOverlay: memberof
    olcMemberOfRefint: TRUE
customLdifFiles:
  00-root.ldif: |-
    # Root creation
    dn: dc=spgrn,dc=com
    objectClass: dcObject
    objectClass: organization
    o: spgrn
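For illustration, the per-replica comparison can be done with something like this (the pod/service names, bind DN, password variable, and group objectClass are placeholders, not values from the chart):
# Query each replica directly through the headless service and compare the member lists
for i in 0 1 2; do
  echo "--- replica $i ---"
  ldapsearch -x -H "ldap://openldap-$i.openldap-headless.default.svc.cluster.local:1389" \
    -D "cn=admin,dc=spgrn,dc=com" -w "$LDAP_ADMIN_PASSWORD" \
    -b "dc=spgrn,dc=com" "(objectClass=groupOfNames)" member
done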
I'm also observing replication errors with chart version 4.2.2 and the default image jpgouin/openldap:2.6.6-fix, using the default start_tls=critical (this is my first use of this chart though, and I'm just learning LDAP).
initTLSSecret:
  tls_enabled: true
  # The name of a kubernetes.io/tls type secret to use for TLS
  secret: "openldap-tls"
The configured certificates seem valid to me: ldaps:// client connections are properly accepted, the certificate includes the FQDNs used by the replication through the headless service, and it is properly validated by an openssl s_client -connect openldap-stack-ha-2.openldap-stack-ha-headless.10-openldap-ha.svc:1636 command.
I see no improvement when switching to start_tls=yes, or when changing tls_reqcert=never to tls_reqcert=allow.
2024-05-16T13:23:08.484303481Z openldap-stack-ha-1 664608bc.1cd94290 0x7fcd922fc700 slap_client_connect: URI=ldap://openldap-stack-ha-0.openldap-stack-ha-headless.10-openldap-ha.svc.cluster.local:1389 Warning, ldap_start_tls failed (2)
2024-05-16T13:23:08.435147186Z openldap-stack-ha-0 664608bc.19ec3d8e 0x7f19277fe700 conn=1155 op=1 BIND dn="cn=admin,cn=config" method=128
2024-05-16T13:23:08.435164939Z openldap-stack-ha-0 664608bc.19ef200c 0x7f19277fe700 conn=1155 op=1 RESULT tag=97 err=53 qtime=0.000034 etime=0.000701 text=unauthenticated bind (DN with no password) disallowed
2024-05-16T13:23:08.486494116Z openldap-stack-ha-1 664608bc.1cf87157 0x7fcd922fc700 slap_client_connect: URI=ldap://openldap-stack-ha-0.openldap-stack-ha-headless.10-openldap-ha.svc.cluster.local:1389 DN="cn=admin,cn=config" ldap_sasl_bind_s failed (53)
2024-05-16T13:23:08.487243949Z openldap-stack-ha-1 664608bc.1cfc1a8d 0x7fcd922fc700 do_syncrepl: rid=001 rc 53 retrying
2024-05-16T13:23:08.436829570Z openldap-stack-ha-0 664608bc.1a084e8f 0x7f1927fff700 conn=1155 op=2 UNBIND
2024-05-16T13:23:08.437169313Z openldap-stack-ha-0 664608bc.1a0bc8eb 0x7f1927fff700 conn=1155 fd=16 closed
2024-05-16T13:23:08.498556019Z openldap-stack-ha-1 664608bc.1da2ccf3 0x7fcd92afd700 slap_client_connect: URI=ldap://openldap-stack-ha-2.openldap-stack-ha-headless.10-openldap-ha.svc.cluster.local:1389 DN="cn=admin,cn=config" ldap_sasl_bind_s failed (53)
2024-05-16T13:23:08.498607331Z openldap-stack-ha-1 664608bc.1dab8522 0x7fcd92afd700 do_syncrepl: rid=003 rc 53 retrying
Increasing the log levels shows some additional errors:
do_extended: unsupported operation "1.3.6.1.4.1.1466.20037"
(AFAIK this indicates the start_tls upgrade directive failed)

unauthenticated bind (DN with no password) disallowed
(does it indicate that client TLS authentication was expected despite tls_reqcert=never or tls_reqcert=allow?)

[pod/openldap-stack-ha-1/openldap-stack-ha] 2024-05-17T10:37:54.003667278Z 66473382.0032915e 0x7f33150fb700 TLS trace: SSL_accept:SSLv3/TLS write session ticket
[pod/openldap-stack-ha-1/openldap-stack-ha] 2024-05-17T10:37:54.003734247Z 66473382.0033f0ec 0x7f33150fb700 connection_read(16): unable to get TLS client DN, error=49 id=1342
[pod/openldap-stack-ha-1/openldap-stack-ha] 2024-05-17T10:37:54.014160237Z 66473382.00d6d1cf 0x7f331dd09700 tls_read: want=5 error=Resource temporarily unavailable
[pod/openldap-stack-ha-1/openldap-stack-ha] 2024-05-17T10:37:54.014429444Z 66473382.00d96cfd 0x7f331dd09700 ldap_read: want=8 error=Resource temporarily unavailable
[pod/openldap-stack-ha-1/openldap-stack-ha] 2024-05-17T10:37:54.019977684Z 66473382.012edf56 0x7f331dd09700 send_ldap_result: err=53 matched="" text="unauthenticated bind (DN with no password) disallowed"
[pod/openldap-stack-ha-1/openldap-stack-ha] 2024-05-17T10:37:54.020172425Z 66473382.0132768c 0x7f331dd09700 send_ldap_response: msgid=2 tag=97 err=53
Looking at the LDAP replication doc at https://www.zytrax.com/books/ldap/ch6/#syncrepl for other workarounds, the only option I could spot is specifying an explicit ldaps:// protocol in the replication URL instead of relying on start_tls to dynamically upgrade the plain connection to TLS (sketched below).
Any other ideas for diagnostics or a fix/workaround?
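For illustration only (the chart does not currently template this, and port 1636 is assumed to be the ldaps port used elsewhere in this thread), the syncrepl provider would roughly change from ldap://…:1389 plus starttls=critical to something like:
olcSyncrepl: {0}rid=001 provider=ldaps://openldap-stack-ha-0.openldap-stack-ha-headless.10-openldap-ha.svc.cluster.local:1636 binddn="cn=admin,cn=config" ...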
Surprisingly:
openldap-stack-ha 668517c6.117f8772 0x7f249effd6c0 slap_client_connect: URI=ldap://openldap-stack-ha-2.openldap-stack-ha-headless.10-openldap-ha.svc.cluster.local:1389 DN="cn=admin,cn=config" ldap_sasl_bind_s failed (53)
openldap-stack-ha 668517c6.1187c492 0x7f249effd6c0 do_syncrepl: rid=003 rc 53 retrying
These might just be polluting traces that should be ignored?!
I tried bumping to chart openldap-stack-ha@4.2.5 (which still uses image jpgouin/openldap:2.6.7-fix by default) without improvement.
Besides, there seem to be polluting traces in the output due to the TCP probes configured in the Helm chart, which connect to the LDAP daemon without sending any payload.
openldap-stack-ha 668515c8.114b1efa 0x7f24a4b456c0 conn=1004 fd=13 ACCEPT from IP=10.42.3.1:33808 (IP=0.0.0.0:1389)
openldap-stack-ha 668515c8.1162d80c 0x7f249ffff6c0 conn=1004 fd=13 closed (connection lost)
I guess this could be avoided by using an exec probe command (using an LDAP client) instead of the TCP probe at https://github.com/jp-gouin/helm-openldap/blob/17694f492ba7b7e3d1db44ab1dcfc828b516888e/templates/statefulset.yaml#L211-L213, or by defining a custom probe command in values.yaml, e.g. along the lines of the sketch below.
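For illustration, an LDAP-level check could be something like the following (port 1389 and the anonymous root-DSE query are assumptions based on the defaults seen above; the chart would need to run this as an exec probe):
# succeeds (exit code 0) only if slapd answers an actual LDAP request, not just a TCP connect
ldapsearch -x -H ldap://localhost:1389 -b "" -s base "(objectClass=*)" namingContexts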
Double-checking the current CI w.r.t. certs, where the CA cert is generated at https://github.com/jp-gouin/helm-openldap/blob/850ca5b6375234663d122a8d53d10a27a5071869/.github/workflows/ci-ha.yml#L18-L20
The difference with my setup is that the custom CA cert is distinct from the server certificate; however, the following commands properly validate the TLS certs. I also mounted the CA cert into /etc/ssl/certs/ using a custom volume:
openssl s_client -connect openldap-stack-ha-2.openldap-stack-ha-headless.10-openldap-ha.svc:1636 -CAfile /etc/ssl/certs/ca-certificates.crt
openssl s_client -connect openldap-stack-ha-2.openldap-stack-ha-headless.10-openldap-ha.svc:1636 -CAfile /opt/bitnami/openldap/certs/ca.crt
depth=1 C = USA, O = Cloud Foundry, CN = internalCA
verify return:1
depth=0 CN = ldap-ha.internal.paas
verify return:1
DONE
...X509v3 Subject Alternative Name: DNS:elpaaso-ldap.internal.paas, DNS:ldap-ha.internal.paas, DNS:openldap-stack-ha.10-openldap-ha.svc, DNS:openldap-stack-ha.10-openldap-ha, DNS:openldap-stack-ha-0.openldap-stack-ha-headless.10-openldap-ha.svc.cluster.local, DNS:openldap-stack-ha-1.openldap-stack-ha-headless.10-openldap-ha.svc.cluster.local,DNS:openldap-stack-ha-2.openldap-stack-ha-headless.10-openldap-ha.svc.cluster.local
which is in sync with the FQDNs used in olcSyncrepl:
olcSyncrepl: {0}rid=001 provider=ldap://openldap-stack-ha-0.openldap-stack-ha-headless.10-openldap-ha.svc.cluster.local:1389 binddn="cn=admin,cn=config" bi
@jp-gouin Are you aware of setups where a custom self-signed CA (distinct from the server cert) is used and no such replication error logs are observed?
Hi @gberche-orange,
Thanks for the probe hint, I'll make sure to fix that in the upcoming update.
Regarding your replication issue: to me, as long as you have replication.tls_reqcert: "never", the cert should not matter.
But yes, I can also see some « pollution » in my logs which does not affect the replication.
Maybe by properly handling the cert for all replicas with the proper SANs the pollution would disappear, but that might not be an easy task and would probably be painful for users who want to use their own certs.
Thanks @jp-gouin for your prompt response!
I can also see some « pollution » in my logs which does not affect the replication.
This is good to hear. Would you mind sharing some extracts to confirm they match what I was reporting?
Regarding your replication issue: to me, as long as you have replication.tls_reqcert: "never", the cert should not matter.
Reading the documentation below, I'm concerned that setting tls_reqcert: "never" will result in the client not authenticating the server through its certificate, and hence being vulnerable to man-in-the-middle attacks that spoof the server IP. This might however be hard to exploit in a k8s deployment where the headless service FQDN is used for replication. I'll try to test the option and confirm whether it is enough to make the polluting logs go away. Edit: tls_reqcert: "never" is the default value my tests were run with, and they show the reported polluting logs.
Did you ever consider supporting an option in the chart to use an explicit ldaps:// protocol to the ldap-tls port 1636 in the replication URL, instead of relying on the ldap port 1389 with the start_tls directive to dynamically upgrade the plain connection to TLS (as suggested in https://github.com/jp-gouin/helm-openldap/issues/148#issuecomment-2117541757)?
maybe by properly handling the cert for all replicas using the proper SAN the pollution might disappear
In my setup, despite the SAN including all replicas as illustrated below, the polluting logs are still there. Can you think of missing SANs I should try to add?
DNS:openldap-stack-ha-0.openldap-stack-ha-headless.10-openldap-ha.svc.cluster.local, DNS:openldap-stack-ha-1.openldap-stack-ha-headless.10-openldap-ha.svc.cluster.local, DNS:openldap-stack-ha-2.openldap-stack-ha-headless.10-openldap-ha.svc.cluster.local
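For reference, the SAN list above can be extracted from a running replica with a pipeline such as this (host name taken from the earlier openssl commands; -ext requires OpenSSL 1.1.1 or later):
openssl s_client -connect openldap-stack-ha-2.openldap-stack-ha-headless.10-openldap-ha.svc:1636 </dev/null 2>/dev/null \
  | openssl x509 -noout -ext subjectAltName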
Indeed, ldaps for the replication was my first option back when I created the chart, but I never managed to get it working properly, so I used start_tls to still have encrypted communication.
I agree with you that this is not man-in-the-middle proof; it might be worth trying again...
If you want to try and submit a PR, that would be greatly appreciated 😀
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Any news?
I'm also experiencing replication issues, but TLS does not seem to be the problem. --> it is, read the edit below.
I noticed that when deploying a fresh cluster, the openldap-0 pod initializes fine with no crashes, but all the other pods, e.g. openldap-1, crash with the following error during init:
66df1395.020ee154 0x7f9966ba06c0 conn=1013 op=1 ADD dn="cn=module,cn=config"
66df1395.0211fca5 0x7f9966ba06c0 module_load: (ppolicy.so) already loaded
66df1395.02130b59 0x7f9966ba06c0 olcModuleLoad: value #0: <olcModuleLoad> handler exited with 1!
66df1395.02143821 0x7f9966ba06c0 conn=1013 op=1 RESULT tag=105 err=80 qtime=0.000057 etime=0.000426 text=<olcModuleLoad> handler exited with 1
ldap_add: Other (e.g., implementation specific) error (80)
additional info: <olcModuleLoad> handler exited with 1
adding new entry "cn=module,cn=config"
It seems that this blocks the replica from properly initializing and any writes into this replica will not be sync'd into the other replicas. Writes to openldap-0 are properly replicated though.
EDIT:
Actually, it seems the issue is indeed related to TLS. It may be caused by the crash mentioned previously (not entirely clear).
It seems that openldap-0 (the first pod to be initialized) has the CA path configured:
kubectl exec -n keycloak-iam openldap-0 -it -- bash -c "grep -rn ca.crt /bitnami"
Defaulted container "openldap-stack-ha" out of: openldap-stack-ha, init-schema (init), init-tls-secret (init)
/bitnami/openldap/slapd.d/cn=config.ldif:20:olcTLSCACertificateFile: /opt/bitnami/openldap/certs/ca.crt
If we have a look at any other pod, nothing:
kubectl exec -n keycloak-iam openldap-1 -it -- bash -c "grep -rn ca.crt /bitnami"
Defaulted container "openldap-stack-ha" out of: openldap-stack-ha, init-schema (init), init-tls-secret (init)
command terminated with exit code 1
So these pods have no idea where to fetch the CA, hence the errors.
This setting is indeed applied by the initialization script: https://github.com/bitnami/containers/blob/deb6cea75770638735e164915b4bfd6add27860e/bitnami/openldap/2.6/debian-12/rootfs/opt/bitnami/scripts/libopenldap.sh#L735
So I think this chart or the docker images it uses need some patching to avoid the containers crashing in the init scripts...
Mitigation in the chart: edit the command for the openldap container:
command:
  - sh
  - -c
  - |
    host=$(hostname)
    if [ "$host" = "{{ template "openldap.fullname" . }}-0" ]
    then
      echo "This is the first openldap pod so let's init all additional schemas and ldifs here"
    else
      echo "This is not the first openldap pod so let's not init anything"
      # unset configurations that are cluster-wide and should not be re-applied
      unset LDAP_CONFIGURE_PPOLICY LDAP_PPOLICY_HASH_CLEARTEXT
      # do not attempt to create the default tree as it is already created by pod 0
      export LDAP_SKIP_DEFAULT_TREE=yes
    fi
    /opt/bitnami/scripts/openldap/entrypoint.sh /opt/bitnami/scripts/openldap/run.sh
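As an alternative one-off manual workaround (only a sketch: the cn=config bind DN, password variable, port, and CA path are assumptions based on the defaults discussed in this thread), the missing attribute could presumably be added on each affected replica with ldapmodify:
# run inside an affected pod, e.g. openldap-1
ldapmodify -x -H ldap://localhost:1389 \
  -D "cn=admin,cn=config" -w "$LDAP_CONFIG_ADMIN_PASSWORD" <<'EOF'
dn: cn=config
changetype: modify
add: olcTLSCACertificateFile
olcTLSCACertificateFile: /opt/bitnami/openldap/certs/ca.crt
EOF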
Hello, I tried this chart a few weeks/a month ago on Azure AKS and didn't observe this issue, but trying it this week on a new bare-metal k8s cluster gives me this error. I noticed that the chart has been upgraded to new versions and a lot has changed.
Deploying a new OpenLDAP install gives me the error below.
Steps to replicate:
I used chart version 4.2.1, which uses the current OpenLDAP version, 2.6.6.
The following values are the user-supplied values.
This created two OpenLDAP pods. I logged into each pod and verified that the changes are not replicating; the logs show the error above.
Logs from pod-0