Open martinpitt opened 9 months ago
I tried to create a reproducer that runs on a standard RHEL 9.4 cloud image (as that's where it fails most often, but it also fails on C8S, Fedora 39, etc.).
First, some prep:
systemctl stop firewalld
hostnamectl set-hostname x0.cockpit.lan
logout
# log back in to pick up changed host name
Set up the Samba container:
cat <<EOF > /tmp/samba-ad.json
{
  "samba-container-config": "v0",
  "configs": {
    "demo": {
      "instance_features": ["addc"],
      "domain_settings": "sink",
      "instance_name": "smb"
    }
  },
  "domain_settings": {
    "sink": {
      "realm": "COCKPIT.LAN",
      "short_domain": "COCKPIT",
      "admin_password": "foobarFoo123"
    }
  }
}
EOF
SERVER_IP=$(ip route show | grep -oP 'src \K\S+' | head -n1)
# necessary?
# echo "$SERVER_IP x0.cockpit.lan x0" >> /etc/hosts
podman run -d --rm --name samba --privileged \
    -p $SERVER_IP:53:53/udp -p 389:389 -p 389:389/udp -p 445:445 \
    -p 88:88 \
    -p 88:88/udp \
    -p 135:135 \
    -p 137-138:137-138/udp \
    -p 139:139 \
    -p 464:464 \
    -p 464:464/udp \
    -p 636:636 \
    -p 1024-1044:1024-1044 \
    -p 3268-3269:3268-3269 \
    -v /tmp/samba-ad.json:/etc/samba/container.json \
    -h smb.cockpit.lan \
    quay.io/samba.org/samba-ad-server
nmcli con mod 'System eth0' ipv4.ignore-auto-dns yes ipv4.dns $SERVER_IP
systemctl restart NetworkManager
# echo "nameserver $SERVER_IP" > /etc/resolv.conf
# wait until server is running
until nslookup -type=SRV _ldap._tcp.cockpit.lan; do sleep 1; done
until nc -z $SERVER_IP 389; do sleep 1; done
# add AD user
podman exec -i samba samba-tool user add alice foobarFoo123
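The two until loops above spin forever if the container never comes up, which makes a hung reproducer hard to tell apart from a slow one. A minimal sketch of a bounded wait follows; wait_for is a hypothetical helper name, not part of the original reproducer, and the timeout only counts the one-second sleeps, not the time the probed command itself takes.

```shell
# wait_for TIMEOUT CMD...: retry CMD about once per second until it succeeds,
# giving up after roughly TIMEOUT seconds (hypothetical helper).
wait_for() {
    _wf_timeout=$1; shift
    _wf_elapsed=0
    until "$@" >/dev/null 2>&1; do
        if [ "$_wf_elapsed" -ge "$_wf_timeout" ]; then
            echo "timed out after ${_wf_timeout}s: $*" >&2
            return 1
        fi
        sleep 1
        _wf_elapsed=$((_wf_elapsed + 1))
    done
}

# Usage, mirroring the two loops above (120 is an arbitrary choice):
# wait_for 120 nslookup -type=SRV _ldap._tcp.cockpit.lan
# wait_for 120 nc -z "$SERVER_IP" 389
```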
Now the AD client side:
printf '[cockpit.lan]\nfully-qualified-names = no\n' > /etc/realmd.conf
# this should pick up COCKPIT.LAN
realm discover
# cockpit.lan type kerberos, client-software: sssd, etc
echo foobarFoo123 | realm join -vU Administrator cockpit.lan
This succeeds.
id alice fails, and sssctl domain-status cockpit.lan is in a semi-broken state: it says "Online" (instead of "offline" as in our actual test), but it still cannot find the global catalog:
Online status: Online
Active servers:
AD Global Catalog: not connected
AD Domain Controller: smb.cockpit.lan
Discovered AD Global Catalog servers:
None so far.
Discovered AD Domain Controller servers:
- smb.cockpit.lan
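Since no Global Catalog server is discovered, one thing that may be worth checking is whether the container's DNS answers the SRV lookups an AD client typically makes. The sketch below is a diagnostic under assumptions, not a known fix: the record list follows standard AD DNS naming rather than anything confirmed about sssd's lookups, and ad_srv_names and check_ad_srv are hypothetical helper names.

```shell
# Print SRV record names commonly published by an AD domain controller.
# Assumption: standard AD DNS naming; not taken from sssd's source.
ad_srv_names() {
    _domain=$1
    for _service in _ldap._tcp _kerberos._tcp _gc._tcp; do
        echo "${_service}.${_domain}"
    done
}

# Report which of these records resolve through the configured DNS server.
check_ad_srv() {
    for _name in $(ad_srv_names "$1"); do
        if nslookup -type=SRV "$_name" >/dev/null 2>&1; then
            echo "ok      $_name"
        else
            echo "MISSING $_name"
        fi
    done
}

# Usage on the client:
# check_ad_srv cockpit.lan
# nc -z "$SERVER_IP" 3268 && echo "GC port reachable"
```

Port 3268 is used in the usage line because the podman command above publishes 3268-3269.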
The sssd log is rather empty:
# cat /var/log/sssd/sssd_cockpit.lan.log
(2023-11-20 6:07:28): [be[cockpit.lan]] [server_setup] (0x3f7c0): Starting with debug level = 0x0070
All other log files look similar.
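With only the startup line in the log, raising the domain's debug level may give more to work with. A minimal sketch follows; set_sssd_debug is a hypothetical helper, and it assumes GNU sed, that realm join wrote a [domain/cockpit.lan] section to /etc/sssd/sssd.conf, and that this section does not set debug_level yet.

```shell
# set_sssd_debug CONF DOMAIN LEVEL: insert a debug_level line directly after
# the [domain/DOMAIN] header in CONF (hypothetical helper, GNU sed assumed).
set_sssd_debug() {
    _conf=$1
    _domain=$2
    _level=$3
    sed -i "/^\[domain\/${_domain}\]\$/a debug_level = ${_level}" "$_conf"
}

# Usage on the client, followed by a restart so sssd rereads its config:
# set_sssd_debug /etc/sssd/sssd.conf cockpit.lan 9
# systemctl restart sssd
```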
So this clearly does not reproduce the actual flake/error, but I'm lost here. Do you have a hint how to fix this CLI reproducer? Once it works in general, I hope I can make it flake/error like our actual test (which is hard to debug as there are so many moving parts).
Thanks!
CC: @gd
We moved back to https://github.com/Fmstrat/samba-domain a while ago, and while that made it better, we still see that bug a lot. As this happens on two completely different OSes/samba packaging (Fedora and Ubuntu), this looks like a regression in samba itself. Our current container has Samba 4.15.13. But I still have no idea where to go from here.
> We moved back to https://github.com/Fmstrat/samba-domain a while ago, and while that made it better, we still see that bug a lot.
I'm sorry to hear that. Both for the change and for the issue.
> As this happens on two completely different OSes/samba packaging (Fedora and Ubuntu), this looks like a regression in samba itself. Our current container has Samba 4.15.13. But I still have no idea where to go from here.
It is certainly possible.
We build images tagged nightly that include nightly builds of samba master. Could you try quay.io/samba.org/samba-ad-server:nightly and see if the issue occurs there too? If so, we may want to report the issue at the samba bugzilla.
Also, sorry for the lack of response earlier. I saw this issue when I was on vacation and pinged my manager at work, hoping he'd have someone else look into it. But I guess not, and from my POV it fell through the cracks.
No worries at all, @phlogistonjohn! Thanks for the hint, I'll try the nightly image in January (this is EOY for me as well). Happy holidays!
I've been debugging a big Cockpit AD test flake for three days now, and still can't put my finger on it, so maybe you have an idea. This started failing when we moved from https://github.com/Fmstrat/samba-domain/ to https://quay.io/repository/samba.org/samba-ad-server, i.e. the client side didn't change. What this test does is roughly this:
- [...] f0.cockpit.lan), with exporting all ports
- [...] x0.cockpit.lan with realmd, adcli and such.
- [...] alice user in Samba AD on "services"
- [...] alice user is visible, i.e. id alice succeeds.

This works most of the time. After joining:
But in about 10% of local runs and 50% of runs in CI, it looks like this:
and /var/log/sssd/sssd_cockpit.lan.log has a similar error:
This is a race condition -- I can gradually strip down the test until it doesn't involve Cockpit at all any more -- the only effect that it has is to cause some I/O and CPU noise (like packagekit checking for updates). I can synthesize this with client-side commands like this:
This is cockpit test API lingo, but m.execute just runs a shell command on the client VM, while m.spawn() runs it in the background.

Do you happen to have any idea how to investigate further what exactly fails here?
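For anyone who wants to poke at this without the Cockpit test API, the idea of running a check under background load and counting how often it fails can be sketched in plain shell. Everything here is an assumption for illustration: with_noise and count_failures are hypothetical helpers, the busy loop is only a stand-in for whatever I/O and CPU noise the real test generates, and join-and-check.sh is a placeholder for a script that performs one join and runs id alice.

```shell
# with_noise CMD...: run CMD while a busy loop burns CPU in the background,
# then stop the loop and return CMD's exit status (hypothetical helper).
with_noise() {
    ( while :; do :; done ) &
    _noise_pid=$!
    _rc=0
    "$@" || _rc=$?
    kill "$_noise_pid" 2>/dev/null
    wait "$_noise_pid" 2>/dev/null || true
    return "$_rc"
}

# count_failures N CMD...: run CMD N times and print how many runs failed.
count_failures() {
    _runs=$1; shift
    _failures=0
    _i=0
    while [ "$_i" -lt "$_runs" ]; do
        "$@" >/dev/null 2>&1 || _failures=$((_failures + 1))
        _i=$((_i + 1))
    done
    echo "$_failures"
}

# Usage (join-and-check.sh is a placeholder, not an existing script):
# count_failures 20 with_noise ./join-and-check.sh
```

Comparing the failure count with and without with_noise would show whether load alone is enough to trigger the race.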