samba-in-kubernetes / samba-container

Build Samba Container Images / Kubernetes & Container Runtime Example Files
GNU General Public License v3.0

Occasionally clients can't discover AD Global Catalog server #160

Open martinpitt opened 9 months ago

martinpitt commented 9 months ago

I've been debugging a big Cockpit AD test flake for three days now, and still can't put my finger on it, so maybe you have an idea. This started failing when we moved from https://github.com/Fmstrat/samba-domain/ to https://quay.io/repository/samba.org/samba-ad-server, i.e. the client side didn't change. What this test does is roughly this:
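
Roughly (a sketch reconstructed from the reproducer further down this thread, not the literal test code; ADMIN_USER/ADMIN_PASSWORD stand in for the test credentials):

# server side: run the Samba AD DC container for cockpit.lan and point the client VM's DNS at it
# client side:
realm discover cockpit.lan                     # should find the AD domain served by the container
echo "$ADMIN_PASSWORD" | realm join -vU "$ADMIN_USER" cockpit.lan
id alice                                       # AD user lookup via sssd; this is the step that flakes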

This works most of the time. After joining:

# sssctl domain-status cockpit.lan
Online status: Online

Active servers:
AD Global Catalog: f0.cockpit.lan
AD Domain Controller: f0.cockpit.lan

But in about 10% of local runs and 50% of runs in CI, it looks like this:

Online status: Offline

Active servers:
AD Global Catalog: not connected
AD Domain Controller: cockpit.lan

and /var/log/sssd/sssd_cockpit.lan.log has a similar error:

   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_get_account_info_send] (0x0200): Got request for [0x1][BE_REQ_USER][name=alice@cockpit.lan]
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_attach_req] (0x0400): [RID#5] DP Request [Account #5]: REQ_TRACE: New request. [sssd.nss CID #4] Flags [0x0001].
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_attach_req] (0x0400): [RID#5] [CID #4] Backend is offline! Using cached data if available
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_attach_req] (0x0400): [RID#5] Number of active DP request: 1
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [sss_domain_get_state] (0x1000): [RID#5] Domain cockpit.lan is Active
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [_dp_req_recv] (0x0400): DP Request [Account #5]: Receiving request data.
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_req_destructor] (0x0400): DP Request [Account #5]: Request removed.
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [dp_req_destructor] (0x0400): Number of active DP request: 0
   *  (2023-11-17  0:47:14): [be[cockpit.lan]] [sbus_issue_request_done] (0x0040): sssd.dataprovider.getAccountInfo: Error [1432158212]: SSSD is offline
********************** BACKTRACE DUMP ENDS HERE *********************************

(2023-11-17  0:47:15): [be[cockpit.lan]] [ad_sasl_log] (0x0040): [RID#6] SASL: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Server krbtgt/LAN@COCKPIT.LAN not found in Kerberos database)
   *  ... skipping repetitive backtrace ...
(2023-11-17  0:47:15): [be[cockpit.lan]] [sasl_bind_send] (0x0020): [RID#6] ldap_sasl_interactive_bind_s failed (-2)[Local error]
   *  ... skipping repetitive backtrace ...
(2023-11-17  0:47:15): [be[cockpit.lan]] [sdap_cli_connect_recv] (0x0040): [RID#6] Unable to establish connection [1432158227]: Authentication Failed
   *  ... skipping repetitive backtrace ...
(2023-11-17  0:47:19): [be[cockpit.lan]] [resolv_gethostbyname_done] (0x0040): querying hosts database failed [5]: Input/output error
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
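
For what it's worth, the krbtgt/LAN@COCKPIT.LAN principal in the GSSAPI error above suggests the client momentarily derived a realm of just "LAN" for the server's hostname. A manual Kerberos trace on the client can show which principals are actually being requested while the failure happens (a diagnostic idea of mine, not something from the original report):

# both tools ship with the krb5 client utilities; run while the failure is occurring
KRB5_TRACE=/dev/stderr kinit Administrator@COCKPIT.LAN     # get a TGT, with tracing
KRB5_TRACE=/dev/stderr kvno ldap/f0.cockpit.lan            # ask for the LDAP service ticket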

This is a race condition: I can gradually strip the test down until it doesn't involve Cockpit at all any more; the only effect Cockpit has is to cause some I/O and CPU noise (like packagekit checking for updates). I can synthesize this with client-side commands like these:

        m.write("/etc/realmd.conf", "[cockpit.lan]\nfully-qualified-names = no\n", append=True)
        m.spawn("for i in $(seq 10); do grep -r . /usr >&2; done", "noise")
        time.sleep(1)
        self.assertIn("cockpit.lan", m.execute("realm discover"))
        m.execute(f"echo '{self.admin_password}' | realm join -vU {self.admin_user} cockpit.lan")m
        m.execute('while ! id alice; do sleep 5; done', timeout=300)

This is Cockpit test API lingo: m.execute() just runs a shell command on the client VM, while m.spawn() runs one in the background.
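
Outside the test harness, the same sequence would look roughly like this in plain shell (my paraphrase; ADMIN_USER/ADMIN_PASSWORD are placeholders for the test credentials):

printf '[cockpit.lan]\nfully-qualified-names = no\n' >> /etc/realmd.conf
( for i in $(seq 10); do grep -r . /usr >&2; done ) &   # background I/O and CPU noise
sleep 1
realm discover | grep cockpit.lan
echo "$ADMIN_PASSWORD" | realm join -vU "$ADMIN_USER" cockpit.lan
timeout 300 sh -c 'while ! id alice; do sleep 5; done'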

Do you happen to have any idea how to investigate further what exactly fails here?

martinpitt commented 9 months ago

I tried to create a reproducer that runs on a standard RHEL 9.4 cloud image (as that's where it fails most often, but it also fails on C8S, Fedora 39, etc.).

First, some prep:

systemctl stop firewalld
hostnamectl set-hostname x0.cockpit.lan
logout
# log back in to pick up changed host name

Set up the Samba container:

cat <<EOF > /tmp/samba-ad.json
{
  "samba-container-config": "v0",
  "configs": {
    "demo": {
      "instance_features": ["addc"],
      "domain_settings": "sink",
      "instance_name": "smb"
    }
  },
  "domain_settings": {
    "sink": {
      "realm": "COCKPIT.LAN",
      "short_domain": "COCKPIT",
      "admin_password": "foobarFoo123"
    }
  }
}
EOF
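
A quick sanity check that the file is valid JSON before handing it to the container (just my habit, not part of the original steps):

python3 -m json.tool /tmp/samba-ad.json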

SERVER_IP=$(ip route show | grep -oP 'src \K\S+' | head -n1)

# necessary?
# echo "$SERVER_IP x0.cockpit.lan x0" >> /etc/hosts

podman run -d --rm --name samba --privileged \
    -p $SERVER_IP:53:53/udp -p 389:389 -p 389:389/udp -p 445:445 \
    -p 88:88 \
    -p 88:88/udp \
    -p 135:135 \
    -p 137-138:137-138/udp \
    -p 139:139 \
    -p 464:464 \
    -p 464:464/udp \
    -p 636:636 \
    -p 1024-1044:1024-1044 \
    -p 3268-3269:3268-3269 \
    -v /tmp/samba-ad.json:/etc/samba/container.json \
    -h smb.cockpit.lan \
    quay.io/samba.org/samba-ad-server
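
While the DC provisions itself it can help to watch the container output (an extra step on my part, not in the original recipe):

podman logs -f samba   # Ctrl-C once provisioning settles; the container keeps running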

nmcli con mod 'System eth0' ipv4.ignore-auto-dns yes ipv4.dns $SERVER_IP
systemctl restart NetworkManager
# echo "nameserver $SERVER_IP" > /etc/resolv.conf

# wait until server is running
until nslookup -type=SRV _ldap._tcp.cockpit.lan; do sleep 1; done
until nc -z $SERVER_IP 389; do sleep 1; done
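
Since the symptom is specifically the Global Catalog not being found, it may also be worth waiting for the GC SRV record and port explicitly (my addition; AD publishes the GC under _gc._tcp.<domain> and serves it on port 3268):

until nslookup -type=SRV _gc._tcp.cockpit.lan; do sleep 1; done
until nc -z $SERVER_IP 3268; do sleep 1; done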

# add AD user
podman exec -i samba samba-tool user add alice foobarFoo123
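
To double-check that the account actually landed on the DC (again an extra check, not in the original steps):

podman exec samba samba-tool user list | grep -i alice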

Now the AD client side:

printf '[cockpit.lan]\nfully-qualified-names = no\n' > /etc/realmd.conf
# this should show COCKPIT.LAN
realm discover
# cockpit.lan type kerberos, client-software: sssd, etc

echo foobarFoo123 | realm join -vU Administrator cockpit.lan

This succeeds.
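
At this point the join state can be inspected with a couple of extra checks (my additions):

realm list             # cockpit.lan should show up as configured: kerberos-member
sssctl domain-list     # should list cockpit.lan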

id alice fails, and sssctl domain-status cockpit.lan is in a semi-broken state: it says "Online" (rather than "Offline", as in our failing test), but it still cannot find the Global Catalog:

Online status: Online

Active servers:
AD Global Catalog: not connected
AD Domain Controller: smb.cockpit.lan

Discovered AD Global Catalog servers:
None so far.
Discovered AD Domain Controller servers:
- smb.cockpit.lan
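
When it sits in this state, two things I would check from the client (suggestions of mine, not from the original report): whether the GC SRV records resolve at all, and whether kicking sssd makes it recover:

nslookup -type=SRV _gc._tcp.cockpit.lan
nslookup -type=SRV _ldap._tcp.gc._msdcs.cockpit.lan
sss_cache -E && systemctl restart sssd
sssctl domain-status cockpit.lan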

The sssd log is rather empty:

# cat /var/log/sssd/sssd_cockpit.lan.log
(2023-11-20  6:07:28): [be[cockpit.lan]] [server_setup] (0x3f7c0): Starting with debug level = 0x0070

All the other log files look similar.
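
The server_setup line shows sssd running at its default debug level (0x0070); raising it should make the logs more useful the next time this happens (standard sssd knobs, nothing specific to this reproducer):

sssctl debug-level 9   # raise verbosity on the running sssd
# or persistently: set debug_level = 9 in the [domain/cockpit.lan] section of /etc/sssd/sssd.conf,
# then: systemctl restart sssd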

So this clearly does not reproduce the actual flake/error, but I'm lost here. Do you have a hint on how to fix this CLI reproducer? Once it works in general, I hope I can make it flake/error like our actual test (which is hard to debug as there are so many moving parts).

Thanks!

phlogistonjohn commented 9 months ago

CC: @gd

martinpitt commented 8 months ago

We moved back to https://github.com/Fmstrat/samba-domain a while ago, and while that made it better, we still see that bug a lot. As this happens on two completely different OSes and Samba packagings (Fedora and Ubuntu), this looks like a regression in Samba itself. Our current container has Samba 4.15.13. But I still have no idea where to go from here.

phlogistonjohn commented 8 months ago

> We moved back to https://github.com/Fmstrat/samba-domain a while ago, and while that made it better, we still see that bug a lot.

I'm sorry to hear that. Both for the change and for the issue.

> As this happens on two completely different OSes and Samba packagings (Fedora and Ubuntu), this looks like a regression in Samba itself. Our current container has Samba 4.15.13. But I still have no idea where to go from here.

It is certainly possible.

We build images tagged nightly that include nightly builds of samba master. Could you try quay.io/samba.org/samba-ad-server:nightly and see if the issue occurs there too? If so, we may want to report the issue at the samba bugzilla.
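
A minimal way to try that, reusing the reproducer from earlier in this thread and only changing the image tag (the version check at the end is just my suggestion):

podman pull quay.io/samba.org/samba-ad-server:nightly
# re-run the podman run command from the reproducer, ending in quay.io/samba.org/samba-ad-server:nightly
podman exec samba samba -V   # confirm which Samba build is actually running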

Also sorry for the lack of response earlier. I saw this issue when I was on vacation and pinged my manager at work, hoping he'd have someone else look into it. But I guess that didn't happen, and from my POV it fell through the cracks.

martinpitt commented 8 months ago

No worries at all @phlogistonjohn ! Thanks for the hint, I'll try the nightly image, in January (this is EOY for me as well). Happy holidays!