vmware-archive / lightwave

Identity services for traditional infrastructure, applications and containers.
Other
323 stars 100 forks source link

system journal fills with errors: _VmDirConsumePartner failed, error code (51) #39

Open oddboy opened 5 years ago

oddboy commented 5 years ago

Describe the bug Once lightwave is installed on two domain controllers, the logs fill with errors like:

Dec 31 21:53:49 dc01 vmdird[1532]: t@139860643936000: _VmDirConsumePartner failed, error code (51) Dec 31 21:54:10 dc01 vmdird[1532]: t@139860643936000: _VmDirConsumePartner failed, error code (51) Dec 31 21:54:34 dc01 vmdird[1532]: t@139860643936000: _VmDirConsumePartner failed, error code (51) Dec 31 21:54:36 dc01 vmdird[1532]: t@139860643936000: _VmDirConsumePartner failed, error code (51) Dec 31 21:55:38 dc01 vmdird[1532]: t@139860643936000: _VmDirConsumePartner failed, error code (51)

From looking at the source code, it appears that error code 51 is "LDAP SERVER BUSY". I have found some references to this pertaining to OpenLDAP and Microsoft's AD LDAP implementation. However, neither domain controller is overly busy, CPU utilization less than 5% generally with 0.2 load average.

Impact

Expected behavior Unclear. I can't tell if there is a problem or not. Changes to the directory (like create a user) seem to work correctly and get replicated to peers.

Observed behavior The error seems to imply that there is a communication problem between domain controller nodes. The by product of this is that the logs are filled with these messages about every 40 seconds or so. There are times during which this message is not logged, but those often occur when the systems (in the "lab") are not being used for much of anything, so I'm not sure if these log lines are indication of an actual problem.

To Reproduce Steps to reproduce the behavior:

  1. Install and configure lightwave on first node.
  2. Replace default certificates with enterprise signed CA
  3. Add a secondary node to domain
  4. Check logs, see error

Environment:

Additional context As noted above, the only deviation from a very default install in my case was that I replaced the default certificate on the first node with one signed by our enterprise CA. Note however that the same problem occurs even with a fully default configuration, that is, with no customizations at all.

oddboy commented 5 years ago

additional logging seems to reveal that this error is caused by a failure of CLDAP queries, aka "LDAP Pings". For reference, https://ldapwiki.com/wiki/LDAP%20ping.

Jan 04 04:34:24 dc01 vmdird[1168]: t@140192521856768: VmDirSendLdapResponse: code (97), Error (14)() Jan 04 04:34:24 dc01 vmdird[1168]: t@140192521856768: VmDirSendLdapResponse: code (97), Error (14)() Jan 04 04:34:24 dc01 vmdird[1168]: t@140192521856768: Request Control: 1.3.6.1.4.1.4203.1.9.1.1 Jan 04 04:34:25 dc01 vmdird[1168]: t@140193663264512: _VmDirConsumePartner failed, error code (51) Jan 04 04:34:25 dc01 vmdird[1168]: t@140192521856768: VmDirSendLdapResult: Request (Search), Error (0), Message (), (0) socket (192.168.3.118)

the OID in the above log extract is the one OpenLDAP uses for these CLDAP messages. From what I understand, these are meant to be done over UDP.

However, it appears that on my lightwave installation, nothing is listening on UDP/389.

So, it appears as if any functions that depend on the status of these CLDAP pings fail because nothing is listending on UDP/389.

is there a way to get vmdird to listen on UDP?

oddboy commented 5 years ago

a packet capture that shows a SRV DNS querie for _ldap._tcp.dc._msdcs., followed by a series of A and AAAA lookups for the listed DCs, then a connection attempt from the client trying to join to both of the DCs on UDP/389, followed by ICMP unreachable messages indicating that there's nothing accepting connections on the DC's on UDP/389.

screen shot here

so, I guess it's a likewise problem rather than a lightwave / vmdir issue?

not sure where to go from here...

sruo commented 5 years ago

Describe the bug Once lightwave is installed on two domain controllers, the logs fill with errors like:

Dec 31 21:53:49 dc01 vmdird[1532]: t@139860643936000: _VmDirConsumePartner failed, error code (51) Dec 31 21:54:10 dc01 vmdird[1532]: t@139860643936000: _VmDirConsumePartner failed, error code (51) Dec 31 21:54:34 dc01 vmdird[1532]: t@139860643936000: _VmDirConsumePartner failed, error code (51) Dec 31 21:54:36 dc01 vmdird[1532]: t@139860643936000: _VmDirConsumePartner failed, error code (51) Dec 31 21:55:38 dc01 vmdird[1532]: t@139860643936000: _VmDirConsumePartner failed, error code (51)

Hi Oddboy: Above error is from consumer is benign. In this version, when a node is pulling changes from a partner, it will send LDAP_BUSY(51) when receiving a replication search request.

If you upgrade to latest Lightwave version, this problem no longer exits. Let me know if you have further questions.

Thanks.

sruo commented 5 years ago

"

So, it appears as if any functions that depend on the status of these CLDAP pings fail because nothing is listending on UDP/389.

is there a way to get vmdird to listen on UDP? " Current version of vmdir does not listen on UPD port. Which feature/function in your use case requires CLDAP (UDP) ping?

oddboy commented 5 years ago

I can see that the commands lw-get-dc-name and domainjoin-cli appear to use the CLDAP queries as these fail with error:

Failed to query time on domain controller. Error code 2453 (NERR_DCNotFound).

I'm not sure what other effect this might be having