ozangunalp / kafka-native

Kafka broker compiled to native using Quarkus and GraalVM.
Apache License 2.0
95 stars 13 forks source link

integration tests have started to fail - testKerberosContainer #121

Closed k-wall closed 7 months ago

k-wall commented 7 months ago

With the last merge (pr #120) the integration tests have started to fail on CI.

https://github.com/ozangunalp/kafka-native/actions/runs/7196947318/job/19603932764#step:6:3963

Error:  Tests run: 10, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 112.0 s <<< FAILURE! -- in com.ozangunalp.kafka.test.container.KafkaNativeContainerIT
Error:  com.ozangunalp.kafka.test.container.KafkaNativeContainerIT.testKerberosContainer -- Time elapsed: 63.85 s <<< ERROR!
org.testcontainers.containers.ContainerLaunchException: Container startup failed for image quay.io/ogunalp/kafka-native:main-4a8689368ac5241e0ad9ac5f49cbe4ccd1f5c185

I'm trying to reproduce the issue in my local environment.

I think the container is failing with an issue like this:

(edit: removed misleading stack trace)

I haven't yet worked out why this system property is failing to be mapped to the bean.

k-wall commented 7 months ago

Even if I rollback, I'm still seeing SRCFG00050: server.auto-configure in SysPropConfigSource does not map to any root. I'm not understanding why.

k-wall commented 7 months ago

@ozangunalp @robobario any idea?

k-wall commented 7 months ago

I was invoking the integration-tests incorrectly in my local environment, so I was inadvertently running the image from a different build to the one I'd just built.

The integration tests are now passing for me on main.

k-wall commented 7 months ago

Looking closer I see now its the testKerberosContainer test that is failing on CI, but we don't have the integration logs to work out why. It is not failing for me locally. I've raised #122 in the of discovering why. Wonder if this could be Docker download limits?

2023-12-13T15:24:51.5178159Z [ERROR] Tests run: 10, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 112.0 s <<< FAILURE! -- in com.ozangunalp.kafka.test.container.KafkaNativeContainerIT 2023-12-13T15:24:51.5180949Z [ERROR] com.ozangunalp.kafka.test.container.KafkaNativeContainerIT.testKerberosContainer -- Time elapsed: 63.85 s <<< ERROR! 2023-12-13T15:24:51.5183807Z org.testcontainers.containers.ContainerLaunchException: Container startup failed for image quay.io/ogunalp/kafka-native:main-4a8689368ac5241e0ad9ac5f49cbe4ccd1f5c185 2023-12-13T15:24:51.5186253Z at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:362) 2023-12-13T15:24:51.5187378Z at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:333) 2023-12-13T15:24:51.5189396Z at com.ozangunalp.kafka.test.container.KafkaNativeContainerIT.testKerberosContainer(KafkaNativeContainerIT.java:181) 2023-12-13T15:24:51.5191245Z at java.base/java.lang.reflect.Method.invoke(Method.java:568) 2023-12-13T15:24:51.5192498Z at java.base/java.util.ArrayList.forEach(ArrayList.java:1511) 2023-12-13T15:24:51.5193620Z at java.base/java.util.ArrayList.forEach(ArrayList.java:1511) 2023-12-13T15:24:51.5195041Z Caused by: org.rnorth.ducttape.RetryCountExceededException: Retry limit hit with exception 2023-12-13T15:24:51.5196948Z at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:88) 2023-12-13T15:24:51.5198613Z at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:347) 2023-12-13T15:24:51.5199793Z ... 5 more 2023-12-13T15:24:51.5200937Z Caused by: org.testcontainers.containers.ContainerLaunchException: Could not create/start container 2023-12-13T15:24:51.5202601Z at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:566) 2023-12-13T15:24:51.5203746Z at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:357) 2023-12-13T15:24:51.5205487Z at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81) 2023-12-13T15:24:51.5206664Z ... 6 more 2023-12-13T15:24:51.5207708Z Caused by: java.lang.IllegalStateException: Wait strategy failed. Container exited with code 1 2023-12-13T15:24:51.5209409Z at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:536) 2023-12-13T15:24:51.5210633Z ... 8 more 2023-12-13T15:24:51.5212352Z Caused by: org.testcontainers.containers.ContainerLaunchException: Timed out waiting for log output matching '.Kafka broker started.' 2023-12-13T15:24:51.5214817Z at org.testcontainers.containers.wait.strategy.LogMessageWaitStrategy.waitUntilReady(LogMessageWaitStrategy.java:47) 2023-12-13T15:24:51.5217077Z at org.testcontainers.containers.wait.strategy.AbstractWaitStrategy.waitUntilReady(AbstractWaitStrategy.java:52) 2023-12-13T15:24:51.5219523Z at org.testcontainers.containers.GenericContainer.waitUntilContainerStarted(GenericContainer.java:912) 2023-12-13T15:24:51.5221381Z at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:503)

robobario commented 7 months ago

I don't see that tag in quay from the above log: quay.io/ogunalp/kafka-native:main-4a8689368ac5241e0ad9ac5f49cbe4ccd1f5c185, edit: oh, is that supposed to be something emitted locally only by the quarkus build, there are similar looking tags in quay though.

robobario commented 7 months ago

hmm, dockerhub rate limiting on the kerb container is a good guess but you'd think it would fail faster

robobario commented 7 months ago

One thing I've found is that if you modify the tests so that the Kerberos test runs multiple times consecutively, the test fails on the second run. It can be fixed by adding a sleep between kerberos server creation and starting the kafka server (as well as setting refreshKrb5Config=true on the producer jaas config since the kerb client config is statically cached). Also fixed by using the default wait strategy to wait for the external listening ports to come up on the Kerb container.

It leads to the same failure message in the broker Caused by: java.net.PortUnreachableException so I'm wondering if something similar has happened here. I think there is a period of time between the log Principal "admin/admin@EXAMPLE.COM" created. and the ports listening. Somehow it's reliably longer when you are creating a second kerb container, maybe to do with replacing the established kerberos service? Or it's somehow quicker to start the second kafka-native container. You could imagine some slow CI machine taking longer and hitting this race condition.

Have created a PR to demonstrate.

k-wall commented 7 months ago

I don't have a mental model for Kerberos/Kafka. Does Kafka Broker coming up really require Kerberos to be ready? However - it looks like com.ozangunalp.kafka.test.container.KerberosContainer#KerberosContainer is organising for a wait until the Principal "admin/admin@EXAMPLE.COM" created. message is written by the Kerberos container. So Kerberos should be ready otherwise we'd have seen a startup failure.

Let's get the container logs from CI. Maybe someone could review #122?

k-wall commented 7 months ago

I don't see that tag in quay from the above log: quay.io/ogunalp/kafka-native:main-4a8689368ac5241e0ad9ac5f49cbe4ccd1f5c185, edit: oh, is that supposed to be something emitted locally only by the quarkus build, there are similar looking tags in quay though.

Yeah, it is an ephemeral image that never gets pushed to quay.

ozangunalp commented 7 months ago

I didn't have time to look at this yesterday, taking a look now.

ozangunalp commented 7 months ago

So I merged the #122 but the test no longer fails in the CI? Seems like it is flaky.

I can reproduce the failure locally with the following (stripped) stack:

Debug is  true storeKey true useTicketCache false useKeyTab true doNotPrompt false ticketCache is null isInitiator true KeyTab is /opt/kafka/config/kafkabroker.keytab refreshKrb5Config is true principal is kafka/localhost@EXAMPLE.COM tryFirstPass is false useFirstPass is false storePass is false clearPass is false
Refreshing Kerberos configuration
        [Krb5LoginModule] authentication failed 
null
2023-12-14 03:39:04,178 ERROR [kaf.ser.BrokerServer] (main) [BrokerServer id=0] Fatal error during broker startup. Prepare to shutdown: org.apache.kafka.common.KafkaException: javax.security.auth.login.LoginException
    at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:184)
    at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:192)
    at org.apache.kafka.common.network.ChannelBuilders.serverChannelBuilder(ChannelBuilders.java:107)
    at com.ozangunalp.kafka.server.EmbeddedKafkaBroker.start(EmbeddedKafkaBroker.java:208)
Caused by: javax.security.auth.login.LoginException
    at jdk.security.auth@17.0.8/com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:793)
    at jdk.security.auth@17.0.8/com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:597)
    at java.base@17.0.8/javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
    at java.base@17.0.8/javax.security.auth.login.LoginContext$4.run(LoginContext.java:679)
    at java.base@17.0.8/javax.security.auth.login.LoginContext$4.run(LoginContext.java:677)
    at java.base@17.0.8/java.security.AccessController.executePrivileged(AccessController.java:144)
    at java.base@17.0.8/java.security.AccessController.doPrivileged(AccessController.java:712)
    at java.base@17.0.8/javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:677)
    at java.base@17.0.8/javax.security.auth.login.LoginContext.login(LoginContext.java:587)
    at org.apache.kafka.common.security.authenticator.AbstractLogin.login(AbstractLogin.java:60)
    at org.apache.kafka.common.security.kerberos.KerberosLogin.login(KerberosLogin.java:103)
    at org.apache.kafka.common.security.authenticator.LoginManager.<init>(LoginManager.java:62)
    at org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:105)
    at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:170)
    ... 37 more
Caused by: java.net.PortUnreachableException
    at org.graalvm.nativeimage.builder/com.oracle.svm.core.jni.functions.JNIFunctions$NewObjectWithObjectArrayArgFunctionPointer.invoke(JNIFunctions.java)
    at org.graalvm.nativeimage.builder/com.oracle.svm.core.jni.functions.JNIFunctions.ThrowNew(JNIFunctions.java:882)
    at java.base@17.0.8/sun.nio.ch.DatagramChannelImpl.receive0(DatagramChannelImpl.java)
    at java.base@17.0.8/sun.nio.ch.DatagramChannelImpl.receiveIntoNativeBuffer(DatagramChannelImpl.java:750)
    at java.base@17.0.8/sun.nio.ch.DatagramChannelImpl.receive(DatagramChannelImpl.java:728)
    at java.base@17.0.8/sun.nio.ch.DatagramChannelImpl.trustedBlockingReceive(DatagramChannelImpl.java:699)
    at java.base@17.0.8/sun.nio.ch.DatagramChannelImpl.blockingReceive(DatagramChannelImpl.java:633)
    at java.base@17.0.8/sun.nio.ch.DatagramSocketAdaptor.receive(DatagramSocketAdaptor.java:240)
    at java.base@17.0.8/java.net.DatagramSocket.receive(DatagramSocket.java:700)
    at java.security.jgss@17.0.8/sun.security.krb5.internal.UDPClient.receive(NetClient.java:205)
    at java.security.jgss@17.0.8/sun.security.krb5.KdcComm$KdcCommunication.run(KdcComm.java:409)
    at java.security.jgss@17.0.8/sun.security.krb5.KdcComm$KdcCommunication.run(KdcComm.java:369)
    at java.base@17.0.8/java.security.AccessController.executePrivileged(AccessController.java:144)
    at java.base@17.0.8/java.security.AccessController.doPrivileged(AccessController.java:569)
    at java.security.jgss@17.0.8/sun.security.krb5.KdcComm.send(KdcComm.java:353)
    at java.security.jgss@17.0.8/sun.security.krb5.KdcComm.sendIfPossible(KdcComm.java:252)
    at java.security.jgss@17.0.8/sun.security.krb5.KdcComm.send(KdcComm.java:228)
    at java.security.jgss@17.0.8/sun.security.krb5.KdcComm.send(KdcComm.java:199)
    at java.security.jgss@17.0.8/sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:345)
    at java.security.jgss@17.0.8/sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:498)
    at jdk.security.auth@17.0.8/com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:761)
    ... 50 more

I need some more time to look and remember how the GSSAPI authentication worked.

ozangunalp commented 7 months ago

I created a PR to improve the stability of this test. I observed that kerberos didn't yet listen to the port when we started the Kafka broker.