k-wall closed this issue 7 months ago
Even if I roll back, I'm still seeing SRCFG00050: server.auto-configure in SysPropConfigSource does not map to any root. I don't understand why.
@ozangunalp @robobario any idea?
I was invoking the integration-tests incorrectly in my local environment, so I was inadvertently running the image from a different build to the one I'd just built.
The integration tests are now passing for me on main.
Looking closer, I see now it's the testKerberosContainer test that is failing on CI, but we don't have the integration logs to work out why. It is not failing for me locally. I've raised #122 in the hope of discovering why. Wonder if this could be Docker download limits?
2023-12-13T15:24:51Z [ERROR] Tests run: 10, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 112.0 s <<< FAILURE! -- in com.ozangunalp.kafka.test.container.KafkaNativeContainerIT
[ERROR] com.ozangunalp.kafka.test.container.KafkaNativeContainerIT.testKerberosContainer -- Time elapsed: 63.85 s <<< ERROR!
org.testcontainers.containers.ContainerLaunchException: Container startup failed for image quay.io/ogunalp/kafka-native:main-4a8689368ac5241e0ad9ac5f49cbe4ccd1f5c185
    at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:362)
    at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:333)
    at com.ozangunalp.kafka.test.container.KafkaNativeContainerIT.testKerberosContainer(KafkaNativeContainerIT.java:181)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
Caused by: org.rnorth.ducttape.RetryCountExceededException: Retry limit hit with exception
    at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:88)
    at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:347)
    ... 5 more
Caused by: org.testcontainers.containers.ContainerLaunchException: Could not create/start container
    at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:566)
    at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:357)
    at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81)
    ... 6 more
Caused by: java.lang.IllegalStateException: Wait strategy failed. Container exited with code 1
    at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:536)
    ... 8 more
Caused by: org.testcontainers.containers.ContainerLaunchException: Timed out waiting for log output matching '.Kafka broker started.'
    at org.testcontainers.containers.wait.strategy.LogMessageWaitStrategy.waitUntilReady(LogMessageWaitStrategy.java:47)
    at org.testcontainers.containers.wait.strategy.AbstractWaitStrategy.waitUntilReady(AbstractWaitStrategy.java:52)
    at org.testcontainers.containers.GenericContainer.waitUntilContainerStarted(GenericContainer.java:912)
    at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:503)
I don't see that tag in quay from the above log: quay.io/ogunalp/kafka-native:main-4a8689368ac5241e0ad9ac5f49cbe4ccd1f5c185
Edit: oh, is that supposed to be something emitted locally only by the Quarkus build? There are similar-looking tags in quay, though.
Hmm, Docker Hub rate limiting on the Kerberos container is a good guess, but you'd think it would fail faster.
One thing I've found is that if you modify the tests so that the Kerberos test runs multiple times consecutively, the test fails on the second run. It can be fixed by adding a sleep between Kerberos server creation and starting the Kafka server (as well as setting refreshKrb5Config=true on the producer JAAS config, since the Kerberos client config is statically cached). It is also fixed by using the default wait strategy to wait for the external listening ports to come up on the Kerberos container.
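As a rough illustration of the "wait for the external listening ports" fix, here is a stdlib-only sketch of that idea (a hypothetical stand-in for Testcontainers' port wait strategy, not the project's actual code):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class PortWait {
    /** Polls until a TCP connect to host:port succeeds, or the deadline passes. */
    static boolean waitForListeningPort(String host, int port, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(host, port), 250);
                return true; // something is accepting connections
            } catch (IOException e) {
                Thread.sleep(100); // not listening yet; retry
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Demo: bind a listener on an ephemeral port, then confirm the wait succeeds.
        try (ServerSocket server = new ServerSocket(0)) {
            System.out.println(waitForListeningPort("127.0.0.1", server.getLocalPort(), 2000));
        }
    }
}
```

Note that a TCP probe like this wouldn't cover the KDC's UDP listener (Kerberos uses UDP by default), which may matter given the java.net.PortUnreachableException seen in the broker.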
It leads to the same failure message in the broker (Caused by: java.net.PortUnreachableException), so I'm wondering if something similar has happened here. I think there is a period of time between the log line Principal "admin/admin@EXAMPLE.COM" created. and the ports actually listening. Somehow it's reliably longer when you are creating a second Kerberos container, maybe to do with replacing the established Kerberos service? Or the second kafka-native container is somehow quicker to start. You could imagine a slow CI machine taking longer and hitting this race condition.
I've created a PR to demonstrate.
I don't have a mental model for Kerberos/Kafka. Does the Kafka broker coming up really require Kerberos to be ready? However, it looks like com.ozangunalp.kafka.test.container.KerberosContainer#KerberosContainer arranges to wait until the Principal "admin/admin@EXAMPLE.COM" created.
message is written by the Kerberos container. So Kerberos should be ready; otherwise we'd have seen a startup failure.
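To make the gap concrete: a log-message wait only pattern-matches container output, so it can pass while the KDC's ports are still closed. A minimal stdlib sketch of the kind of check such a wait performs (the regex is my guess based on the message quoted above, not the project's actual pattern):

```java
import java.util.regex.Pattern;

public class LogWaitCheck {
    // Hypothetical readiness pattern, modelled on a Testcontainers
    // log-message wait for the KDC's "Principal ... created." line.
    static final Pattern READY =
            Pattern.compile(".*Principal \"admin/admin@EXAMPLE\\.COM\" created\\..*");

    static boolean isReadyLine(String logLine) {
        return READY.matcher(logLine).matches();
    }

    public static void main(String[] args) {
        // Matching this line says nothing about whether the KDC's
        // UDP/TCP listeners are actually accepting traffic yet.
        System.out.println(isReadyLine("Principal \"admin/admin@EXAMPLE.COM\" created."));
        System.out.println(isReadyLine("krb5kdc: starting..."));
    }
}
```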
Let's get the container logs from CI. Maybe someone could review #122?
I don't see that tag in quay from the above log: quay.io/ogunalp/kafka-native:main-4a8689368ac5241e0ad9ac5f49cbe4ccd1f5c185
Edit: oh, is that supposed to be something emitted locally only by the Quarkus build? There are similar-looking tags in quay, though.
Yeah, it is an ephemeral image that never gets pushed to quay.
I didn't have time to look at this yesterday, taking a look now.
So I merged #122, but the test no longer fails on CI? Seems like it is flaky.
I can reproduce the failure locally with the following (stripped) stack:
Debug is true storeKey true useTicketCache false useKeyTab true doNotPrompt false ticketCache is null isInitiator true KeyTab is /opt/kafka/config/kafkabroker.keytab refreshKrb5Config is true principal is kafka/localhost@EXAMPLE.COM tryFirstPass is false useFirstPass is false storePass is false clearPass is false
Refreshing Kerberos configuration
[Krb5LoginModule] authentication failed
null
2023-12-14 03:39:04,178 ERROR [kaf.ser.BrokerServer] (main) [BrokerServer id=0] Fatal error during broker startup. Prepare to shutdown: org.apache.kafka.common.KafkaException: javax.security.auth.login.LoginException
at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:184)
at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:192)
at org.apache.kafka.common.network.ChannelBuilders.serverChannelBuilder(ChannelBuilders.java:107)
at com.ozangunalp.kafka.server.EmbeddedKafkaBroker.start(EmbeddedKafkaBroker.java:208)
Caused by: javax.security.auth.login.LoginException
at jdk.security.auth@17.0.8/com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:793)
at jdk.security.auth@17.0.8/com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:597)
at java.base@17.0.8/javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
at java.base@17.0.8/javax.security.auth.login.LoginContext$4.run(LoginContext.java:679)
at java.base@17.0.8/javax.security.auth.login.LoginContext$4.run(LoginContext.java:677)
at java.base@17.0.8/java.security.AccessController.executePrivileged(AccessController.java:144)
at java.base@17.0.8/java.security.AccessController.doPrivileged(AccessController.java:712)
at java.base@17.0.8/javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:677)
at java.base@17.0.8/javax.security.auth.login.LoginContext.login(LoginContext.java:587)
at org.apache.kafka.common.security.authenticator.AbstractLogin.login(AbstractLogin.java:60)
at org.apache.kafka.common.security.kerberos.KerberosLogin.login(KerberosLogin.java:103)
at org.apache.kafka.common.security.authenticator.LoginManager.<init>(LoginManager.java:62)
at org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:105)
at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:170)
... 37 more
Caused by: java.net.PortUnreachableException
at org.graalvm.nativeimage.builder/com.oracle.svm.core.jni.functions.JNIFunctions$NewObjectWithObjectArrayArgFunctionPointer.invoke(JNIFunctions.java)
at org.graalvm.nativeimage.builder/com.oracle.svm.core.jni.functions.JNIFunctions.ThrowNew(JNIFunctions.java:882)
at java.base@17.0.8/sun.nio.ch.DatagramChannelImpl.receive0(DatagramChannelImpl.java)
at java.base@17.0.8/sun.nio.ch.DatagramChannelImpl.receiveIntoNativeBuffer(DatagramChannelImpl.java:750)
at java.base@17.0.8/sun.nio.ch.DatagramChannelImpl.receive(DatagramChannelImpl.java:728)
at java.base@17.0.8/sun.nio.ch.DatagramChannelImpl.trustedBlockingReceive(DatagramChannelImpl.java:699)
at java.base@17.0.8/sun.nio.ch.DatagramChannelImpl.blockingReceive(DatagramChannelImpl.java:633)
at java.base@17.0.8/sun.nio.ch.DatagramSocketAdaptor.receive(DatagramSocketAdaptor.java:240)
at java.base@17.0.8/java.net.DatagramSocket.receive(DatagramSocket.java:700)
at java.security.jgss@17.0.8/sun.security.krb5.internal.UDPClient.receive(NetClient.java:205)
at java.security.jgss@17.0.8/sun.security.krb5.KdcComm$KdcCommunication.run(KdcComm.java:409)
at java.security.jgss@17.0.8/sun.security.krb5.KdcComm$KdcCommunication.run(KdcComm.java:369)
at java.base@17.0.8/java.security.AccessController.executePrivileged(AccessController.java:144)
at java.base@17.0.8/java.security.AccessController.doPrivileged(AccessController.java:569)
at java.security.jgss@17.0.8/sun.security.krb5.KdcComm.send(KdcComm.java:353)
at java.security.jgss@17.0.8/sun.security.krb5.KdcComm.sendIfPossible(KdcComm.java:252)
at java.security.jgss@17.0.8/sun.security.krb5.KdcComm.send(KdcComm.java:228)
at java.security.jgss@17.0.8/sun.security.krb5.KdcComm.send(KdcComm.java:199)
at java.security.jgss@17.0.8/sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:345)
at java.security.jgss@17.0.8/sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:498)
at jdk.security.auth@17.0.8/com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:761)
... 50 more
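For context on the root cause above: when a datagram is sent to a UDP port with no listener, the kernel's ICMP "port unreachable" reply surfaces on a connected socket as java.net.PortUnreachableException, which is what the JDK's UDP KDC client hits when the KDC port isn't open yet. A small stdlib sketch (the exact behaviour depends on the OS delivering ICMP locally; a firewall may turn this into a timeout instead):

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetSocketAddress;
import java.net.PortUnreachableException;
import java.net.SocketTimeoutException;

public class UdpProbe {
    // Sends one datagram to a port with no listener and reports what happens.
    static String probe(int closedPort) throws Exception {
        try (DatagramSocket sock = new DatagramSocket()) {
            sock.connect(new InetSocketAddress("127.0.0.1", closedPort));
            sock.setSoTimeout(1000);
            sock.send(new DatagramPacket(new byte[] {0}, 1));
            sock.receive(new DatagramPacket(new byte[64], 64));
            return "reply"; // would mean something answered
        } catch (PortUnreachableException e) {
            return "PortUnreachableException"; // ICMP port-unreachable was delivered
        } catch (SocketTimeoutException e) {
            return "timeout"; // ICMP suppressed or filtered
        }
    }

    public static void main(String[] args) throws Exception {
        // Find a port nothing is bound to by binding and immediately releasing one.
        int port;
        try (DatagramSocket tmp = new DatagramSocket(0)) {
            port = tmp.getLocalPort();
        }
        System.out.println(probe(port));
    }
}
```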
I need some more time to look and remember how the GSSAPI authentication worked.
I created a PR to improve the stability of this test. I observed that Kerberos wasn't yet listening on the port when we started the Kafka broker.
With the last merge (PR #120), the integration tests have started to fail on CI.
https://github.com/ozangunalp/kafka-native/actions/runs/7196947318/job/19603932764#step:6:3963
I'm trying to reproduce the issue in my local environment.
I think the container is failing with an issue like this: (edit: removed misleading stack trace)
I haven't yet worked out why this system property is failing to map to the bean.