quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io
Apache License 2.0

kafka-client with SASL aws-msk-iam-auth #39548

Closed jvdadda closed 5 months ago

jvdadda commented 6 months ago

Describe the bug

When using the aws-msk-iam-auth library with a native build, I am unable to connect to the Kafka broker. I get the exception java.io.IOException: Channel could not be created for socket java.nio.channels.SocketChannel[closed]

Everything works as expected with the JVM build.

Expected behavior

As with the JVM build, the producer should be able to connect to the broker and then produce a message. Here are the logs when it works:

2024-03-15 20:12:43,653 DEBUG [sof.ama.msk.aut.iam.int.MSKCredentialProvider] (executor-thread-1) Number of options to configure credential provider 1
2024-03-15 20:12:43,664 DEBUG [sof.ama.msk.aut.iam.IAMLoginModule] (executor-thread-1) IAMLoginModule initialized
2024-03-15 20:12:43,665 INFO  [org.apa.kaf.com.sec.aut.AbstractLogin] (executor-thread-1) Successfully logged in.
2024-03-15 20:12:43,670 DEBUG [org.apa.kaf.com.sec.ssl.DefaultSslEngineFactory] (executor-thread-1) Created SSL context with keystore null, truststore null, provider SunJSSE.
2024-03-15 20:12:43,887 DEBUG [org.apa.kaf.com.sec.aut.SaslClientAuthenticator] (kafka-producer-network-thread | producer-1) [Producer clientId=producer-1] Set SASL client state to SEND_APIVERSIONS_REQUEST
2024-03-15 20:12:43,896 DEBUG [org.apa.kaf.com.sec.aut.SaslClientAuthenticator] (kafka-producer-network-thread | producer-1) [Producer clientId=producer-1] Creating SaslClient: client=null;service=kafka;serviceHostname=b-3.***.kafka.eu-central-1.amazonaws.com;mechs=[AWS_MSK_IAM]
2024-03-15 20:12:43,906 DEBUG [sof.ama.msk.aut.iam.int.IAMSaslClient] (kafka-producer-network-thread | producer-1) Setting SASL/AWS_MSK_IAM.750044075 client state to SEND_CLIENT_FIRST_MESSAGE
2024-03-15 20:12:43,909 DEBUG [org.apa.kaf.com.net.Selector] (kafka-producer-network-thread | producer-1) [Producer clientId=producer-1] Created socket with SO_RCVBUF = 32768, SO_SNDBUF = 131072, SO_TIMEOUT = 0 to node -3
2024-03-15 20:12:43,928 DEBUG [jdk.eve.security] (kafka-producer-network-thread | producer-1) X509Certificate: Alg:SHA256withRSA, Serial:bbe59968a757f8b5534c461887dafaa, Subject:CN=*.***.kafka.eu-central-1.amazonaws.com, Issuer:CN=Amazon RSA 2048 M03, O=Amazon, C=US, Key type:RSA, Length:2048, Cert Id:183160222, Valid from:2/20/24, 12:00 AM, Valid until:3/20/25, 11:59 PM
<-- LOT OF jdk.eve.security STUFF -->
2024-03-15 20:12:44,181 DEBUG [jdk.eve.security] (kafka-producer-network-thread | producer-1)  TLSHandshake: b-3.***.kafka.eu-central-1.amazonaws.com:9098, TLSv1.2, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384, 183160222
2024-03-15 20:12:44,182 DEBUG [org.apa.kaf.com.net.SslTransportLayer] (kafka-producer-network-thread | producer-1) [SslTransportLayer channelId=-3 key=channel=java.nio.channels.SocketChannel[connection-pending remote=b-3.***.kafka.eu-central-1.amazonaws.com/10.192.21.231:9098], selector=sun.nio.ch.EPollSelectorImpl@5a086f5d, interestOps=8, readyOps=0] SSL handshake completed successfully with peerHost 'b-3.***.kafka.eu-central-1.amazonaws.com' peerPort 9098 peerPrincipal 'CN=*.***.kafka.eu-central-1.amazonaws.com' protocol 'TLSv1.2' cipherSuite 'TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384'
2024-03-15 20:12:44,257 DEBUG [org.apa.kaf.com.sec.aut.SaslClientAuthenticator] (kafka-producer-network-thread | producer-1) [Producer clientId=producer-1] Set SASL client state to RECEIVE_APIVERSIONS_RESPONSE
2024-03-15 20:12:44,276 DEBUG [org.apa.kaf.com.sec.aut.SaslClientAuthenticator] (kafka-producer-network-thread | producer-1) [Producer clientId=producer-1] Set SASL client state to SEND_HANDSHAKE_REQUEST
2024-03-15 20:12:44,277 DEBUG [org.apa.kaf.com.sec.aut.SaslClientAuthenticator] (kafka-producer-network-thread | producer-1) [Producer clientId=producer-1] Set SASL client state to RECEIVE_HANDSHAKE_RESPONSE
2024-03-15 20:12:44,279 DEBUG [org.apa.kaf.com.sec.aut.SaslClientAuthenticator] (kafka-producer-network-thread | producer-1) [Producer clientId=producer-1] Set SASL client state to INITIAL
2024-03-15 20:12:44,280 DEBUG [sof.ama.msk.aut.iam.int.IAMSaslClient] (kafka-producer-network-thread | producer-1) State SEND_CLIENT_FIRST_MESSAGE at start of evaluating challenge
2024-03-15 20:12:44,281 DEBUG [sof.ama.msk.aut.iam.IAMClientCallbackHandler] (kafka-producer-network-thread | producer-1) Type information for callback: class: software.amazon.msk.auth.iam.internals.AWSCredentialsCallback classloader: io.quarkus.bootstrap.runner.RunnerClassLoader@2cb4c3ab from class: software.amazon.msk.auth.iam.IAMClientCallbackHandler classloader: io.quarkus.bootstrap.runner.RunnerClassLoader@2cb4c3ab
2024-03-15 20:12:44,282 DEBUG [sof.ama.msk.aut.iam.IAMClientCallbackHandler] (kafka-producer-network-thread | producer-1) Selecting provider software.amazon.msk.auth.iam.internals.MSKCredentialProvider to load credentials
...

Actual behavior

Here are the complete logs, including the exception:

2024-03-15 19:57:36,738 DEBUG [sof.ama.msk.aut.iam.int.MSKCredentialProvider] (executor-thread-1) Number of options to configure credential provider 1
2024-03-15 19:57:36,739 DEBUG [sof.ama.msk.aut.iam.IAMLoginModule] (executor-thread-1) IAMLoginModule initialized
2024-03-15 19:57:36,740 INFO  [org.apa.kaf.com.sec.aut.AbstractLogin] (executor-thread-1) Successfully logged in.
2024-03-15 19:57:36,740 DEBUG [org.apa.kaf.com.sec.ssl.DefaultSslEngineFactory] (executor-thread-1) Created SSL context with keystore null, truststore null, provider SunJSSE.
2024-03-15 19:57:36,741 DEBUG [org.apa.kaf.com.sec.aut.SaslClientAuthenticator] (kafka-producer-network-thread | producer-1) [Producer clientId=producer-1] Set SASL client state to SEND_APIVERSIONS_REQUEST
2024-03-15 19:57:36,742 DEBUG [org.apa.kaf.com.sec.aut.SaslClientAuthenticator] (kafka-producer-network-thread | producer-1) [Producer clientId=producer-1] Creating SaslClient: client=null;service=kafka;serviceHostname=b-3.***.kafka.eu-central-1.amazonaws.com;mechs=[AWS_MSK_IAM]

2024-03-15 19:57:36,743 WARN  [org.apa.kaf.cli.NetworkClient] (kafka-producer-network-thread | producer-1) [Producer clientId=producer-1] Error connecting to node b-3.***.kafka.eu-central-1.amazonaws.com:9098 (id: -3 rack: null): java.io.IOException: Channel could not be created for socket java.nio.channels.SocketChannel[closed]
    at org.apache.kafka.common.network.Selector.buildAndAttachKafkaChannel(Selector.java:348)
    at org.apache.kafka.common.network.Selector.registerChannel(Selector.java:329)
    at org.apache.kafka.common.network.Selector.connect(Selector.java:256)
    at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:1032)
    at org.apache.kafka.clients.NetworkClient.access$600(NetworkClient.java:73)
    at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1203)
    at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1091)
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:569)
    at org.apache.kafka.clients.NetworkClientUtils.isReady(NetworkClientUtils.java:42)
    at org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:64)
    at org.apache.kafka.clients.producer.internals.Sender.awaitNodeReady(Sender.java:562)
    at org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:483)
    at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:331)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:246)
    at java.base@21.0.2/java.lang.Thread.runWith(Thread.java:1596)
    at java.base@21.0.2/java.lang.Thread.run(Thread.java:1583)
    at org.graalvm.nativeimage.builder/com.oracle.svm.core.thread.PlatformThreads.threadStartRoutine(PlatformThreads.java:833)
    at org.graalvm.nativeimage.builder/com.oracle.svm.core.posix.thread.PosixPlatformThreads.pthreadStartRoutine(PosixPlatformThreads.java:211)
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.errors.SaslAuthenticationException: Failed to configure SaslClientAuthenticator
    at org.apache.kafka.common.network.SaslChannelBuilder.buildChannel(SaslChannelBuilder.java:239)
    at org.apache.kafka.common.network.Selector.buildAndAttachKafkaChannel(Selector.java:338)
    ... 17 more
Caused by: org.apache.kafka.common.errors.SaslAuthenticationException: Failed to configure SaslClientAuthenticator
Caused by: org.apache.kafka.common.errors.SaslAuthenticationException: Failed to create SaslClient with mechanism AWS_MSK_IAM

How to Reproduce?

Use this project: https://github.com/jvdadda/debug-aws-msk-iam-auth

You need to replace a value in application.properties with the list of endpoints of a public MSK cluster with IAM enabled, then replace the credentials in the Makefile with ones that can access the cluster.

Then run make build/native start or make build/jvm start

Output of uname -a or ver

Darwin mob-mac-53aa52 23.2.0 Darwin Kernel Version 23.2.0: Wed Nov 15 21:55:06 PST 2023; root:xnu-10002.61.3~2/RELEASE_ARM64_T6020 arm64

Output of java -version

openjdk version "21.0.2" 2024-01-16 OpenJDK Runtime Environment (build 21.0.2+13-58) OpenJDK 64-Bit Server VM (build 21.0.2+13-58, mixed mode, sharing)

Mandrel or GraalVM version (if different from Java)

Same on Mandrel and GraalVM

Quarkus version or git rev

3.8.2

Build tool (ie. output of mvnw --version or gradlew --version)

------------------------------------------------------------
Gradle 8.5
------------------------------------------------------------

Build time:   2023-11-29 14:08:57 UTC
Revision:     28aca86a7180baa17117e0e5ba01d8ea9feca598

Kotlin:       1.9.20
Groovy:       3.0.17
Ant:          Apache Ant(TM) version 1.10.13 compiled on January 4 2023
JVM:          21.0.2 (Oracle Corporation 21.0.2+13-58)
OS:           Mac OS X 14.2.1 aarch64

Additional information

No response

quarkus-bot[bot] commented 6 months ago

/cc @Karm (mandrel), @alesj (kafka), @cescoffier (kafka), @galderz (mandrel), @geoand (kotlin), @ozangunalp (kafka), @zakkak (mandrel)

jvdadda commented 6 months ago

I can privately share temp AWS credentials and MSK endpoints

galderz commented 6 months ago

@jvdadda I can try. Email me at first name at redhat com.

galderz commented 6 months ago

I've found the problem. The issue is related to these methods:

public String[] getMechanismNames(Map<String, ?> props) {
    return new String[]{IAMSaslClient.getMechanismNameForClassLoader(this.getClass().getClassLoader())};
}

And:

public static String getMechanismNameForClassLoader(ClassLoader classLoader) {
    return "AWS_MSK_IAM." + classLoader.hashCode();
}

The security providers ClassLoaderAwareIAMSaslClientProvider and IAMSaslClientProvider are registered at build time via the static block of software.amazon.msk.auth.iam.IAMLoginModule, so their mechanism names are baked in with the classloader hashcode value from build time. The issue is that at runtime that hashcode has a different value. As a result, ClassLoaderAwareIAMSaslClientFactory, whose mech name does not depend on the hashcode, is located fine, but the lookup of IAMSaslClientFactory fails because the methods above create a mech name from a value different from the one used at build time.

Thankfully there's a very easy workaround that you can apply. Build the native executable with:

-Dquarkus.native.additional-build-args=--initialize-at-run-time=software.amazon.msk.auth.iam.IAMLoginModule

Building it like this makes IAMLoginModule runtime initialized, so the security providers are registered with the hashcode at run time. Then it all works.
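
To illustrate the mismatch outside of Kafka and GraalVM, here is a minimal, self-contained sketch. Only the naming scheme is taken from the method quoted above. FixedHashLoader, register and lookup are hypothetical names made up for this example and are not part of aws-msk-iam-auth, and the two hash values are arbitrary. It shows a name registered with one loader's hashcode not being found when it is recomputed from another loader, and being found once registration uses the same loader as the lookup, which is roughly what run-time initialization achieves.

```java
import java.util.HashMap;
import java.util.Map;

public class MechanismNameMismatch {

    /** Hypothetical stand-in loader with a fixed hash code, so the demo is deterministic. */
    public static final class FixedHashLoader extends ClassLoader {
        private final int hash;

        public FixedHashLoader(int hash) {
            this.hash = hash;
        }

        @Override
        public int hashCode() {
            return hash;
        }
    }

    /** Same naming scheme as getMechanismNameForClassLoader quoted above. */
    public static String mechanismNameFor(ClassLoader classLoader) {
        return "AWS_MSK_IAM." + classLoader.hashCode();
    }

    /** Simulates provider registration: the name is computed once and stored. */
    public static Map<String, String> register(ClassLoader loaderAtRegistration) {
        Map<String, String> registry = new HashMap<>();
        registry.put(mechanismNameFor(loaderAtRegistration), "IAMSaslClientFactory");
        return registry;
    }

    /** Simulates the later lookup, which recomputes the name from the current loader. */
    public static String lookup(Map<String, String> registry, ClassLoader loaderAtLookup) {
        return registry.get(mechanismNameFor(loaderAtLookup));
    }

    public static void main(String[] args) {
        // Arbitrary hash values standing in for the build-time and run-time loaders.
        ClassLoader buildTime = new FixedHashLoader(1);
        ClassLoader runTime = new FixedHashLoader(2);

        // Registered at "build time", looked up at "run time": no match.
        System.out.println(lookup(register(buildTime), runTime));

        // Registered at "run time", looked up at "run time": match.
        System.out.println(lookup(register(runTime), runTime));
    }
}
```

Run as written, the first lookup prints null and the second prints IAMSaslClientFactory.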

I will check with those maintaining the AWS integration to make the above permanent.

jvdadda commented 6 months ago

Wow, thanks @galderz for the explanation and for the fix! The maintainers of the library are not really responsive. I made a lot of modifications to get it working up to this problem (that's why I included the jar and not the dependency).

I will try to fix it on my fork and propose a PR on the main repo

galderz commented 6 months ago

... The maintainers of the library are not really responsive. I made a lot of modifications to get it working up to this problem (that's why I included the jar and not the dependency).

Which maintainers are you talking about? Amazon AWS or the ones that contributed https://github.com/quarkiverse/quarkus-amazon-services? Do you have any links to public issues/discussions that have gone unanswered?

I will try to fix it on my fork and propose a PR on the main repo

I have created https://github.com/quarkiverse/quarkus-amazon-services/issues/1204 to make this fix permanent. The changes required are small and there are plenty of examples of how to do it in the main https://github.com/quarkusio/quarkus repo.

jvdadda commented 6 months ago

Which maintainers are you talking about?

I am talking about AWS (https://github.com/aws/aws-msk-iam-auth). The library still uses the AWS SDK v1, which creates a lot of problems (mainly the Random class used in static fields) and has no support for native compilation (unlike AWS SDK v2). The project I gave you includes a modified library jar (https://github.com/jvdadda/debug-aws-msk-iam-auth/tree/main/libs) that uses v2 (PRs in the project are open with no response).

If you replace https://github.com/jvdadda/debug-aws-msk-iam-auth/blob/main/build.gradle.kts#L27 with:

implementation("software.amazon.msk:aws-msk-iam-auth:2.0.3")

my example project will not work anymore.

Do you have any links to public issues/discussions that have gone unanswered?