SSL requests hang when returning a CompletableFuture

mikaelmercur commented 10 months ago

Describe the bug

Requests hang when an endpoint is returning a CompletableFuture and SSL is active. When turning off SSL or the endpoint is changed to not return a CompletableFuture, the requests seem not to hang. The requests hang for the time specified in quarkus.http.idle-timeout.

Expected behavior

Requests should not hang.

Actual behavior

Requests hang.

How to Reproduce?

hanging-ssl.zip

Unzip the attached zip-archive
Run the project
Do a GET request to http://localhost:8080/test/count (here http does not matter)
Notice that "Request completed" is never printed (well, after quarkus.http.idle-timeout it is, but that defaults to 30m)
If changing the code such that the endpoint test/bytes is used internally, "Request Completed" is eventually printed

The example code uses OkHttpClient, but it is possible to reproduce with java.net.HttpClient aswell, but that seems to require more iterations.

Output of `uname -a` or `ver`

No response

Output of `java -version`

No response

Quarkus version or git rev

3.6.1

Build tool (ie. output of `mvnw --version` or `gradlew --version`)

gradle

Additional information

No response

cescoffier commented 9 months ago

Are you sure that the OKHttp is intended to be used in parallel (concurrently) like this? It works as soon as I remove parallel.

Also, it works perfectly with the Vert.x Web Client:

public DataResource(Vertx vertx) {
        client = WebClient.create(vertx, new WebClientOptions().setSsl(true).setTrustAll(true));
    }

    @GET
    @Path("count")
    @Produces(MediaType.TEXT_PLAIN)
    public int getCount(@DefaultValue("100") @QueryParam("iterations") int iterations) {
        System.out.println(Instant.now() + " Request started...");
        return Multi.createFrom().range(0, iterations)
                .onItem().transformToUni(i -> client.getAbs("https://localhost:8443/test/bytesAsync").send())
                .merge(100)
                .map(response -> response.body().length())
                .collect().asList()
                .map(l -> l.stream().mapToInt(i -> i).sum())
                .await().indefinitely();
}

mikaelmercur commented 9 months ago

@cescoffier, OkHttpClient is thread safe and you are supposed to share a single client for all threads. Same for the Java HttpClient. You are correct that the problem is tied to parallel requests, but also to Http/2 and to use of SSL. The vertx web client seems to be a bit slower (at least in my testing) compared to java's http client and ok httpclient, which could indicate that the vertx client doesn't use as much parallellism as the others (?). I still believe it is on the Quarkus side, since it hangs both with OkHttp and Java's http client. Also, downgrading Quarkus to 2.16.12 (as we use in production) the probability of getting the hang is way lower. hanging-ssl-2.zip is a slightly modified version of the initial project where you can try with OkHttpClient, Java's HttpClient and the Vertx Webclient by using the query parameter "client" with value "ok" for OkHttpClient, "java" for the Java HttpClient and "vertx" for the vertx webclient. For me, only the vertx client passes that test.

cescoffier commented 9 months ago

The web client concurrency is configurable. You can increase it. I just used the default.

Now, there is not much we can do in Quarkus. It seems to be a client issue.

I will have a look to the new reproducer, however, I would recommend using the vertx client or the rest client. At least they integrate correctly with the I/O mechanism of Quarkus.

mikaelmercur commented 9 months ago

If it is the client side, then why would it differ so much between Quarkus 2 and Quarkus 3? If you downgrade Quarkus to 2.16 in hanging-ssl-2 (you have to set the version in gradle.properties and fix a couple of imports) you can see that there is no problem running the example with default settings (java http client, 100 iterations). Changing to OkHttpClient works fine as well with 100 iterations. In Quarkus 3 (for me at least) the program always get stuck with the java client and the okhttpclient in less than 100 iterations.

mikaelmercur commented 9 months ago

Also, changing the endpoint to return Data instead of CompletableFuture\<Data> makes the requests not hang anymore with either of the clients. You can test that by changing the uri from "https://localhost:8443/test/bytesAsync" to "https://localhost:8443/test/bytes"

cescoffier commented 9 months ago

There is another issue in the reproducer. Using parallel() schedules tasks on the common fork/join pool. On the server side (running on the same JVM), you use supplyAsync which schedules tasks on the common fork/join pool. So, basically, you are exhausting it.

If you do:

@GET
    @Path("bytesAsync")
    @ResponseStatus(200)
    @Produces({"application/json", "application/cbor"}) // Not really what is produced...
    public Uni<Data> getBytesAsync() {
        return Uni.createFrom().item(DATA).emitOn(Infrastructure.getDefaultExecutor());
    }

Then it works (because only the client uses the F/J).

mikaelmercur commented 9 months ago

Removed the parallel() part in the client code, it is still hangning with java http client and ok http client, see hanging-ssl-3.zip. This is, of course, not how our real application is constructed (we have a client and a server). This is just a simple example, with similar preconditions as our real application, that is hanging in a similar way.

cescoffier commented 9 months ago

I would heavily recommend to NOT use supplyAsync without an executor. Relying on the f/j can be problematic (especially in containers). Why it's exhausted in this case is unclear. It might deadlock somewhere. I'm pretty sure it can be reproduced outside Quarkus (as soon as you have the concurrent access on the ssl context or things like this).

I have provided several workaround as comments. As mentioned I would recommend using the clients provided by Quarkus which avoid using the f/j, and integrate correctly with the I/O layer.

mikaelmercur commented 9 months ago

If I change the endpoint to:

public CompletableFuture<Data> getBytesAsync() {
        return CompletableFuture.completedFuture(DATA);
}

it still hangs. As I said, this is not what we do in our real application, but the example behaves in a similar way.

cescoffier commented 9 months ago

Try with Uni instead of CompletableFuture.

cescoffier commented 9 months ago

I just tried with:

 @GET
    @Path("bytesAsync")
    @ResponseStatus(200)
    @Produces({"application/json", "application/cbor"}) // Not really what is produced...
    public CompletableFuture<Data> getBytesAsync() {
        return CompletableFuture.completedFuture(DATA);
    }

with the 3 clients:

…/quarkus-llm-workshop | Git  bump-0.7.0 [✓ ] | Java v21.0.1 took 2s
🕝 07:47:54 ➜ http ":8080/test/count?client=JAVA&iterations=10000"
HTTP/1.1 200 OK
Content-Type: text/plain;charset=UTF-8
content-length: 10

1000000000

…/quarkus-llm-workshop | Git  bump-0.7.0 [✓ ] | Java v21.0.1 took 39s
🕝 07:50:53 ➜ http ":8080/test/count?client=OK&iterations=10000"
HTTP/1.1 200 OK
Content-Type: text/plain;charset=UTF-8
content-length: 10

1000000000

…/quarkus-llm-workshop | Git  bump-0.7.0 [✓ ] | Java v21.0.1 took 37s
🕝 07:51:52 ➜ http ":8080/test/count?client=VERTX&iterations=10000"
HTTP/1.1 200 OK
Content-Type: text/plain;charset=UTF-8
content-length: 10

1000000000

No problem at all.

However, I'm using Quarkus 999-VERSION. Please check with 3.7 as it may have already been fixed.

mikaelmercur commented 9 months ago

With 3.7.0.CR7 it still freezes for me. However, I had to increase the number of iterations to 10000 or run the example twice with just 100 iterations to make the probability of a hang as high as before. I can still get it first try with 100 iterations sometimes though using the java client.

Can you run the following without freeze: curl "http://localhost:8080/test/count?client=java&iterations=10000"

And can you do the following twice without a freeze: curl "http://localhost:8080/test/count?client=java"

A lot of our server-side code is built upon CompletableFuture, that's why it is convenient to just make the endpoint return it. But to convert the endpoint to return a Uni I guess we can do something similar to:

@GET
@Path("bytesAsync")
@ResponseStatus(200)
@Produces({"application/json", "application/cbor"}) // Not really what is produced...
public Uni<Data> getBytesAsync() {
  return Uni.createFrom()
          .completionStage(CompletableFuture.completedFuture(DATA))
          .emitOn(Infrastructure.getDefaultExecutor());
}

However, this also hangs for me.

Edit: I changed the endpoint as suggested to:

@GET
@Path("bytesAsync")
@ResponseStatus(200)
@Produces({"application/json", "application/cbor"}) // Not really what is produced...
public Uni<Data> getBytesAsync() {
    return Uni.createFrom().item(DATA).emitOn(Infrastructure.getDefaultExecutor());
}

On the 11:th try the following call got a hang: curl "http://localhost:8080/test/count?client=java&iterations=1000"

cescoffier commented 9 months ago

Ok, I will try to make it hang on my machine - so far no luck. I need to generate a thread dump when it happens.

mikaelmercur commented 9 months ago

I refactored my client so that it runs in a different process now. threads.txt is a thread dump from Quarkus during the hang. Hopefully that will help. The thread dump is from using the Uni endpoint as you suggested (without CompletableFuture at all). client_threads.txt is a thread dump from the client.

cescoffier commented 9 months ago

Hum, no deadlock. Can you run it without the dev mode?

mikaelmercur commented 9 months ago

threadsNotDevMode.txt is a thread dump from when not running in dev mode.

cescoffier commented 9 months ago

So, I'm dug a bit more. It seems to be something in resteasy reactive (server). I was unable to get it to hang with a plain route.

cescoffier commented 9 months ago

Next step... seems to be due to the serialization of the content when emitted from a Uni or CompletionStage.

cescoffier commented 9 months ago

Can you try with this serializer?

@Provider
@Produces({"application/json", "application/cbor"})
public class DataBodyWriter implements ServerMessageBodyWriter<Data> {

    @Override
    public boolean isWriteable(Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
        return Data.class.isAssignableFrom(type);
    }

    @Override
    public void writeTo(Data data, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType, MultivaluedMap<String, Object> httpHeaders, OutputStream entityStream) throws IOException, WebApplicationException {
        entityStream.write(data.bytes());
    }

    @Override
    public boolean isWriteable(Class<?> type, Type genericType, ResteasyReactiveResourceInfo target, MediaType mediaType) {
        return true;
    }

    @Override
    public void writeResponse(Data o, Type genericType, ServerRequestContext context) throws WebApplicationException, IOException {
        context.serverResponse().end(o.bytes());
    }
}

Not sure what's the problem yet.

mikaelmercur commented 9 months ago

That actually seems to solve the problem, and might be a feasible workaroud for our problem in our real application. I will test that as well.

mikaelmercur commented 9 months ago

It seems to solve the problem in our real application as well. However, since ServerMessageBodyWriter extends MessageBodyWriter, can we be sure that this workaroud covers all code paths, or can there still be cases where we might get the hang?

cescoffier commented 9 months ago

Thanks for the feedback! It's a workaround, I need to check exactly what's going on.

The new class is what we use internally. So, it's safe. But it should not be required, so something is definitely off.

cescoffier commented 9 months ago

Ok, I believe I understand the issue. @geoand, fasten your seat belt.

So, in this context, the endpoint is executed on the event loop (because it either returns a Uni or a CompletionStage). The application contains its own serializer.

In ServerSerialisers, the invokeWriter detect whether the serializer is a ServerMessageBodyWriter (that's the workaround posted above) or a plain MessageBodyWriter. This latter case is faulty in this context. It needs to be clarified what is going on. I do not see deadlocks or event loops blocked.

Looking at the code, I'm not sure the MessageBodyWriter should be called from the event loop. @geoand is it allowed to invoke that on the event loop? I can't see anything blocking, however, because we pass null as handlers, we may not orchestrate the calls correctly. They should all be enqueued in the same event loop queue. However, something weird happens.

geoand commented 9 months ago

@geoand is it allowed to invoke that on the event loop?

Yeah, there is no real reason why they can't

geoand commented 9 months ago

How can I try reproduce the problem?

cescoffier commented 9 months ago

I'm using the second reproducer.

Update the gradle.properties to:

#Gradle properties
quarkusPluginId=io.quarkus
quarkusPluginVersion=999-SNAPSHOT
quarkusPlatformGroupId=io.quarkus
quarkusPlatformArtifactId=quarkus-bom
quarkusPlatformVersion=999-SNAPSHOT

Run the app (gradle quarkusDev)
In a terminal, run: http ":8080/test/count?client=JAVA&iterations=10000"

geoand commented 9 months ago

Thanks!

I'll have a look tomorrow

cescoffier commented 9 months ago

The weird things is:

works with routes (so not a HTTP layer issue)
works with blocking (so it's when the endpoint is invoked on the event loop, regardless the emission thread)
works if we use the Quarkus specific serializer

geoand commented 9 months ago

Very very weird indeed. I'll try and get to the bottom of it tomorrow

geoand commented 9 months ago

In a terminal, run: http ":8080/test/count?client=JAVA&iterations=10000"

Do the problem manifest when just hitting bytesAsync or is this very high number of iterations needed?

geoand commented 9 months ago

~~I did a very rough test and can't reproduce the problem.~~ It seems like I can reproduce it with a high enough number of iterations

geoand commented 9 months ago

There might be some super obscure race condition in ResteasyReactiveOutputStream... The problem is that I don't see anything weird in the thread dump

cescoffier commented 9 months ago

I ended up with the same conclusion. That's why I mentioned the null handlers. Normally, it's not an issue because everything runs from the same event loop, so, we just enqueue the task.

I've checked that we are not buffering, so, we only do: response.write(...) and when close is called: response.end().

Holding the connection lock can be an issue, but thread dumps didn't show any deadlock.

geoand commented 9 months ago

Yeah... So I am really at a loss about what to do...

cescoffier commented 9 months ago

It's more and more strange. I've instrumented the code to check if the responses were ended. Indeed, based on the report (idle timeout thrown), the request is processed, but the response has not ended.

It's the case.... For some unknown reasons, the responses have not ended. But here is the mystery: the number of responses not ended is always 62 on my machine. Why this magic number?

geoand commented 9 months ago

Does it have something to do with the fork join pool threads or something?

cescoffier commented 9 months ago

No, it's not related to the f/j. That was my first assumption, but I still have the same issue without.

Few new discoveries:

the drain handler is called (meaning that we have to wait to be writable) - we may have a race here
the connection based locking may miss a notification.
the hanging request have not been written at all - so it's not only the end that is missing it's any write

geoand commented 9 months ago

Lovely...

cescoffier commented 9 months ago

Ok, so @geoand and I did some debugging. The drain handler is not called for some reasons. It seems to be a bug in Vert.x.

It is related to HTTP/2 with TLS (otherwise it works).

@vietj any reason why the drain handler would not be called in the case?

geoand commented 9 months ago

@cescoffier and I wondering if this could even be a Netty issue as it only happens when using HTTP2 and SSL.

Moreover the furthest down the rabbit hole I've gone is to be find that Http2ServerResponse#handlerWritabilityChanged is never called with writable set to true for the problematic case (while it does get called as expected for all other cases).

@vietj ^

geoand commented 9 months ago

Actually I see the same thing happening for HTTP 1x as well, so HTTP2 does not seem to be required to trigger the issue - just the use of SSL

geoand commented 9 months ago

And now I am seeing even more complicated behavior...

I am seeing the handlerWritabilityChanged being called, but before we register our drain handlers...

So I am still not sure where the problem lies...

geoand commented 9 months ago

So I am still not sure where the problem lies...

I am starting to think there might be two different things:

handlerWritabilityChanged not being called sometimes when SSL is being used
we might have to register drain handlers sooner

geoand commented 9 months ago

https://github.com/quarkusio/quarkus/pull/38444 fixes the second part of my comment above, however the first part still remains and I believe it's a Vert.x or Netty issue.

cc @cescoffier @franz1981 @vietj

mikaelmercur commented 9 months ago

Hi, just wanted to let you know that the following ServerMessageBodyWriter also hangs:

@Provider
@Produces({"application/json", "application/cbor"})
public class DataBodyWriter implements ServerMessageBodyWriter<Data> {

    @Override
    public boolean isWriteable(Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
        return Data.class.isAssignableFrom(type);
    }

    @Override
    public void writeTo(Data data, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType, MultivaluedMap<String, Object> httpHeaders, OutputStream entityStream) throws IOException, WebApplicationException {
        entityStream.write(data.bytes());
    }

    @Override
    public boolean isWriteable(Class<?> type, Type genericType, ResteasyReactiveResourceInfo target, MediaType mediaType) {
        return Data.class.isAssignableFrom(type);
    }

    @Override
    public void writeResponse(Data o, Type genericType, ServerRequestContext context) throws WebApplicationException, IOException {
        try (var out = context.getOrCreateOutputStream()) {
            out.write(o.bytes());
        }
    }
}

Hope this is helpful.

geoand commented 9 months ago

Thanks,

given that the code is doing:

    @Override
    public void writeResponse(Data o, Type genericType, ServerRequestContext context) throws WebApplicationException, IOException {
        try (var out = context.getOrCreateOutputStream()) {
            out.write(o.bytes());
        }
    }

and the findings so far, it's expected that it hangs

cescoffier commented 9 months ago

I've updated the PR with a possible workaround.

quarkusio / quarkus