mikaelmercur closed this issue 9 months ago.
Are you sure that OkHttp is intended to be used in parallel (concurrently) like this? It works as soon as I remove parallel().
Also, it works perfectly with the Vert.x Web Client:
public DataResource(Vertx vertx) {
    client = WebClient.create(vertx, new WebClientOptions().setSsl(true).setTrustAll(true));
}

@GET
@Path("count")
@Produces(MediaType.TEXT_PLAIN)
public int getCount(@DefaultValue("100") @QueryParam("iterations") int iterations) {
    System.out.println(Instant.now() + " Request started...");
    return Multi.createFrom().range(0, iterations)
            .onItem().transformToUni(i -> client.getAbs("https://localhost:8443/test/bytesAsync").send())
            .merge(100)
            .map(response -> response.body().length())
            .collect().asList()
            .map(l -> l.stream().mapToInt(i -> i).sum())
            .await().indefinitely();
}
@cescoffier, OkHttpClient is thread-safe and you are supposed to share a single client across all threads. The same goes for the Java HttpClient. You are correct that the problem is tied to parallel requests, but also to HTTP/2 and to the use of SSL. The Vert.x web client seems to be a bit slower (at least in my testing) compared to Java's HttpClient and OkHttpClient, which could indicate that the Vert.x client doesn't use as much parallelism as the others (?). I still believe the problem is on the Quarkus side, since it hangs with both OkHttp and Java's HttpClient. Also, after downgrading Quarkus to 2.16.12 (as we use in production), the probability of getting the hang is much lower.
hanging-ssl-2.zip is a slightly modified version of the initial project where you can try OkHttpClient, Java's HttpClient and the Vert.x WebClient by setting the query parameter "client" to "ok" for OkHttpClient, "java" for the Java HttpClient and "vertx" for the Vert.x WebClient. For me, only the Vert.x client passes that test.
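For illustration, here is a minimal sketch of the shared-client pattern described above. The URL and iteration count are assumptions, and the reproducer additionally configures the client to trust the self-signed certificate (omitted here), so this is not the reproducer's exact code:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.stream.IntStream;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

public class OkHttpParallelSketch {
    // One OkHttpClient instance shared by all threads; OkHttp is designed for this.
    private static final OkHttpClient CLIENT = new OkHttpClient();

    public static void main(String[] args) {
        int total = IntStream.range(0, 100).parallel().map(i -> {
            Request request = new Request.Builder().url("https://localhost:8443/test/bytesAsync").build();
            try (Response response = CLIENT.newCall(request).execute()) {
                return response.body().bytes().length; // size of this response's body
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }).sum();
        System.out.println(total);
    }
}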
The web client concurrency is configurable. You can increase it. I just used the default.
Now, there is not much we can do in Quarkus. It seems to be a client issue.
I will have a look at the new reproducer; however, I would recommend using the Vert.x client or the REST client. At least they integrate correctly with the I/O mechanism of Quarkus.
If it is on the client side, then why would it differ so much between Quarkus 2 and Quarkus 3? If you downgrade Quarkus to 2.16 in hanging-ssl-2 (you have to set the version in gradle.properties and fix a couple of imports), you can see that there is no problem running the example with default settings (Java HttpClient, 100 iterations). Changing to OkHttpClient works fine as well with 100 iterations. In Quarkus 3 (for me at least) the program always gets stuck with the Java client and OkHttpClient in fewer than 100 iterations.
Also, changing the endpoint to return Data instead of CompletableFuture<Data> makes the requests not hang anymore with either of the clients. You can test that by changing the URI from "https://localhost:8443/test/bytesAsync" to "https://localhost:8443/test/bytes".
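For reference, a minimal sketch of what that synchronous variant looks like, assuming the reproducer's DATA constant and resource class (the annotations mirror the async endpoint shown below):

@GET
@Path("bytes")
@Produces({"application/json", "application/cbor"}) // Not really what is produced...
public Data getBytes() {
    // Returning the entity directly (no CompletableFuture or Uni) does not trigger the hang.
    return DATA;
}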
There is another issue in the reproducer. Using parallel() schedules tasks on the common fork/join pool. On the server side (running on the same JVM), you use supplyAsync, which also schedules tasks on the common fork/join pool. So, basically, you are exhausting it.
If you do:
@GET
@Path("bytesAsync")
@ResponseStatus(200)
@Produces({"application/json", "application/cbor"}) // Not really what is produced...
public Uni<Data> getBytesAsync() {
    return Uni.createFrom().item(DATA).emitOn(Infrastructure.getDefaultExecutor());
}
Then it works (because only the client uses the F/J).
I removed the parallel() part in the client code; it still hangs with the Java HttpClient and OkHttpClient, see hanging-ssl-3.zip. This is, of course, not how our real application is constructed (we have a separate client and server). This is just a simple example, with similar preconditions to our real application, that hangs in a similar way.
I would heavily recommend NOT using supplyAsync without an executor. Relying on the F/J pool can be problematic (especially in containers). Why it's exhausted in this case is unclear; it might deadlock somewhere. I'm pretty sure it can be reproduced outside Quarkus (as soon as you have concurrent access to the SSL context or things like that).
I have provided several workarounds as comments. As mentioned, I would recommend using the clients provided by Quarkus, which avoid the F/J pool and integrate correctly with the I/O layer.
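As an illustration of that recommendation, a minimal sketch of passing an explicit executor to supplyAsync instead of relying on the common fork/join pool; the pool size and the loadData() helper are made up for the example:

// A dedicated, explicitly sized pool instead of ForkJoinPool.commonPool().
ExecutorService executor = Executors.newFixedThreadPool(8);

// loadData() is a hypothetical blocking call; the work now runs on the dedicated pool,
// so the common fork/join pool is not involved.
CompletableFuture<Data> future = CompletableFuture.supplyAsync(() -> loadData(), executor);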
If I change the endpoint to:
public CompletableFuture<Data> getBytesAsync() {
    return CompletableFuture.completedFuture(DATA);
}
it still hangs. As I said, this is not what we do in our real application, but the example behaves in a similar way.
Try with Uni instead of CompletableFuture.
I just tried with:
@GET
@Path("bytesAsync")
@ResponseStatus(200)
@Produces({"application/json", "application/cbor"}) // Not really what is produced...
public CompletableFuture<Data> getBytesAsync() {
    return CompletableFuture.completedFuture(DATA);
}
with the 3 clients:
…/quarkus-llm-workshop | Git bump-0.7.0 | Java v21.0.1 took 2s
07:47:54 ❯ http ":8080/test/count?client=JAVA&iterations=10000"
HTTP/1.1 200 OK
Content-Type: text/plain;charset=UTF-8
content-length: 10
1000000000

…/quarkus-llm-workshop | Git bump-0.7.0 | Java v21.0.1 took 39s
07:50:53 ❯ http ":8080/test/count?client=OK&iterations=10000"
HTTP/1.1 200 OK
Content-Type: text/plain;charset=UTF-8
content-length: 10
1000000000

…/quarkus-llm-workshop | Git bump-0.7.0 | Java v21.0.1 took 37s
07:51:52 ❯ http ":8080/test/count?client=VERTX&iterations=10000"
HTTP/1.1 200 OK
Content-Type: text/plain;charset=UTF-8
content-length: 10
1000000000
No problem at all.
However, I'm using Quarkus 999-SNAPSHOT. Please check with 3.7, as it may have already been fixed.
With 3.7.0.CR7 it still freezes for me. However, I had to increase the number of iterations to 10000, or run the example twice with just 100 iterations, to make the probability of a hang as high as before. I can still sometimes get it on the first try with 100 iterations using the Java client, though.
Can you run the following without a freeze: curl "http://localhost:8080/test/count?client=java&iterations=10000"
And can you do the following twice without a freeze: curl "http://localhost:8080/test/count?client=java"
A lot of our server-side code is built on CompletableFuture; that's why it is convenient to just make the endpoint return it. But to convert the endpoint to return a Uni, I guess we can do something like:
@GET
@Path("bytesAsync")
@ResponseStatus(200)
@Produces({"application/json", "application/cbor"}) // Not really what is produced...
public Uni<Data> getBytesAsync() {
    return Uni.createFrom()
            .completionStage(CompletableFuture.completedFuture(DATA))
            .emitOn(Infrastructure.getDefaultExecutor());
}
However, this also hangs for me.
Edit: I changed the endpoint as suggested to:
@GET
@Path("bytesAsync")
@ResponseStatus(200)
@Produces({"application/json", "application/cbor"}) // Not really what is produced...
public Uni<Data> getBytesAsync() {
    return Uni.createFrom().item(DATA).emitOn(Infrastructure.getDefaultExecutor());
}
On the 11th try, the following call hung:
curl "http://localhost:8080/test/count?client=java&iterations=1000"
Ok, I will try to make it hang on my machine - so far no luck. I need to generate a thread dump when it happens.
I refactored my client so that it runs in a different process now. threads.txt is a thread dump from Quarkus during the hang. Hopefully that will help. The thread dump is from using the Uni endpoint as you suggested (without CompletableFuture at all). client_threads.txt is a thread dump from the client.
Hum, no deadlock. Can you run it without the dev mode?
threadsNotDevMode.txt is a thread dump from when not running in dev mode.
So, I've dug a bit more. It seems to be something in RESTEasy Reactive (server side). I was unable to get it to hang with a plain route.
Next step... it seems to be due to the serialization of the content when emitted from a Uni or CompletionStage.
Can you try with this serializer?
@Provider
@Produces({"application/json", "application/cbor"})
public class DataBodyWriter implements ServerMessageBodyWriter<Data> {

    @Override
    public boolean isWriteable(Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
        return Data.class.isAssignableFrom(type);
    }

    @Override
    public void writeTo(Data data, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType, MultivaluedMap<String, Object> httpHeaders, OutputStream entityStream) throws IOException, WebApplicationException {
        entityStream.write(data.bytes());
    }

    @Override
    public boolean isWriteable(Class<?> type, Type genericType, ResteasyReactiveResourceInfo target, MediaType mediaType) {
        return true;
    }

    @Override
    public void writeResponse(Data o, Type genericType, ServerRequestContext context) throws WebApplicationException, IOException {
        context.serverResponse().end(o.bytes());
    }
}
Not sure what's the problem yet.
That actually seems to solve the problem, and might be a feasible workaround for our real application. I will test that as well.
It seems to solve the problem in our real application as well. However, since ServerMessageBodyWriter extends MessageBodyWriter, can we be sure that this workaround covers all code paths, or can there still be cases where we might get the hang?
Thanks for the feedback! It's a workaround, I need to check exactly what's going on.
The new class is what we use internally. So, it's safe. But it should not be required, so something is definitely off.
Ok, I believe I understand the issue. @geoand, fasten your seat belt.
So, in this context, the endpoint is executed on the event loop (because it either returns a Uni or a CompletionStage). The application contains its own serializer.
In ServerSerialisers, invokeWriter detects whether the serializer is a ServerMessageBodyWriter (that's the workaround posted above) or a plain MessageBodyWriter. The latter case is faulty in this context; it still needs to be clarified exactly what is going on, as I do not see deadlocks or blocked event loops.
Looking at the code, I'm not sure the MessageBodyWriter should be called from the event loop. @geoand, is it allowed to invoke that on the event loop? I can't see anything blocking; however, because we pass null as handlers, we may not orchestrate the calls correctly. They should all be enqueued in the same event loop queue. However, something weird happens.
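To make the distinction concrete, here is a rough sketch of the shape of that dispatch. It is not the actual ServerSerialisers code; it only reuses the two writer signatures already shown in the workaround above:

// Illustration only: not the real ServerSerialisers.invokeWriter implementation, just the
// shape of the dispatch described above.
void invokeWriterSketch(MessageBodyWriter<Object> writer, Object entity, Class<?> entityClass,
                        Type genericType, Annotation[] annotations, MediaType mediaType,
                        MultivaluedMap<String, Object> httpHeaders, OutputStream outputStream,
                        ServerRequestContext requestContext) throws IOException {
    if (writer instanceof ServerMessageBodyWriter) {
        // Server-aware path (the workaround above takes this branch): writes via the server response.
        ((ServerMessageBodyWriter<Object>) writer).writeResponse(entity, genericType, requestContext);
    } else {
        // Plain MessageBodyWriter path, driven through a blocking OutputStream.
        // This is the case reported as faulty when the endpoint runs on the event loop.
        writer.writeTo(entity, entityClass, genericType, annotations, mediaType, httpHeaders, outputStream);
    }
}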
@geoand is it allowed to invoke that on the event loop?
Yeah, there is no real reason why they can't
How can I try reproduce the problem?
I'm using the second reproducer.
Change gradle.properties to:
#Gradle properties
quarkusPluginId=io.quarkus
quarkusPluginVersion=999-SNAPSHOT
quarkusPlatformGroupId=io.quarkus
quarkusPlatformArtifactId=quarkus-bom
quarkusPlatformVersion=999-SNAPSHOT
Run the app (gradle quarkusDev)
In a terminal, run: http ":8080/test/count?client=JAVA&iterations=10000"
Thanks!
I'll have a look tomorrow
The weird thing is:
Very very weird indeed. I'll try and get to the bottom of it tomorrow
In a terminal, run: http ":8080/test/count?client=JAVA&iterations=10000"
Does the problem manifest when just hitting bytesAsync, or is this very high number of iterations needed?
I did a very rough test and can't reproduce the problem. It seems like I can reproduce it with a high enough number of iterations.
There might be some super obscure race condition in ResteasyReactiveOutputStream...
The problem is that I don't see anything weird in the thread dump.
I ended up with the same conclusion. That's why I mentioned the null handlers. Normally, it's not an issue because everything runs on the same event loop, so we just enqueue the task.
I've checked that we are not buffering, so we only do response.write(...), and when close is called: response.end().
Holding the connection lock can be an issue, but thread dumps didn't show any deadlock.
Yeah... So I am really at a loss about what to do...
It's more and more strange. I've instrumented the code to check if the responses were ended. Indeed, based on the report (idle timeout thrown), the request is processed, but the response has not ended.
That is indeed the case... For some unknown reason, the responses have not ended. But here is the mystery: the number of responses not ended is always 62 on my machine. Why this magic number?
Does it have something to do with the fork join pool threads or something?
No, it's not related to the F/J pool. That was my first assumption, but I still have the same issue without it.
A few new discoveries:
Lovely...
Ok, so @geoand and I did some debugging. The drain handler is not called for some reason. It seems to be a bug in Vert.x.
It is related to HTTP/2 with TLS (otherwise it works).
@vietj any reason why the drain handler would not be called in this case?
@cescoffier and I are wondering if this could even be a Netty issue, as it only happens when using HTTP/2 and SSL.
Moreover, the furthest down the rabbit hole I've gone is to find that Http2ServerResponse#handlerWritabilityChanged is never called with writable set to true for the problematic case (while it does get called as expected for all other cases).
@vietj ^
Actually, I see the same thing happening for HTTP/1.x as well, so HTTP/2 does not seem to be required to trigger the issue, just the use of SSL.
And now I am seeing even more complicated behavior...
I am seeing the handlerWritabilityChanged being called, but before we register our drain handlers...
So I am still not sure where the problem lies...
I am starting to think there might be two different things:
handlerWritabilityChanged not being called sometimes when SSL is being used

https://github.com/quarkusio/quarkus/pull/38444 fixes the second part of my comment above; however, the first part still remains, and I believe it's a Vert.x or Netty issue.
cc @cescoffier @franz1981 @vietj
Hi, just wanted to let you know that the following ServerMessageBodyWriter also hangs:
@Provider
@Produces({"application/json", "application/cbor"})
public class DataBodyWriter implements ServerMessageBodyWriter<Data> {

    @Override
    public boolean isWriteable(Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
        return Data.class.isAssignableFrom(type);
    }

    @Override
    public void writeTo(Data data, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType, MultivaluedMap<String, Object> httpHeaders, OutputStream entityStream) throws IOException, WebApplicationException {
        entityStream.write(data.bytes());
    }

    @Override
    public boolean isWriteable(Class<?> type, Type genericType, ResteasyReactiveResourceInfo target, MediaType mediaType) {
        return Data.class.isAssignableFrom(type);
    }

    @Override
    public void writeResponse(Data o, Type genericType, ServerRequestContext context) throws WebApplicationException, IOException {
        try (var out = context.getOrCreateOutputStream()) {
            out.write(o.bytes());
        }
    }
}
Hope this is helpful.
Thanks,
Given that the code is doing:
@Override
public void writeResponse(Data o, Type genericType, ServerRequestContext context) throws WebApplicationException, IOException {
    try (var out = context.getOrCreateOutputStream()) {
        out.write(o.bytes());
    }
}
and the findings so far, it's expected that it hangs
I've updated the PR with a possible workaround.
Describe the bug
Requests hang when an endpoint returns a CompletableFuture and SSL is active. When SSL is turned off or the endpoint is changed to not return a CompletableFuture, the requests do not seem to hang. The requests hang for the time specified in quarkus.http.idle-timeout.
Expected behavior
Requests should not hang.
Actual behavior
Requests hang.
How to Reproduce?
hanging-ssl.zip
(The hang lasts for however long quarkus.http.idle-timeout is, but that defaults to 30m.) The example code uses OkHttpClient, but it is possible to reproduce with java.net.HttpClient as well, though that seems to require more iterations.
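For illustration, a minimal sketch of that java.net.HttpClient variant. The trust-all SSL context and the URL are assumptions matching the reproducer's self-signed localhost setup (depending on the certificate, hostname verification may also need to be relaxed), and the iteration count is arbitrary:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.IntStream;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class JavaHttpClientSketch {
    public static void main(String[] args) throws Exception {
        // Trust-all context so the self-signed localhost certificate is accepted (never do this in production).
        TrustManager[] trustAll = new TrustManager[]{new X509TrustManager() {
            public void checkClientTrusted(X509Certificate[] chain, String authType) {}
            public void checkServerTrusted(X509Certificate[] chain, String authType) {}
            public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
        }};
        SSLContext ssl = SSLContext.getInstance("TLS");
        ssl.init(null, trustAll, new SecureRandom());

        HttpClient client = HttpClient.newBuilder().sslContext(ssl).build(); // one shared client
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://localhost:8443/test/bytesAsync")).build();

        // Fire many requests concurrently; on the affected versions, some of them never complete.
        List<CompletableFuture<Integer>> futures = IntStream.range(0, 100)
                .mapToObj(i -> client.sendAsync(request, HttpResponse.BodyHandlers.ofByteArray())
                        .thenApply(r -> r.body().length))
                .toList();
        int total = futures.stream().mapToInt(CompletableFuture::join).sum();
        System.out.println(total);
    }
}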
Output of uname -a or ver
No response
Output of java -version
No response
Quarkus version or git rev
3.6.1
Build tool (i.e. output of mvnw --version or gradlew --version)
gradle
Additional information
No response