lfoppiano / grobid-quantities

GROBID extension for identifying and normalizing physical quantities.
https://grobid-quantities.readthedocs.io
Apache License 2.0

Increase idle timeout #159

Open uneetsingh opened 1 year ago

uneetsingh commented 1 year ago

Hi! I have been attempting to increase the idleTimeout, but it seems to remain at the default of 30 seconds.

Even though idleTimeout is set to 120 seconds here, Jetty is still enforcing its default 30s timeout. Below are a few logs to make the case:

[05/Jun/2023:07:42:15 +0000] "POST /service/processQuantityText HTTP/1.1" 200 47605 "-" "python-requests/2.31.0" 30181
[05/Jun/2023:07:42:16 +0000] "POST /service/processQuantityText HTTP/1.1" 200 47605 "-" "python-requests/2.31.0" 30678
[05/Jun/2023:07:42:16 +0000] "POST /service/processQuantityText HTTP/1.1" 200 47605 "-" "python-requests/2.31.0" 31110
[05/Jun/2023:07:42:16 +0000] "POST /service/processQuantityText HTTP/1.1" 200 47604 "-" "python-requests/2.31.0" 29445
[05/Jun/2023:07:42:17 +0000] "POST /service/processQuantityText HTTP/1.1" 200 47605 "-" "python-requests/2.31.0" 29920
[05/Jun/2023:07:42:17 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30052
[05/Jun/2023:07:42:17 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30052
[05/Jun/2023:07:42:17 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30050
[05/Jun/2023:07:42:17 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30051
[05/Jun/2023:07:42:17 +0000] "POST /service/processQuantityText HTTP/1.1" 200 47605 "-" "python-requests/2.31.0" 30405
[05/Jun/2023:07:42:18 +0000] "POST /service/processQuantityText HTTP/1.1" 200 47605 "-" "python-requests/2.31.0" 31708
[05/Jun/2023:07:42:18 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30051
[05/Jun/2023:07:42:18 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30051
[05/Jun/2023:07:42:18 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30051
[05/Jun/2023:07:42:18 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30052
[05/Jun/2023:07:42:18 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30050
[05/Jun/2023:07:42:18 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30051
[05/Jun/2023:07:42:18 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30051
[05/Jun/2023:07:42:18 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30051
[05/Jun/2023:07:42:18 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30054
[05/Jun/2023:07:42:18 +0000] "POST /service/processQuantityText HTTP/1.1" 200 47605 "-" "python-requests/2.31.0" 32156
[05/Jun/2023:07:42:19 +0000] "POST /service/processQuantityText HTTP/1.1" 200 47605 "-" "python-requests/2.31.0" 32700
[05/Jun/2023:07:42:19 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30055
[05/Jun/2023:07:42:19 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30054
[05/Jun/2023:07:42:19 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30069
[05/Jun/2023:07:42:19 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30063
[05/Jun/2023:07:42:19 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30054
[05/Jun/2023:07:42:19 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30074
[05/Jun/2023:07:42:19 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30066
[05/Jun/2023:07:42:19 +0000] "POST /service/processQuantityText HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 30068
[05/Jun/2023:07:42:19 +0000] "POST /service/processQuantityText HTTP/1.1" 200 47605 "-" "python-requests/2.31.0" 33169
[05/Jun/2023:07:42:20 +0000] "POST /service/processQuantityText HTTP/1.1" 200 47605 "-" "python-requests/2.31.0" 33582
[05/Jun/2023:07:42:20 +0000] "POST /service/processQuantityText HTTP/1.1" 200 47605 "-" "python-requests/2.31.0" 34018
[05/Jun/2023:07:42:20 +0000] "POST /service/processQuantityText HTTP/1.1" 200 47604 "-" "python-requests/2.31.0" 30622
[05/Jun/2023:07:42:21 +0000] "POST /service/processQuantityText HTTP/1.1" 200 47605 "-" "python-requests/2.31.0" 31872
[05/Jun/2023:07:42:21 +0000] "POST /service/processQuantityText HTTP/1.1" 200 47605 "-" "python-requests/2.31.0" 32318
[05/Jun/2023:07:42:22 +0000] "POST /service/processQuantityText HTTP/1.1" 200 47605 "-" "python-requests/2.31.0" 33799
[05/Jun/2023:07:42:22 +0000] "POST /service/processQuantityText HTTP/1.1" 200 47605 "-" "python-requests/2.31.0" 34354
lfoppiano commented 1 year ago

Hi, thanks for your interest in grobid-quantities. I've reformatted the issue to make it more readable.

How many requests are you sending? From the log it seems that your requests are shorter than 30 seconds. The idleTimeout only applies to requests that last more than 30 seconds.

In this case the requests get rejected because you're sending more than the allowed number of parallel requests. There is an option in the config.yml: maxParallelRequests: 0

When it is set to 0, the limit is the number of CPUs available. How many CPUs do you have? You can also try changing it to 2; the service should then return 503 once you exceed that number.
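For reference, a minimal sketch of what this could look like in config.yml (the exact placement of the key within the file may vary between versions):

  maxParallelRequests: 2   # 0 means default to the number of available CPU cores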

Let me know if you have more questions

uneetsingh commented 1 year ago

Thank you so much for the response.

The log above is from load testing, so the request rate could be around ~50 requests per second. I understand that a lot of them would fail because of that.

How many CPUs do you have?

2

My understanding is this: if I send more requests than allowed by maxParallelRequests, they are queued, and if they remain idle in the queue for more than idleTimeout they fail. Please correct me if this understanding is flawed.

The idleTimeout works for requests that last more than 30 seconds.

This is a little confusing. Can you elaborate on this? If idleTimeout is set to 120s, the extra requests should remain suspended in the queue and their connection should only be closed if they remain idle for more than 120s. Correct?

lfoppiano commented 1 year ago

OK, I was mistaken; I was thinking of the grobid PDF document processing. I implemented this a long time ago, sorry for the trouble.

For this case, there is another parameter in the QoSFilter, called "waitMs", which defaults to 50ms.

If you can rebuild grobid-quantities, you could try adding this line at line 81 of GrobidQuantitiesApplication:

        qos.setInitParameter("waitMs", String.valueOf("120"));

That should change the waiting time to 120 ms. You can increase it further based on your needs. If this works, we can add it as a parameter in the configuration.

Another thing I suggest is to make sure the client handles the 503 response and waits before requesting again. The Python client should do that already: https://github.com/lfoppiano/grobid-quantities-python-client/
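As an illustration, here is a minimal client-side sketch that retries on 503 with backoff using the requests library; the host, port, and payload are assumptions for the example, and the Python client linked above should already behave similarly:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry up to 5 times on 503, waiting longer between each attempt.
retry = Retry(total=5, backoff_factor=1, status_forcelist=[503], allowed_methods=["POST"])
session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retry))

response = session.post(
    "http://localhost:8060/service/processQuantityText",
    files={"text": "The sample was heated to 300 K for 2 hours."},
    timeout=120,
)
response.raise_for_status()
print(response.status_code)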

uneetsingh commented 1 year ago

Thank you. This is very helpful. I will update this thread with feedback once I rerun the service after rebuilding.

lfoppiano commented 1 year ago

I've made the modification in the branch feature/add-wait-ms. You can change the timeout from the configuration.

lfoppiano commented 1 year ago

@uneetsingh did you manage to make any progress on this?

I'd like to release 0.7.3 before the end of the month, including this PR if the solution works. Could you share your script? I'd like to try to reproduce the behaviour you reported.

uneetsingh commented 1 year ago

qos.setInitParameter("waitMs", String.valueOf("120"));

I tried this and the issue still persisted, i.e. requests that took more than 30s failed with a 503 status. Apologies, I couldn't test with the new branch you pushed.

Regarding reproducing the error, I was doing load testing using locust. Below is the simple script:

from locust import HttpUser, task, between
import random

text = '''Put some test text here'''

class AppUser(HttpUser):
    # Each simulated user waits between 1 and 6 seconds between tasks
    wait_time = between(1, 6)

    @task
    def index_page(self):
        # Append a random digit so each payload differs slightly
        response = self.client.post(
            "/service/processQuantityText",
            files={"text": text + str(random.randint(3, 9))},
        )
        if response.status_code != 200:
            print(response.status_code, response.text)

I was spawning 100 users with a spawn rate of 10.
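For reference, a run like that can be reproduced from the command line with something along these lines (the host and port are assumptions):

    locust -f locustfile.py --headless --users 100 --spawn-rate 10 --host http://localhost:8060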

[Update] On the grobid-quantities server side, it was a 2-CPU machine with the default server configuration.

lfoppiano commented 1 year ago

Thanks!

I've looked into this a bit more, and I think waitMs is only used to decide whether or not to suspend the request.

I've revisited the locust test I had written some time ago; you can find the updated test in resources/locust, which is similar to your snippet.

I started the server with maxParallelRequests: 2 and waitMs: 1, ran a test with 100 users for 10 minutes, and did not get any failures. See report: Test Report for locustfile.py.pdf

Then I tried with 10000 users and it started failing after 2000 parallel users: Test Report for locustfile.py.pdf

I suppose it's also related to the number of threads and the queue sizes that are allocated; see the config.yml:

  maxThreads: 2048
  maxQueuedRequests: 2048
  acceptQueueSize: 2048

Your script also uses processQuantityText. I wonder, are you sure there were requests that took more than 30s to complete? Could you check that? It's quite strange that this could happen with text requests.

uneetsingh commented 1 year ago

Thank you!

From the reports you posted, it also seems clear there is some threshold at 30s (see the 10000-users report): the response time has an upper cap of 30s and the response-time curve flattens out at 30s. I was trying to increase this threshold. I also tried changing maxThreads, maxQueuedRequests, and acceptQueueSize, but made no headway there either.

Whether this 503 failure happens at 100 users or at 10000 users depends on the size of the payload. The payload I was sending was three or four paragraphs long.

Your script also uses processQuantityText

Yes, it does. If you look at the logs I posted in the issue description, it's the same endpoint.

lfoppiano commented 1 year ago

@uneetsingh you're right! I did not see it... indeed! I think I now understand what you mean.

I tried to change the idleTimeout parameter but, as you have been saying from the beginning 😅, it's not working.

The only solution could be this: https://github.com/dropwizard/dropwizard/issues/1674#issuecomment-249389852 but I'm not so sure it's doable...

uneetsingh commented 1 year ago

@lfoppiano That's okay! 😅 You have been helping throughout. I really appreciate that.

The dropwizard issue you pointed to seems to address the problem directly. I will try it. What is key here is the meaning of idleTimeout, and that dropwizard issue at least clears it up: idleTimeout is in no way related to the processing time, but rather to the time it takes for the server to handle the request (i.e., to lock a thread for that request).

Will update if I have any solution for this. Thanks a lot again.

lfoppiano commented 1 year ago

Thank you!

By the way, I'm planning to update grobid-quantities to dropwizard 4 at some point (in #136, although it says dropwizard 2) and we might end up having this fixed with a more recent version of jetty. I'm not sure.

lfoppiano commented 1 year ago

Just a quick update; I cannot really let this go so easily 🐶 I've updated the configuration correctly in both master and the PR. If you set the idleTimeout option at the level of the connector, it's correctly loaded into the configuration and into the server connector (it's a bit hard to see):

image

However, after testing again with locust, it still caps at 30s.

If I debug the AbstractConnector class and make a request, the value is correctly set:

image

So I haven't followed the whole trail, but it might be useful to double-check with dropwizard 4.0 and a newer version of jetty.
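For reference, the connector-level setting discussed above would look roughly like this in the Dropwizard server section of the configuration (a sketch; the connector type and port here are assumptions based on the logs above):

  server:
    applicationConnectors:
      - type: http
        port: 8060
        idleTimeout: 120 seconds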

lfoppiano commented 1 year ago

I think the idleTimeout works fine.

E.g., if I set it to 2 seconds, I get:

DEBUG  [11:05:43.872] [dw-26] o.e.j.i.IdleTimeout -  Setting idle timeout 0 -> 2000 on SocketChannelEndPoint@e7d8c9e[{l=/127.0.0.1:8060,r=/127.0.0.1:53272,OPEN,fill=-,flush=-,to=13/2000}{io=0/0,kio=0,kro=0}]->[<null>]

and then it fails when it goes over 2 seconds:

Timeout -  SocketChannelEndPoint@e7d8c9e[{l=/127.0.0.1:8060,r=/127.0.0.1:53272,OPEN,fill=-,flush=-,to=1764/2000}{io=0/0,kio=0,kro=1}]->[HttpConnection@3104d8f1[p=HttpParser{s=END,2659208 of 2659208},g=HttpGenerator@7ec3ba88{s=START}]=>HttpChannelOverHttp@2e096ac{s=HttpChannelState@110cb986{s=HANDLING rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=1},r=1,c=false/false,a=HANDLING,uri=http://localhost:8060/service/annotateQuantityPDF,age=1967}] idle timeout check, elapsed: 1764 ms, remaining: 236 ms 
DEBUG  [11:05:46.103] [Scheduler-713702106-1] o.e.j.i.IdleTimeout -  SocketChannelEndPoint@e7d8c9e[{l=/127.0.0.1:8060,r=/127.0.0.1:53272,OPEN,fill=-,flush=-,to=2006/2000}{io=0/0,kio=0,kro=1}]->[HttpConnection@3104d8f1[p=HttpParser{s=END,2659208 of 2659208},g=HttpGenerator@7ec3ba88{s=START}]=>HttpChannelOverHttp@2e096ac{s=HttpChannelState@110cb986{s=HANDLING rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=1},r=1,c=false/false,a=HANDLING,uri=http://localhost:8060/service/annotateQuantityPDF,age=2208}] idle timeout check, elapsed: 2006 ms, remaining: -6 ms 
DEBUG  [11:05:46.103] [Scheduler-713702106-1] o.e.j.i.IdleTimeout -  SocketChannelEndPoint@e7d8c9e[{l=/127.0.0.1:8060,r=/127.0.0.1:53272,OPEN,fill=-,flush=-,to=2006/2000}{io=0/0,kio=0,kro=1}]->[HttpConnection@3104d8f1[p=HttpParser{s=END,2659208 of 2659208},g=HttpGenerator@7ec3ba88{s=START}]=>HttpChannelOverHttp@2e096ac{s=HttpChannelState@110cb986{s=HANDLING rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=1},r=1,c=false/false,a=HANDLING,uri=http://localhost:8060/service/annotateQuantityPDF,age=2208}] idle timeout expired 
DEBUG  [11:05:46.104] [Scheduler-713702106-1] o.e.j.i.FillInterest -  onFail FillInterest@7cfd432c{null} 
java.util.concurrent.TimeoutException: Idle timeout expired: 2006/2000 ms
    at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:170)
    at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:112)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)

If I set it to 33 seconds, I get a timeout as before:

java.util.concurrent.TimeoutException: Idle timeout expired: 33001/33000 ms
    at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:171)
    at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:113)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)

Then, if I set it to 60 seconds, I get no such exceptions... In all cases, the maximum request time was at most 30s:

image

I'm starting to believe the idleTimeout is correctly set, and the problem is somewhere else.

lfoppiano commented 9 months ago

@uneetsingh I'm preparing the new release 0.8.0 which, in addition to updating grobid, updates dropwizard to version 4. Have you made any progress on this issue?

uneetsingh commented 9 months ago

Hi @lfoppiano, no headway on that issue. I hope that version 4 will solve it.

lfoppiano commented 9 months ago

I've merged PR #136 into master. I did some tests: the logs don't mention any timeout, but I do see some 503s once requests reach 30s. Could you give it a try? You can use the docker image lfoppiano/grobid-quantities:latest-develop, which should have all the fixes.

Here there was a fix in Jetty 10 to make the idle timeout work; however, I'm not sure what else it could be.