uneetsingh opened this issue 1 year ago
Hi, thanks for your interest in grobid-quantities. I've reformatted the issue to make it more readable.
How many requests are you sending?
From the log it seems that your requests take less than 30 seconds. The idleTimeout works for requests that last more than 30 seconds.
In this case the requests get rejected because you're sending more than the allowed number of parallel requests. There is an option in the config.yml:
maxParallelRequests: 0
When it is set to 0, the limit is the number of CPUs available. How many CPUs do you have? You can also try to change it to 2; it should return 503 once you exceed that number.
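For reference, a minimal sketch of that option in config.yml (the key name is taken from this thread; the exact nesting in the grobid-quantities configuration may differ):

```yaml
# Maximum number of requests processed in parallel by the QoSFilter;
# 0 means "use the number of available CPUs", requests above the limit get a 503.
maxParallelRequests: 2
```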
Let me know if you have more questions
Thank you so much for the response.
The above log is from a load test, so the request rate was around ~50 requests per second, and hence I understand a lot of them would fail.
How many CPUs do you have?
2
My understanding is this:
If I send more requests than allowed by maxParallelRequests, they are queued, and if they remain idle in the queue for longer than idleTimeout, they fail. Please correct me if this understanding is flawed.
The idleTimeout works for requests that last more than 30 seconds.
This is a little confusing. Can you elaborate on this? If idleTimeout is set to 120s, the extra requests should remain suspended in the queue, and their connections should be closed only if they remain idle for more than 120s. Correct?
OK, I was mixing this up with the grobid PDF document processing, which I implemented a long time ago. Sorry for the trouble.
For this case, there is another parameter in the QoSFilter, called "waitMs", which defaults to 50 ms.
If you can rebuild grobid-quantities, you could try adding this line at line 81 of GrobidQuantitiesApplication:
qos.setInitParameter("waitMs", String.valueOf("120"));
That should change the waiting time to 120 ms. You can increase it further based on your needs. If this works, we can add it as a parameter in the configuration.
Another thing I suggest is to make sure the client handles the 503 response and waits before requesting again. The Python client should do that already: https://github.com/lfoppiano/grobid-quantities-python-client/
Thank you. This is very helpful. I will update this thread with feedback once I rerun the service after rebuilding.
I've made the modification in the branch feature/add-wait-ms. You can change the timeout from the configuration.
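As a sketch only, the branch presumably exposes something like the following in config.yml; the key name here is hypothetical, so check the feature/add-wait-ms branch for the real option:

```yaml
# Hypothetical key name, for illustration only: the actual option on the
# feature/add-wait-ms branch may be named differently. The value is the
# QoSFilter wait time in milliseconds, i.e. how long a request waits for
# a free slot before being suspended.
waitMs: 120
```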
@uneetsingh did you manage to make any progress on this?
I'd like to release 0.7.3 before the end of the month, including this PR if this solution works. Could you share your script? I'd like to try to reproduce the behaviour you reported.
qos.setInitParameter("waitMs", String.valueOf("120"));
I tried this and the issue still persisted, i.e. requests that took more than 30s failed with a 503 status. And apologies, I couldn't test using the new branch that you pushed.
Regarding reproducing the error, I was doing load testing using locust. Below is the simple script:
```python
from locust import HttpUser, task, between
import random

text = '''Put some test text here'''

class AppUser(HttpUser):
    # each simulated user waits 1-6 seconds between tasks
    wait_time = between(1, 6)

    @task
    def index_page(self):
        # send the text as multipart form data, appending a random digit to vary the payload
        response = self.client.post("/service/processQuantityText", files={
            "text": text + str(random.randint(3, 9))
        })
        if response.status_code != 200:
            print(response.status_code, response.text)
```
I was spawning 100 users with a spawn rate of 10.
[Update] On the grobid-quantities server front, it was a 2-CPU machine with the default server configuration.
Thanks!
I've looked into the matter a bit more, and I think waitMs is used only to decide whether or not to suspend the request.
I've revised the locust test I had written some time ago; you can find the updated test in resources/locust, which is similar to your snippet.
I started the server with maxParallelRequests: 2 and waitMs: 1, and I ran a test with 100 users for 10 minutes, and I did not get any failure. See report: Test Report for locustfile.py.pdf
Then I tried with 10000 users and it started failing after 2000 parallel users: Test Report for locustfile.py.pdf
I suppose it's also related to the number of threads and the queue sizes that are allocated; see from the config.yaml:
```yaml
maxThreads: 2048
maxQueuedRequests: 2048
acceptQueueSize: 2048
```
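For context, a sketch of where these knobs typically sit in a standard Dropwizard configuration (the grobid-quantities config.yml may organize them slightly differently; the port is only an example):

```yaml
server:
  maxThreads: 2048          # size of the Jetty worker thread pool
  maxQueuedRequests: 2048   # requests that can wait for a worker before being rejected
  applicationConnectors:
    - type: http
      port: 8060
      acceptQueueSize: 2048 # connector-level accept queue size
```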
Your script also uses processQuantityText. I wonder, are you sure there were requests that took more than 30s to complete? Could you check that? It's quite strange that this could happen with text requests.
Thank you!
From the reports you have posted as well, it is clear there is some threshold of 30s (see the 10000-users report): the response time has an upper cap of 30s and the response-time curve plateaus there. I was trying to increase this threshold. I also tried changing maxThreads, maxQueuedRequests and acceptQueueSize, but made no headway there either.
Whether this 503 failure happens at 100 users or at 10000 users depends on the size of the payload. The payload I was sending was three to four paragraphs long.
Your script also uses processQuantityText
Yes, it does. If you look at the logs I posted in the issue description, it's the same endpoint.
@uneetsingh you're right! I did not see it... indeed! I think now I understand what you mean.
I tried to change the idleTimeout parameter, but, as you have been saying since the beginning 😅, it's not working.
The only solution could be this: https://github.com/dropwizard/dropwizard/issues/1674#issuecomment-249389852 but I'm not so sure it's doable...
@lfoppiano That's okay! 😅 You have been helping throughout. I really appreciate that.
The dropwizard issue you have pointed to seems to address the problem directly. I will try to do it. What is key here is the meaning of idleTimeout, and that dropwizard issue at least clears it up: idleTimeout is in no way related to the processing time, but rather to the time it takes for the server to handle the request (lock a thread for that request).
Will update if I have any solution for this. Thanks a lot again.
Thank you!
By the way, I'm planning to update grobid-quantities to dropwizard 4 at some point (in #136, although it says dropwizard 2) and we might end up having this fixed with a more recent version of jetty. I'm not sure.
Just a quick update: I cannot really let this go so easily 🐶
I've updated the configuration correctly in both master and the PR.
If you set the idleTimeout option at the level of the connector, it's correctly loaded in the configuration and in the server connector (it's a bit hard to see).
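For illustration, a minimal sketch of what setting idleTimeout at the connector level looks like in a Dropwizard config (standard HttpConnectorFactory option; the port here is only an example):

```yaml
server:
  applicationConnectors:
    - type: http
      port: 8060
      idleTimeout: 120 seconds   # per-connection idle timeout; Jetty's default is 30 seconds
```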
However, after testing again with locust, it still caps at 30s.
If I debug the class AbstractConnector and make a request, the value is correctly set.
So I haven't followed the whole trail, but it might be useful to double-check with dropwizard 4.0 and a newer version of jetty.
I think the idleTimeout works fine.
E.g. if I set it to 2 seconds, I get:
DEBUG [11:05:43.872] [dw-26] o.e.j.i.IdleTimeout - Setting idle timeout 0 -> 2000 on SocketChannelEndPoint@e7d8c9e[{l=/127.0.0.1:8060,r=/127.0.0.1:53272,OPEN,fill=-,flush=-,to=13/2000}{io=0/0,kio=0,kro=0}]->[<null>]
and then it fails when it goes over 2 seconds:
Timeout - SocketChannelEndPoint@e7d8c9e[{l=/127.0.0.1:8060,r=/127.0.0.1:53272,OPEN,fill=-,flush=-,to=1764/2000}{io=0/0,kio=0,kro=1}]->[HttpConnection@3104d8f1[p=HttpParser{s=END,2659208 of 2659208},g=HttpGenerator@7ec3ba88{s=START}]=>HttpChannelOverHttp@2e096ac{s=HttpChannelState@110cb986{s=HANDLING rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=1},r=1,c=false/false,a=HANDLING,uri=http://localhost:8060/service/annotateQuantityPDF,age=1967}] idle timeout check, elapsed: 1764 ms, remaining: 236 ms
DEBUG [11:05:46.103] [Scheduler-713702106-1] o.e.j.i.IdleTimeout - SocketChannelEndPoint@e7d8c9e[{l=/127.0.0.1:8060,r=/127.0.0.1:53272,OPEN,fill=-,flush=-,to=2006/2000}{io=0/0,kio=0,kro=1}]->[HttpConnection@3104d8f1[p=HttpParser{s=END,2659208 of 2659208},g=HttpGenerator@7ec3ba88{s=START}]=>HttpChannelOverHttp@2e096ac{s=HttpChannelState@110cb986{s=HANDLING rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=1},r=1,c=false/false,a=HANDLING,uri=http://localhost:8060/service/annotateQuantityPDF,age=2208}] idle timeout check, elapsed: 2006 ms, remaining: -6 ms
DEBUG [11:05:46.103] [Scheduler-713702106-1] o.e.j.i.IdleTimeout - SocketChannelEndPoint@e7d8c9e[{l=/127.0.0.1:8060,r=/127.0.0.1:53272,OPEN,fill=-,flush=-,to=2006/2000}{io=0/0,kio=0,kro=1}]->[HttpConnection@3104d8f1[p=HttpParser{s=END,2659208 of 2659208},g=HttpGenerator@7ec3ba88{s=START}]=>HttpChannelOverHttp@2e096ac{s=HttpChannelState@110cb986{s=HANDLING rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=1},r=1,c=false/false,a=HANDLING,uri=http://localhost:8060/service/annotateQuantityPDF,age=2208}] idle timeout expired
DEBUG [11:05:46.104] [Scheduler-713702106-1] o.e.j.i.FillInterest - onFail FillInterest@7cfd432c{null}
java.util.concurrent.TimeoutException: Idle timeout expired: 2006/2000 ms
at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:170)
at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:112)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
If I set it to 33 seconds, I get a timeout as before:
java.util.concurrent.TimeoutException: Idle timeout expired: 33001/33000 ms
at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:171)
at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:113)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Then, if I set it to 60 seconds, I get no such exceptions... In all cases the maximum request time was capped at 30s.
I'm starting to believe the idleTimeout is correctly set and the problem is somewhere else.
@uneetsingh I'm preparing the new release 0.8.0, which, besides updating grobid, updates dropwizard to version 4. Have you had any progress on this issue?
Hi @lfoppiano, No headway on that issue. I hope that Version 4 will solve this.
I've merged PR #136 into master. I did some tests; although the logs don't mention any timeout, I see some 503s once requests reach 30s.
Could you give it a try? You can use the docker image lfoppiano/grobid-quantities:latest-develop, which should have all the fixes.
Here there was a fix in Jetty 10 to make the idle timeout work; however, I'm not sure what else it could be.
Hi! I have been attempting to increase the idleTimeout, but it seems like it remains at the default 30 seconds.
Even though idleTimeout is set to 120 seconds here, jetty is still enforcing its default 30s timeout. Below are a few logs to make the case: