KilianMichiels opened this issue 2 years ago
Per #1615
I am also having this problem. My model sometimes loads correctly and sometimes doesn't. When it does load, the response time is roughly 119s, so I need to increase this threshold to load the model reliably. When I set default_response_timeout and/or TS_DEFAULT_RESPONSE_TIMEOUT, nothing changes.
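For reference, this is roughly what I am trying (values and layout are illustrative, not my exact setup):

# config.properties
default_response_timeout=300
# as far as I understand, this is needed for TS_* environment variables to be read
enable_envvars_config=true

# or via an environment variable before starting TorchServe
export TS_DEFAULT_RESPONSE_TIMEOUT=300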
@msaroufim Is there a plan to fix this? This is a big issue for me. Thanks!
The logs below show an example of when model loading fails:
org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Backend worker did not respond in given time
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:199) ~[model-server.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Backend worker did not respond in given time
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:199) ~[model-server.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Hi @alexgaskell10,
As this received the wontfix label, and I did not get the impression this was high priority, I started looking a bit more into the Java server code.
It turns out the relevant parameter setting the 120s default can be found here:
public RegisterWorkflowRequest(QueryStringDecoder decoder) {
workflowName = NettyUtils.getParameter(decoder, "workflow_name", null);
responseTimeout = NettyUtils.getIntParameter(decoder, "response_timeout", 120); // This is the one!
workflowUrl = NettyUtils.getParameter(decoder, "url", null);
s3SseKms = Boolean.parseBoolean(NettyUtils.getParameter(decoder, "s3_sse_kms", "false"));
}
So as a temporary fix you could replace this with your own desired value and build TorchServe from source:
# Assuming you cloned the repo.
cd /serve/
# Replace the complete line using sed
sed -i 's/getIntParameter(decoder, "response_timeout", 120)/getIntParameter(decoder, "response_timeout", 300)/' frontend/server/src/main/java/org/pytorch/serve/workflow/messages/RegisterWorkflowRequest.java
# Build from source
python3.8 ts_scripts/install_dependencies.py --cuda=cu113 --environment=dev
python3.8 ts_scripts/install_from_src.py
Note: I ended up using v0.6.0 as the scripts in older versions seemed to fail for my application.
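Alternatively, since the snippet above reads response_timeout from the register request's query string (and the model register API has the same parameter), it might be worth passing it explicitly when registering. Untested on my side; host, port and archive name below are just an example, assuming the default management port:

curl -X POST "http://localhost:8081/models?url=my_model.mar&response_timeout=300"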
Hope this helps! Good luck!
@Michielskilian ah excellent thanks, I'll do this!
Edit: To add to this, the 'default_response_timeout' in config.properties does seem to be working fine actually.
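For anyone comparing setups, this is roughly how I start it (paths and names are illustrative rather than my exact command):

torchserve --start --model-store model_store --models my_model.mar --ts-config config.properties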
Hi @alexgaskell10,
I tried the simple example I described above with v0.6.0, but the bug still occurs: the timeout remains at 120s. I checked whether config.properties is read properly by changing the ports for the management and inference URLs, and it does pick up those changes.
Could you provide some more details on your setup so I can see what the differences are?
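For completeness, the config.properties I used for that check was along these lines (addresses and timeout value are illustrative, not my exact file):

inference_address=http://0.0.0.0:8085
management_address=http://0.0.0.0:8086
default_response_timeout=300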
@Michielskilian, I also experience this bug in v0.6.1.
Context
I'm trying to increase the configuration parameter default_response_timeout to a value higher than 120 seconds, but it seems that neither the default_response_timeout parameter nor the TS_DEFAULT_RESPONSE_TIMEOUT environment variable is read correctly.
Your Environment
Installed using source? [yes/no]: no, installed using pip
Are you planning to deploy it using docker container? [yes/no]: yes
Is it a CPU or GPU environment?: GPU
Using a default/custom handler? [If possible upload/share custom handler/model]: yes, see simple example below.
What kind of model is it e.g. vision, text, audio?: NA
Are you planning to use local models from model-store or public url being used e.g. from S3 bucket etc.?
Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs:
Link to your project [if any]: NA
Expected Behavior
I set the parameter default_response_timeout in the config.properties file to a high value. The worker does not time out before this value is reached.
Current Behavior
I set the parameter default_response_timeout in the config.properties file to any other value. The worker still uses the default timeout of 120 seconds.
Possible Solution
I do not have enough knowledge of the internal workings of TorchServe to provide a solution at this time. This is as far as I got in tracing down where the variable is read: https://github.com/pytorch/serve/blob/30f83500b0850e26ec55581f48a9307b1986f9f9/frontend/server/src/main/java/org/pytorch/serve/util/ConfigManager.java#L62
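In case it is useful to whoever picks this up, a quick way to locate the other places the response_timeout default shows up in the frontend code (just a grep from the repo root, nothing more):

grep -rn "response_timeout" frontend/server/src/main/java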
Steps to Reproduce
Get any model.mar file using the examples (so first git clone this repo).
Create a simple handler dummy.py (see the sketch after these steps).
Archive the workflow.
Create the config.properties file.
Set the TS_DEFAULT_RESPONSE_TIMEOUT environment variable and enable_envvars_config to true.
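The dummy.py I refer to is not reproduced here; a minimal sketch of the kind of handler that triggers the timeout would look something like the following (class name and sleep duration are just placeholders):

# dummy.py - hypothetical handler whose initialization exceeds the 120 s default
import time
from ts.torch_handler.base_handler import BaseHandler

class DummyHandler(BaseHandler):
    def initialize(self, context):
        # simulate a model load that takes longer than the 120 s default timeout
        time.sleep(150)
        self.initialized = True

    def handle(self, data, context):
        # trivial response, one entry per request in the batch
        return ["ok"] * len(data)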
Failure Logs [if any]