rmdort opened this issue 7 years ago
@Corey-Zumar can you look into this?
@rmdort Can you provide some information about the latency SLO of your application and approximate input size?
@dcrankshaw The issue seems to be that we fail to handle exceptions occurring within the context of the future returned by decode_and_handle_predict()
within the query frontend. As a result, no HTTP response is returned and the connection eventually times out. httplib then throws the insufficiently descriptive connection aborted error for Python 2.7.
@rmdort This implies that there's an additional error occurring that we'll have trouble debugging until we can handle future exception propagation.
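In the meantime, a client-side guard can keep the dropped connection from crashing the caller. This is a hedged sketch: `safe_predict` and `post_fn` are hypothetical names, not part of Clipper's API. Until the query frontend propagates exceptions from the prediction future, a server-side failure surfaces to the client only as an aborted or timed-out connection, so the client should treat it as a transient error:

```python
# Hypothetical client-side guard; safe_predict and post_fn are
# illustrative names, not part of Clipper's API. post_fn is any
# callable that issues the HTTP request (e.g. a wrapper around a
# POST to the /predict endpoint) and raises on a dropped connection.
import time

def safe_predict(post_fn, payload, retries=3, backoff_s=0.0):
    """Call post_fn(payload), retrying on connection-level failures."""
    last_err = None
    for _ in range(retries):
        try:
            return post_fn(payload)
        except (ConnectionError, OSError) as err:  # aborted / timed-out socket
            last_err = err
            time.sleep(backoff_s)
    raise RuntimeError("prediction failed after %d attempts" % retries) from last_err
```

With `requests`, `post_fn` would catch `requests.exceptions.ConnectionError` instead; the retry structure is the same.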
@rmdort After #246 is merged, try replicating the error again and monitor Clipper's logs. Let me know if you see one of the following log messages:
[CLIPPER] Unexpected error: The associated promise has been destructed prior to the associated state becoming ready.
or
[QUERYPR...] Unexpected error: The associated promise has been destructed prior to the associated state becoming ready.
OK, noted. Latency is set to 320 ms. We are facing another issue where query_frontend fails to find the model.
[REDIS] Error with command "GET model-staging:0:0":
[QUERYPR...] No selection state found for query with user_id: 0 and label: model-staging
This is happening at least once a day now. We had to re-deploy the model to fix it.
@rmdort #246 was merged. Please let me know what kind of log output you're encountering. I will look into the Redis issue.
The BadStatusLine error seems to be fixed. But I am getting `No connected models` errors when using `predict`. This happens very often. Why does the app disconnect from the model? I always have to re-deploy the model to get it running again. Any logs I should look at?
@rmdort Can you send log output for a container that disconnects in the way you describe? Full log output from the query frontend would also be very helpful.
Sorry for the late reply. The containers are always running, but Clipper's response says `No connected models`. Actually, we have since migrated off Clipper to Kubernetes. We were getting really low concurrency (only 15 concurrent connections) with Clipper, and every now and then Clipper would disconnect from the model and complain `No connected models`, which has caused us a lot of problems in production.
Thanks for looking into this
No problem! Can you upload log output from the clipper/query_frontend
container when this issue occurs, as well as log output from the container that clipper can't find? It would help a lot with the debugging process.
This is from the Docker log, right before it got disconnected from the model:
[16:48:44.376][info] [REDIS] Successfully issued command "GET 59116d96300397120cfecdc0-staging:0:0"
[16:48:44.376][info] [QUERYPR...] Found 1 tasks
[16:48:44.826][info] [RPC] Found message to receive
[16:48:45.571][info] [REDIS] Successfully issued command "GET 59116d96300397120cfecdc0-staging:0:0"
[16:48:45.571][info] [QUERYPR...] Found 1 tasks
[16:48:47.265][info] [RPC] Found message to receive
[16:48:48.429][info] [RPC] Found message to receive
[16:48:49.830][info] [RPC] Found message to receive
[16:48:52.269][info] [RPC] Found message to receive
[16:48:53.109][info] [REDIS] Successfully issued command "GET 596f7a9af983421bdd3632e2-staging:0:0"
[16:48:53.109][info] [QUERYPR...] Found 1 tasks
[16:48:53.109][error] [TASKEXE...] Received task for unknown model: 596f7a9af983421bdd3632e2-staging : 1500482753
[16:48:53.109][error] [QUERYPR...] No connected models found for query with id: 29048
Re-deploy
[16:49:26.584][info] [REDIS] Successfully issued command "GET 596f7a9af983421bdd3632e2-staging:0:0"
[16:49:26.584][info] [QUERYPR...] Found 1 tasks
[16:49:26.584][error] [TASKEXE...] Received task for unknown model: 596f7a9af983421bdd3632e2-staging : 1500482753
[16:49:26.584][error] [QUERYPR...] No connected models found for query with id: 29065
[16:49:27.297][info] [RPC] Found message to receive
[16:49:27.599][info] [RPC] Found message to receive
[16:49:27.599][info] [RPC] Found message to receive
[16:49:27.599][info] [RPC] New container connected
[16:49:27.599][info] [RPC] Container added
[16:49:27.599][info] [REDIS] Successfully issued command "SELECT 3"
[16:49:27.599][info] [REDIS] MESSAGE: hset
[16:49:27.600][info] [REDIS] Successfully issued command "HMSET 596f7a9af983421bdd3632e2-staging,1500482753,0 model_id 596f7a9af983421bdd3632e2-staging:1500482753 model_name 596f7a9af983421bdd3632e2-staging model_version 1500482753 model_replica_id 0 zmq_connection_id 27 batch_size 1 input_type strings"
[16:49:27.600][info] [REDIS] Successfully issued command "SELECT 3"
[16:49:27.600][info] [REDIS] Successfully issued command "HGETALL 596f7a9af983421bdd3632e2-staging,1500482753,0"
[16:49:27.600][info] [REDIS] model_id: 596f7a9af983421bdd3632e2-staging:1500482753
[16:49:27.600][info] [REDIS] model_name: 596f7a9af983421bdd3632e2-staging
[16:49:27.600][info] [REDIS] model_version: 1500482753
[16:49:27.600][info] [REDIS] model_replica_id: 0
[16:49:27.600][info] [REDIS] zmq_connection_id: 27
[16:49:27.600][info] [REDIS] batch_size: 1
[16:49:27.600][info] [REDIS] input_type: strings
[16:49:27.600][info] [CONTAINERS] Adding new container - model: 596f7a9af983421bdd3632e2-staging, version: 1500482753, connection ID: 27, replica ID: 0, input_type: strings
[16:49:27.600][info] [CONTAINERS] Creating new ModelContainer for model 596f7a9af983421bdd3632e2-staging:1500482753, id: 27
[16:49:27.600][info] [THREADPOOL] Work queue created for model 596f7a9af983421bdd3632e2-staging:1500482753, replica 0
[16:49:27.600][info] [CONTAINERS]
If there is any other log file i need to look at, let me know
Do you have log files for the container with name 596f7a9af983421bdd3632e2-staging and version 1500482753?
I came across the same issue. I read through the explanation at http://clipper.ai/tutorials/troubleshooting/ and used `clipper_conn.cm.get_num_replicas(name="hello-world", version=3)` to check the number of model containers; the result is 0. This matches the scenario described in the link, where the model container fails to start normally.
If there is any solution, please post it.
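The replica check above can be automated so a deployment script fails fast instead of serving `No connected models`. A minimal sketch, assuming a callable wrapper: `wait_for_replicas` and `num_replicas_fn` are hypothetical names, and in a real session `num_replicas_fn` would be something like `lambda: clipper_conn.cm.get_num_replicas(name="hello-world", version=3)`:

```python
# Hypothetical helper: poll the replica count after deployment and
# fail with a clear message if the container never registers.
import time

def wait_for_replicas(num_replicas_fn, expected=1, timeout_s=30.0, poll_s=1.0):
    """Poll until at least `expected` replicas are registered, or time out."""
    deadline = time.monotonic() + timeout_s
    n = num_replicas_fn()
    while n < expected and time.monotonic() < deadline:
        time.sleep(poll_s)
        n = num_replicas_fn()
    if n < expected:
        raise TimeoutError(
            "still %d replica(s) after %.0fs; the model container likely "
            "failed to start -- check its logs" % (n, timeout_s))
    return n
```

Calling this right after deploying would have caught the `get_num_replicas(...) == 0` case described above at deploy time.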
There are many different reasons why your container might fail to start. Can you paste the logs from the crashed container?
We are getting this error occasionally with Clipper when using `/predict`. We are sending around 50 prediction requests to Clipper (not in parallel). The logs don't show anything. What might be the cause?
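Since the failure is intermittent, it helps to record exactly which of the ~50 sequential requests fail so they can be matched against the query frontend's logs by timestamp. A hedged sketch: `run_batch` and `predict_fn` are hypothetical names, and the assumption that an error comes back as JSON with an `"error"` field is illustrative, not confirmed Clipper behavior:

```python
# Hypothetical harness for isolating intermittent /predict failures.
# predict_fn would wrap the POST to /predict; here we only record
# which requests fail and why, instead of stopping at the first error.
def run_batch(predict_fn, inputs):
    """Send inputs one at a time; return (results, failures)."""
    results, failures = [], []
    for i, x in enumerate(inputs):
        try:
            resp = predict_fn(x)
        except Exception as err:  # e.g. an aborted connection
            failures.append((i, repr(err)))
            continue
        # Assumption: an error response is JSON with an "error" field
        # rather than an exception.
        if isinstance(resp, dict) and resp.get("error"):
            failures.append((i, resp["error"]))
        else:
            results.append((i, resp))
    return results, failures
```

If the failures cluster in time, that points at a model-container reconnect (as in the logs above) rather than a problem with the requests themselves.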