ucbrise / clipper

A low-latency prediction-serving system
http://clipper.ai
Apache License 2.0

Error when using predict (Connection aborted) #245

Open rmdort opened 7 years ago

rmdort commented 7 years ago

We are getting this error occasionally with clipper when using /predict

We are sending around 50 prediction requests to clipper (not parallel)

ConnectionError: ('Connection aborted.', BadStatusLine("''",))

The logs don't show anything. What might be the cause?
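
For reference, this is roughly what our client side looks like (host, port, and app name are placeholders; the request format follows the Clipper REST docs), with a simple retry around the aborted connections:

```python
# Rough sketch of the client side (host, app name, and input are placeholders
# for our setup); a simple retry works around the occasional aborted connection.
import json
import requests
from requests.exceptions import ConnectionError

CLIPPER_URL = "http://localhost:1337/my-app/predict"  # placeholder app name

def predict(x, retries=3):
    for attempt in range(retries):
        try:
            r = requests.post(CLIPPER_URL,
                              headers={"Content-Type": "application/json"},
                              data=json.dumps({"input": x}))
            return r.json()
        except ConnectionError as e:
            print("attempt %d failed: %s" % (attempt + 1, e))
    raise RuntimeError("prediction failed after %d retries" % retries)

for i in range(50):  # ~50 sequential requests, as described above
    predict("some input %d" % i)
```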

dcrankshaw commented 7 years ago

@Corey-Zumar can you look into this?

Corey-Zumar commented 7 years ago

@rmdort Can you provide some information about the latency SLO of your application and approximate input size?

Corey-Zumar commented 7 years ago

@dcrankshaw The issue seems to be that we fail to handle exceptions occurring within the context of the future returned by decode_and_handle_predict() within the query frontend. As a result, no HTTP response is returned and the connection eventually times out. httplib then throws the insufficiently descriptive connection aborted error for Python 2.7.

@rmdort This implies that there's an additional error occurring that we'll have trouble debugging until we can handle future exception propagation.
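
To illustrate the failure mode, here is a Python analogy (the real code path is C++ in the query frontend, so this is only a sketch of the mechanism): if an exception is raised inside the work a future wraps and nothing ever inspects that future, the error is silently dropped, no response is written, and the client's connection eventually aborts.

```python
# Python analogy only -- not Clipper's actual C++ query frontend code.
from concurrent.futures import ThreadPoolExecutor

def decode_and_handle_predict(raw_request):
    # Stand-in for the frontend handler; any error raised here is stored
    # in the future rather than propagating anywhere visible.
    raise ValueError("malformed input")

executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(decode_and_handle_predict, b"...")

# If nobody checks future.exception() (or calls future.result()), the error
# is swallowed, no HTTP response is ever written, and the client sees the
# connection abort with BadStatusLine.
exc = future.exception()
if exc is not None:
    print("would send an HTTP 500 response here:", exc)
```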

Corey-Zumar commented 7 years ago

@rmdort After #246 is merged, try replicating the error again and monitor Clipper's logs. Let me know if you see one of the following log messages:

[CLIPPER] Unexpected error: The associated promise has been destructed prior to the associated state becoming ready.

or

[QUERYPR...] Unexpected error: The associated promise has been destructed prior to the associated state becoming ready.

rmdort commented 7 years ago

OK, noted. The latency SLO is set to 320 ms. We are also facing another issue where the query_frontend fails to find the model.

[REDIS] Error with command "GET model-staging:0:0":
[QUERYPR...] No selection state found for query with user_id: 0 and label: model-staging

This is happening at least once a day now, and we have to re-deploy the model to fix it.
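
In case it helps, this is a quick way we can check whether that key exists at all in Clipper's Redis. The host, port, and the loop over db numbers are guesses about our deployment (the frontend log shows it issuing "SELECT 3" for the container table, so state is spread across several Redis DBs):

```python
# Hedged sketch: look for the selection-state key from the error message
# across the Redis DBs of the Clipper Redis instance (host/port assumed).
import redis

key = "model-staging:0:0"  # user_id 0, label model-staging, as in the error
for db in range(16):
    r = redis.StrictRedis(host="localhost", port=6379, db=db)
    val = r.get(key)
    if val is not None:
        print("found %s in db %d: %r" % (key, db, val))
        break
else:
    print("%s not present in any db" % key)
```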

Corey-Zumar commented 7 years ago

@rmdort #246 was merged. Please let me know what kind of log output you're encountering. I'll look into the Redis issue.

rmdort commented 7 years ago

The BadStatusLine error seems to be fixed.

But I am getting No connected models errors when using /predict. This happens very often. Why does the app disconnect from the model?

I always have to re-deploy the model to get it running again.

Are there any logs I should look at?

Corey-Zumar commented 7 years ago

@rmdort Can you send log output for a container that disconnects in the way you describe? Full log output from the query frontend would also be very helpful.

rmdort commented 7 years ago

Sorry for the late reply. The containers are always running, but Clipper's response says No connected models...

Actually, we have migrated from Clipper to Kubernetes. We were getting really low concurrency (only 15 concurrent connections) with Clipper, and every now and then Clipper disconnects from the model and complains No connected models, which has caused us a lot of problems in production.

Thanks for looking into this.

Corey-Zumar commented 7 years ago

No problem! Can you upload log output from the clipper/query_frontend container when this issue occurs, as well as log output from the container that clipper can't find? It would help a lot with the debugging process.

rmdort commented 7 years ago

This is from the Docker log,

right before it got disconnected from the model:

[16:48:44.376][info]       [REDIS] Successfully issued command "GET 59116d96300397120cfecdc0-staging:0:0"
[16:48:44.376][info]  [QUERYPR...] Found 1 tasks
[16:48:44.826][info]         [RPC] Found message to receive
[16:48:45.571][info]       [REDIS] Successfully issued command "GET 59116d96300397120cfecdc0-staging:0:0"
[16:48:45.571][info]  [QUERYPR...] Found 1 tasks
[16:48:47.265][info]         [RPC] Found message to receive
[16:48:48.429][info]         [RPC] Found message to receive
[16:48:49.830][info]         [RPC] Found message to receive
[16:48:52.269][info]         [RPC] Found message to receive
[16:48:53.109][info]       [REDIS] Successfully issued command "GET 596f7a9af983421bdd3632e2-staging:0:0"
[16:48:53.109][info]  [QUERYPR...] Found 1 tasks
[16:48:53.109][error] [TASKEXE...] Received task for unknown model: 596f7a9af983421bdd3632e2-staging : 1500482753
[16:48:53.109][error] [QUERYPR...] No connected models found for query with id: 29048

After re-deploying:

[16:49:26.584][info]       [REDIS] Successfully issued command "GET 596f7a9af983421bdd3632e2-staging:0:0"
[16:49:26.584][info]  [QUERYPR...] Found 1 tasks
[16:49:26.584][error] [TASKEXE...] Received task for unknown model: 596f7a9af983421bdd3632e2-staging : 1500482753
[16:49:26.584][error] [QUERYPR...] No connected models found for query with id: 29065
[16:49:27.297][info]         [RPC] Found message to receive
[16:49:27.599][info]         [RPC] Found message to receive
[16:49:27.599][info]         [RPC] Found message to receive
[16:49:27.599][info]         [RPC] New container connected
[16:49:27.599][info]         [RPC] Container added
[16:49:27.599][info]       [REDIS] Successfully issued command "SELECT 3"
[16:49:27.599][info]       [REDIS] MESSAGE: hset
[16:49:27.600][info]       [REDIS] Successfully issued command "HMSET 596f7a9af983421bdd3632e2-staging,1500482753,0 model_id 596f7a9af983421bdd3632e2-staging:1500482753 model_name 596f7a9af983421bdd3632e2-staging model_version 1500482753 model_replica_id 0 zmq_connection_id 27 batch_size 1 input_type strings"
[16:49:27.600][info]       [REDIS] Successfully issued command "SELECT 3"
[16:49:27.600][info]       [REDIS] Successfully issued command "HGETALL 596f7a9af983421bdd3632e2-staging,1500482753,0"
[16:49:27.600][info]       [REDIS]       model_id: 596f7a9af983421bdd3632e2-staging:1500482753
[16:49:27.600][info]       [REDIS]       model_name: 596f7a9af983421bdd3632e2-staging
[16:49:27.600][info]       [REDIS]       model_version: 1500482753
[16:49:27.600][info]       [REDIS]       model_replica_id: 0
[16:49:27.600][info]       [REDIS]       zmq_connection_id: 27
[16:49:27.600][info]       [REDIS]       batch_size: 1
[16:49:27.600][info]       [REDIS]       input_type: strings
[16:49:27.600][info]  [CONTAINERS] Adding new container - model: 596f7a9af983421bdd3632e2-staging, version: 1500482753, connection ID: 27, replica ID: 0, input_type: strings
[16:49:27.600][info]  [CONTAINERS] Creating new ModelContainer for model 596f7a9af983421bdd3632e2-staging:1500482753, id: 27
[16:49:27.600][info]  [THREADPOOL] Work queue created for model 596f7a9af983421bdd3632e2-staging:1500482753, replica 0
[16:49:27.600][info]  [CONTAINERS] 

If there is any other log file I need to look at, let me know.

Corey-Zumar commented 7 years ago

Do you have log files for the container with name 596f7a9af983421bdd3632e2-staging and version 1500482753?

RayTsui commented 6 years ago

I came across the same issue. I read through the explanation at http://clipper.ai/tutorials/troubleshooting/ and used clipper_conn.cm.get_num_replicas(name="hello-world", version=3) to check the number of model containers; the result is 0. It may be just as the link describes: the model container could not be started normally.

If there is any solution, please post it.
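
For reference, this is roughly how I am checking. The connection setup is a placeholder for my deployment; only get_num_replicas is the call mentioned above, everything else is an assumption:

```python
# Rough check (connection setup is a placeholder); clipper_conn.cm.get_num_replicas
# is the call referenced above, the rest is an assumption about the deployment.
from clipper_admin import ClipperConnection, DockerContainerManager

clipper_conn = ClipperConnection(DockerContainerManager())
clipper_conn.connect()

n = clipper_conn.cm.get_num_replicas(name="hello-world", version=3)
print("replicas for hello-world:3 =", n)
if n == 0:
    # The model container never came up (or crashed); its Docker logs are the
    # next place to look, e.g. `docker ps -a` and `docker logs <container>`.
    print("no running replicas -- check the model container's logs")
```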

dcrankshaw commented 6 years ago

There are many different reasons why your container cannot be started. Can you paste the logs from the crashed container?
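
If the model container runs under plain Docker, something along these lines can collect logs from the exited container. This is only a sketch using the Docker SDK for Python, and the name filter is a placeholder:

```python
# Hedged sketch: pull logs from stopped model containers, assuming they run
# under plain Docker (requires the `docker` Python SDK).
import docker

client = docker.from_env()
for c in client.containers.list(all=True, filters={"status": "exited"}):
    if "hello-world" in c.name:  # placeholder: match your model container's name
        print("==== logs for", c.name, "====")
        print(c.logs(tail=200).decode("utf-8", errors="replace"))
```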