pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

text_classification example not working #35

Closed: chauhang closed 4 years ago

chauhang commented 4 years ago

The text_classification example is not working. Adding the model works, but on scaling the workers one gets a 500 error with "failed to start workers".

After this error, TorchServe keeps trying to restart the workers and the logs are flooded with errors until one explicitly scales the model's workers back to 0.

Detailed error logs in console:

2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - No module named 'text_classifier'
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Backend worker process die.
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last):
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/envs/serve/lib/python3.8/site-packages/ts/model_service_worker.py", line 163, in
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     worker.run_server()
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/envs/serve/lib/python3.8/site-packages/ts/model_service_worker.py", line 141, in run_server
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     self.handle_connection(cl_socket)
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/envs/serve/lib/python3.8/site-packages/ts/model_service_worker.py", line 105, in handle_connection
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     service, result, code = self.load_model(msg)
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/envs/serve/lib/python3.8/site-packages/ts/model_service_worker.py", line 83, in load_model
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     service = model_loader.load(model_name, model_dir, handler, gpu, batch_size)
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/envs/serve/lib/python3.8/site-packages/ts/model_loader.py", line 107, in load
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     entry_point(None, service.context)
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/envs/serve/lib/python3.8/site-packages/ts/torch_handler/text_classifier.py", line 79, in handle
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     raise e
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/envs/serve/lib/python3.8/site-packages/ts/torch_handler/text_classifier.py", line 68, in handle
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     _service.initialize(context)
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/envs/serve/lib/python3.8/site-packages/ts/torch_handler/text_handler.py", line 20, in initialize
2020-02-16 02:35:53,352 [INFO ] epollEventLoopGroup-4-30 org.pytorch.serve.wlm.WorkerThread - 9007 Worker disconnected. WORKER_STARTED
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     self.source_vocab = torch.load(self.manifest['model']['sourceVocab'])
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/envs/serve/lib/python3.8/site-packages/torch/serialization.py", line 525, in load
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     with _open_file_like(f, 'rb') as opened_file:
2020-02-16 02:35:53,352 [INFO ] W-9007-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/envs/serve/lib/python3.8/site-packages/torch/serialization.py", line 212, in _open_file_like
2020-02-16 02:35:53,352 [DEBUG] W-9007-my_text_classifier_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
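The traceback shows the handler's initialize step passing self.manifest['model']['sourceVocab'] straight to torch.load(), which fails deep inside torch.serialization when the archive has no usable sourceVocab entry. A minimal sketch of a guarded version, assuming the manifest layout shown in the traceback (load_source_vocab is a hypothetical helper, not TorchServe API, and load_fn stands in for torch.load):

```python
def load_source_vocab(manifest: dict, load_fn):
    """Load the source vocab path from the manifest, failing fast with an
    actionable message when the .mar was built without --source-vocab."""
    vocab_path = manifest.get("model", {}).get("sourceVocab")
    if not vocab_path:
        raise RuntimeError(
            "No sourceVocab entry in the model archive manifest; rebuild the "
            ".mar with torch-model-archiver and pass --source-vocab"
        )
    return load_fn(vocab_path)
```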

harshbafna commented 4 years ago

@chauhang : We could not reproduce this issue in our local (Mac) environment. Here are the steps we followed:

(base) USL07109:~ harsh_bafna$ torchserve --start --model-store model_store

(base) USL07109:~ harsh_bafna$ curl -X POST "http://localhost:8081/models?url=my_text_classifier.mar"
{
  "status": "Model \"my_text_classifier\" registered"
}

(base) USL07109:~ harsh_bafna$ curl "http://localhost:8081/models"
{
  "models": [
    {
      "modelName": "my_text_classifier",
      "modelUrl": "my_text_classifier.mar"
    }
  ]
}

(base) USL07109:~ harsh_bafna$ curl -X PUT "http://localhost:8081/models/my_text_classifier?min_worker=3&synchronous=true"
{
  "status": "Workers scaled"
}

(base) USL07109:~ harsh_bafna$ curl -X POST http://127.0.0.1:8080/predictions/my_text_classifier -T serve/examples/text_classification/sample_text.txt
Sports
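For reference, the management-API flow in the transcript above can also be scripted. A minimal sketch using only the Python standard library; the endpoint paths and query parameters come from the transcript, while the helper names and the .mar name are ours for illustration:

```python
from urllib.parse import urlencode

MANAGEMENT = "http://localhost:8081"  # default TorchServe management port

def register_url(mar: str) -> str:
    # POST this URL to register an archive from the model store (or an S3 URL)
    return f"{MANAGEMENT}/models?{urlencode({'url': mar})}"

def scale_url(model: str, min_worker: int) -> str:
    # PUT this URL to scale workers; synchronous=true blocks until scaling completes
    params = urlencode({"min_worker": min_worker, "synchronous": "true"})
    return f"{MANAGEMENT}/models/{model}?{params}"
```

Sending the actual POST/PUT requests (e.g. with urllib.request) against these URLs reproduces the curl calls above.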

Could you please share the steps you followed, or the MAR file you used?

chauhang commented 4 years ago

The model archive was created using the command below:

torch-model-archiver --model-name my_text_classifier --version 1.0 --model-file ../serve/examples/text_classification/model.py --serialized-file ../serve/examples/text_classification/model.pt --source-vocab ../serve/examples/text_classification/source_vocab.pt --handler text_classifier --extra-files ../serve/examples/text_classification/index_to_name.json

The mar file: https://sagemaker-pt.s3-us-west-2.amazonaws.com/my_text_classifier.mar

Added via the Management API using the s3 url.

harshbafna commented 4 years ago

@chauhang : Thanks for sharing the mar file; we were able to reproduce and fix the issue. I will update here as soon as the PR is merged.

fbbradheintz commented 4 years ago

This works for me, but I had to change a bunch of paths in the instructions. Will edit.

mycpuorg commented 4 years ago

https://github.com/pytorch/serve/pull/44 should fix this.

harshbafna commented 4 years ago

> The model archive was created using below command:
>
> torch-model-archiver --model-name my_text_classifier --version 1.0 --model-file ../serve/examples/text_classification/model.py --serialized-file ../serve/examples/text_classification/model.pt --source-vocab ../serve/examples/text_classification/source_vocab.pt --handler text_classifier --extra-files ../serve/examples/text_classification/index_to_name.json
>
> The mar file: https://sagemaker-pt.s3-us-west-2.amazonaws.com/my_text_classifier.mar
>
> Added via the Management API using the s3 url.

Please note that with this fix (#44), you will need to re-create the mar file.

chauhang commented 4 years ago

Still failing for me. I recreated the mar file, deployed it via the management API, and scaled to 1 worker. On running the prediction I get the error "No module named text_classifier".

harshbafna commented 4 years ago

@chauhang : "No module named text_classifier" is a misleading error message and has already been fixed as part of issue #42.

However, you should still be able to run the inference on this model. Could you please confirm?

chauhang commented 4 years ago

@harshbafna Still getting the errors below; prediction fails with "AttributeError: 'str' object has no attribute 'decode'".

Model details:

[
  {
    "modelName": "my_text_classifier",
    "modelVersion": "1.0",
    "modelUrl": "https://my-bucketname.s3-us-west-2.amazonaws.com/my_text_classifier.mar",
    "runtime": "python",
    "minWorkers": 1,
    "maxWorkers": 1,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "loadedAtStartup": false,
    "workers": [
      {
        "id": "9000",
        "startTime": "2020-02-22T01:07:56.031Z",
        "status": "READY",
        "gpu": false,
        "memoryUsage": 633675776
      }
    ]
  }
]

On running the prediction:

{
  "code": 503,
  "type": "InternalServerException",
  "message": "Prediction failed"
}

In the TorchServe console logs:

2020-02-22 01:08:13,992 [INFO ] W-9000-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/envs/serve/lib/python3.8/site-packages/ts/service.py", line 100, in predict
2020-02-22 01:08:13,992 [INFO ] W-9000-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     ret = self._entry_point(input_batch, self.context)
2020-02-22 01:08:13,992 [INFO ] W-9000-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/envs/serve/lib/python3.8/site-packages/ts/torch_handler/text_classifier.py", line 78, in handle
2020-02-22 01:08:13,992 [INFO ] W-9000-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     raise e
2020-02-22 01:08:13,993 [INFO ] W-9000-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/envs/serve/lib/python3.8/site-packages/ts/torch_handler/text_classifier.py", line 72, in handle
2020-02-22 01:08:13,993 [INFO ] W-9000-my_text_classifier_1.0 ACCESS_LOG - /67.161.9.204:65476 "POST /predictions/my_text_classifier HTTP/1.1" 503 12
2020-02-22 01:08:13,993 [INFO ] W-9000-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     data = _service.preprocess(data)
2020-02-22 01:08:13,993 [INFO ] W-9000-my_text_classifier_1.0 TS_METRICS - Requests5XX.Count:1|#Level:Host|#hostname:ip-172-31-39-207,timestamp:null
2020-02-22 01:08:13,993 [INFO ] W-9000-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/envs/serve/lib/python3.8/site-packages/ts/torch_handler/text_classifier.py", line 32, in preprocess
2020-02-22 01:08:13,993 [INFO ] W-9000-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     text = text.decode('utf-8')
2020-02-22 01:08:13,993 [DEBUG] W-9000-my_text_classifier_1.0 org.pytorch.serve.wlm.Job - Waiting time: 0, Inference time: 3
2020-02-22 01:08:13,993 [INFO ] W-9000-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - AttributeError: 'str' object has no attribute 'decode'
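The failure is in preprocess, which assumes the request body always arrives as bytes and calls .decode() on it. A defensive sketch of such a step (our own helper, not the shipped handler code) that accepts either type:

```python
def to_text(data) -> str:
    """Return the request body as str, whether it arrived as raw bytes or
    was already decoded upstream (the case behind the AttributeError)."""
    if isinstance(data, (bytes, bytearray)):
        return bytes(data).decode("utf-8")
    return data
```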

harshbafna commented 4 years ago

@chauhang : Looks like the text_classifier default handler is not able to parse the input text. Could you please share the input used?

chauhang commented 4 years ago

@harshbafna It was the sample text provided as part of the example https://github.com/pytorch/serve/blob/master/examples/text_classification/sample_text.txt

harshbafna commented 4 years ago

@chauhang The following command works fine for me:

curl -X POST http://127.0.0.1:8080/predictions/my_tc -T examples/text_classification/sample_text.txt
Sports

fbbradheintz commented 4 years ago

A bug has been introduced in line 50 of text_classifier.py:

2020-03-09 19:58:15,084 [INFO ] W-9003-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Invoking custom service failed.
2020-03-09 19:58:15,085 [INFO ] W-9003-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last):
2020-03-09 19:58:15,085 [INFO ] W-9003-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ts/service.py", line 100, in predict
2020-03-09 19:58:15,085 [INFO ] W-9003-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     ret = self._entry_point(input_batch, self.context)
2020-03-09 19:58:15,085 [INFO ] W-9003-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ts/torch_handler/text_classifier.py", line 78, in handle
2020-03-09 19:58:15,085 [INFO ] W-9003-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     raise e
2020-03-09 19:58:15,085 [INFO ] W-9003-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 72
2020-03-09 19:58:15,085 [INFO ] W-9003-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ts/torch_handler/text_classifier.py", line 73, in handle
2020-03-09 19:58:15,085 [INFO ] W-9003-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     data = _service.inference(data)
2020-03-09 19:58:15,086 [INFO ] W-9003-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ts/torch_handler/text_classifier.py", line 50, in inference
2020-03-09 19:58:15,086 [INFO ] W-9003-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     output = self.model.forward(inputs, torch.tensor([0].to(self.device)))
2020-03-09 19:58:15,086 [INFO ] W-9003-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - AttributeError: 'list' object has no attribute 'to'

That line of code will fail every time, and clearly was not tested before being checked in.
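The fix is a one-character move of a closing parenthesis, and plain Python is enough to demonstrate why the original expression can never work:

```python
# Buggy (text_classifier.py line 50):
#     output = self.model.forward(inputs, torch.tensor([0].to(self.device)))
# Here .to() is invoked on the Python list [0] before torch.tensor ever runs.
# Fixed:
#     output = self.model.forward(inputs, torch.tensor([0]).to(self.device))
# Build the tensor first, then move it to the target device.

# A plain list has no .to attribute, which is exactly the AttributeError logged above:
assert not hasattr([0], "to")
```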

harshbafna commented 4 years ago

@fbbradheintz Fixed and tested in stage_release branch.

Commit : https://github.com/pytorch/serve/commit/12c9102cd2f33172f7c23af658754f2e72aab97d

fbbradheintz commented 4 years ago

Works now. Please close after merging.