pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

FasterRCNN scripted fails on forward #932

Open lgvaz opened 3 years ago

lgvaz commented 3 years ago

Current behavior and steps to reproduce

I'm following the object detection example with Faster R-CNN; the only difference is that I'm using a scripted model instead of eager mode. This is what I'm doing:

import torch
from torchvision.models.detection.faster_rcnn import fasterrcnn_resnet50_fpn

# Script the model and save it as a TorchScript archive.
model = fasterrcnn_resnet50_fpn()
scripted_model = torch.jit.script(model)
torch.jit.save(scripted_model, 'scripted_fasterrcnn_resnet50_fpn.pt')
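
As a quick local sanity check before archiving (a minimal sketch; the input shape is illustrative), the scripted model only accepts a list of CHW tensors, and its scripted forward returns a (losses, detections) tuple rather than detections alone:

scripted_model.eval()
with torch.no_grad():
    # forward() takes List[Tensor], i.e. a list of 3xHxW images in [0, 1].
    losses, detections = scripted_model([torch.rand(3, 300, 400)])
print(detections[0].keys())  # boxes, labels, scores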

I create the .mar file: torch-model-archiver --model-name coco50 --version 1.0 --serialized-file scripted_fasterrcnn_resnet50_fpn.pt --handler object_detector

And start torchserve with: torchserve --start --ncs --model-store model_store --models coco=coco50.mar
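
(The request itself is the tutorial's curl upload, along the lines of curl http://127.0.0.1:8080/predictions/coco -T persons.jpg, where the image file name is illustrative.)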

But when making a request in the exact same way as described in the tutorial, I get:

RuntimeError: forward() Expected a value of type 'List[Tensor]' for argument 'images' but instead found type 'Tensor'.

Complete stack trace

Torchserve version: 0.3.0
TS Home: /home/lgvaz/miniconda3/envs/serve/lib/python3.8/site-packages
Current directory: /home/lgvaz/git/serve
Temp directory: /tmp
Number of GPUs: 1
Number of CPUs: 8
Max heap size: 2992 M
Python executable: /home/lgvaz/miniconda3/envs/serve/bin/python
Config file: N/A
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /home/lgvaz/git/serve/model_store
Initial Models: fastrcnn=fastrcnn_eager.mar,coco=coco50.mar
Log dir: /home/lgvaz/git/serve/logs
Metrics dir: /home/lgvaz/git/serve/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Metrics report format: prometheus
Enable metrics API: true
2020-12-29 18:03:57,741 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: fastrcnn_eager.mar
2020-12-29 18:04:01,358 [INFO ] main org.pytorch.serve.archive.ModelArchive - eTag 954ca742bc7244eabed3e55a7d8c794c
2020-12-29 18:04:01,372 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model fastrcnn
2020-12-29 18:04:01,373 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model fastrcnn
2020-12-29 18:04:01,373 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model fastrcnn loaded.
2020-12-29 18:04:01,373 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: fastrcnn, count: 1
2020-12-29 18:04:01,383 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: coco50.mar
2020-12-29 18:04:01,516 [INFO ] W-9000-fastrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9000
2020-12-29 18:04:01,517 [INFO ] W-9000-fastrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]8960
2020-12-29 18:04:01,518 [INFO ] W-9000-fastrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-12-29 18:04:01,518 [INFO ] W-9000-fastrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.8.5
2020-12-29 18:04:01,518 [DEBUG] W-9000-fastrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-fastrcnn_1.0 State change null -> WORKER_STARTED
2020-12-29 18:04:01,537 [INFO ] W-9000-fastrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2020-12-29 18:04:01,627 [INFO ] W-9000-fastrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9000.
2020-12-29 18:04:04,354 [INFO ] W-9000-fastrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Missing the index_to_name.json file. Inference output will not include class name.
2020-12-29 18:04:04,361 [INFO ] W-9000-fastrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 2682
2020-12-29 18:04:04,361 [DEBUG] W-9000-fastrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-fastrcnn_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2020-12-29 18:04:04,362 [INFO ] W-9000-fastrcnn_1.0 TS_METRICS - W-9000-fastrcnn_1.0.ms:2982|#Level:Host|#hostname:lgvaz-GA-78LMT-USB3,timestamp:1609275844
2020-12-29 18:04:04,362 [INFO ] W-9000-fastrcnn_1.0 TS_METRICS - WorkerThreadTime.ms:47|#Level:Host|#hostname:lgvaz-GA-78LMT-USB3,timestamp:null
2020-12-29 18:04:05,080 [INFO ] main org.pytorch.serve.archive.ModelArchive - eTag f78c7ec452d243e893396351b33d7c42
2020-12-29 18:04:05,081 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model coco
2020-12-29 18:04:05,081 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model coco
2020-12-29 18:04:05,082 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model coco loaded.
2020-12-29 18:04:05,082 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: coco, count: 1
2020-12-29 18:04:05,083 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2020-12-29 18:04:05,111 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2020-12-29 18:04:05,112 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2020-12-29 18:04:05,114 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2020-12-29 18:04:05,115 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2020-12-29 18:04:05,116 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
2020-12-29 18:04:05,195 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9001
2020-12-29 18:04:05,196 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]8977
2020-12-29 18:04:05,196 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-12-29 18:04:05,196 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.8.5
2020-12-29 18:04:05,196 [DEBUG] W-9001-coco_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-coco_1.0 State change null -> WORKER_STARTED
2020-12-29 18:04:05,197 [INFO ] W-9001-coco_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9001
2020-12-29 18:04:05,200 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9001.
Model server started.
2020-12-29 18:04:05,429 [INFO ] pool-2-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:lgvaz-GA-78LMT-USB3,timestamp:1609275845
2020-12-29 18:04:05,431 [INFO ] pool-2-thread-1 TS_METRICS - DiskAvailable.Gigabytes:193.8439178466797|#Level:Host|#hostname:lgvaz-GA-78LMT-USB3,timestamp:1609275845
2020-12-29 18:04:05,432 [INFO ] pool-2-thread-1 TS_METRICS - DiskUsage.Gigabytes:222.42626953125|#Level:Host|#hostname:lgvaz-GA-78LMT-USB3,timestamp:1609275845
2020-12-29 18:04:05,432 [INFO ] pool-2-thread-1 TS_METRICS - DiskUtilization.Percent:53.4|#Level:Host|#hostname:lgvaz-GA-78LMT-USB3,timestamp:1609275845
2020-12-29 18:04:05,433 [INFO ] pool-2-thread-1 TS_METRICS - MemoryAvailable.Megabytes:5869.6953125|#Level:Host|#hostname:lgvaz-GA-78LMT-USB3,timestamp:1609275845
2020-12-29 18:04:05,434 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUsed.Megabytes:5476.75|#Level:Host|#hostname:lgvaz-GA-78LMT-USB3,timestamp:1609275845
2020-12-29 18:04:05,434 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUtilization.Percent:50.9|#Level:Host|#hostname:lgvaz-GA-78LMT-USB3,timestamp:1609275845
2020-12-29 18:04:06,571 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Missing the index_to_name.json file. Inference output will not include class name.
2020-12-29 18:04:06,572 [INFO ] W-9001-coco_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 1352
2020-12-29 18:04:06,572 [DEBUG] W-9001-coco_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-coco_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2020-12-29 18:04:06,572 [INFO ] W-9001-coco_1.0 TS_METRICS - W-9001-coco_1.0.ms:1490|#Level:Host|#hostname:lgvaz-GA-78LMT-USB3,timestamp:1609275846
2020-12-29 18:04:06,573 [INFO ] W-9001-coco_1.0 TS_METRICS - WorkerThreadTime.ms:21|#Level:Host|#hostname:lgvaz-GA-78LMT-USB3,timestamp:null
2020-12-29 18:04:17,341 [INFO ] W-9000-fastrcnn_1.0-stdout MODEL_METRICS - HandlerTime.Milliseconds:6827.25|#ModelName:fastrcnn,Level:Model|#hostname:lgvaz-GA-78LMT-USB3,requestID:27421d97-1d90-496d-a8ad-aa5258b0a9fb,timestamp:1609275857
2020-12-29 18:04:17,341 [INFO ] W-9000-fastrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 6829
2020-12-29 18:04:17,342 [INFO ] W-9000-fastrcnn_1.0-stdout MODEL_METRICS - PredictionTime.Milliseconds:6827.31|#ModelName:fastrcnn,Level:Model|#hostname:lgvaz-GA-78LMT-USB3,requestID:27421d97-1d90-496d-a8ad-aa5258b0a9fb,timestamp:1609275857
2020-12-29 18:04:17,343 [INFO ] W-9000-fastrcnn_1.0 ACCESS_LOG - /127.0.0.1:54082 "PUT /predictions/fastrcnn HTTP/1.1" 200 6840
2020-12-29 18:04:17,344 [INFO ] W-9000-fastrcnn_1.0 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:lgvaz-GA-78LMT-USB3,timestamp:null
2020-12-29 18:04:17,344 [DEBUG] W-9000-fastrcnn_1.0 org.pytorch.serve.job.Job - Waiting time ns: 312882, Backend time ns: 6835027952
2020-12-29 18:04:17,345 [INFO ] W-9000-fastrcnn_1.0 TS_METRICS - QueueTime.ms:0|#Level:Host|#hostname:lgvaz-GA-78LMT-USB3,timestamp:null
2020-12-29 18:04:17,345 [INFO ] W-9000-fastrcnn_1.0 TS_METRICS - WorkerThreadTime.ms:7|#Level:Host|#hostname:lgvaz-GA-78LMT-USB3,timestamp:null
2020-12-29 18:04:21,745 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Invoking custom service failed.
2020-12-29 18:04:21,745 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last):
2020-12-29 18:04:21,745 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/lgvaz/miniconda3/envs/serve/lib/python3.8/site-packages/ts/service.py", line 100, in predict
2020-12-29 18:04:21,746 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     ret = self._entry_point(input_batch, self.context)
2020-12-29 18:04:21,746 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/lgvaz/miniconda3/envs/serve/lib/python3.8/site-packages/ts/torch_handler/base_handler.py", line 197, in handle
2020-12-29 18:04:21,746 [INFO ] W-9001-coco_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 42
2020-12-29 18:04:21,746 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     output = self.inference(data_preprocess)
2020-12-29 18:04:21,746 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/lgvaz/miniconda3/envs/serve/lib/python3.8/site-packages/ts/torch_handler/base_handler.py", line 157, in inference
2020-12-29 18:04:21,746 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     results = self.model(marshalled_data, *args, **kwargs)
2020-12-29 18:04:21,747 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/lgvaz/miniconda3/envs/serve/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
2020-12-29 18:04:21,747 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     result = self.forward(*input, **kwargs)
2020-12-29 18:04:21,747 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: forward() Expected a value of type 'List[Tensor]' for argument 'images' but instead found type 'Tensor'.
2020-12-29 18:04:21,747 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Position: 1
2020-12-29 18:04:21,747 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Value: tensor([[[[0.9765, 0.9490, 0.9490,  ..., 0.4980, 0.5020, 0.5098],
2020-12-29 18:04:21,748 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -           [0.9686, 0.9490, 0.9451,  ..., 0.4902, 0.4902, 0.4902],
2020-12-29 18:04:21,748 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -           [0.9725, 0.9647, 0.9608,  ..., 0.4902, 0.4824, 0.4784],
2020-12-29 18:04:21,748 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -           ...,
2020-12-29 18:04:21,748 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -           [0.1843, 0.1765, 0.1647,  ..., 0.1608, 0.1569, 0.1569],
2020-12-29 18:04:21,748 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -           [0.1882, 0.1765, 0.1686,  ..., 0.1647, 0.1451, 0.1647],
2020-12-29 18:04:21,748 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -           [0.1882, 0.1804, 0.1686,  ..., 0.1529, 0.1333, 0.1529]],
2020-12-29 18:04:21,748 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - 
2020-12-29 18:04:21,749 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -          [[0.9176, 0.8902, 0.8902,  ..., 0.5647, 0.5686, 0.5765],
2020-12-29 18:04:21,749 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -           [0.9137, 0.8941, 0.8902,  ..., 0.5569, 0.5569, 0.5569],
2020-12-29 18:04:21,749 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -           [0.9255, 0.9176, 0.9137,  ..., 0.5569, 0.5490, 0.5451],
2020-12-29 18:04:21,749 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -           ...,
2020-12-29 18:04:21,749 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -           [0.1804, 0.1725, 0.1608,  ..., 0.1608, 0.1569, 0.1569],
2020-12-29 18:04:21,749 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -           [0.1843, 0.1725, 0.1647,  ..., 0.1686, 0.1490, 0.1686],
2020-12-29 18:04:21,749 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -           [0.1843, 0.1765, 0.1647,  ..., 0.1569, 0.1373, 0.1569]],
2020-12-29 18:04:21,750 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - 
2020-12-29 18:04:21,750 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -          [[0.8902, 0.8627, 0.8627,  ..., 0.5412, 0.5451, 0.5529],
2020-12-29 18:04:21,750 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -           [0.8706, 0.8510, 0.8471,  ..., 0.5255, 0.5255, 0.5255],
2020-12-29 18:04:21,750 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -           [0.8627, 0.8549, 0.8510,  ..., 0.5176, 0.5098, 0.5059],
2020-12-29 18:04:21,750 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -           ...,
2020-12-29 18:04:21,750 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -           [0.2000, 0.1922, 0.1804,  ..., 0.1686, 0.1647, 0.1647],
2020-12-29 18:04:21,751 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -           [0.2039, 0.1922, 0.1843,  ..., 0.1765, 0.1569, 0.1765],
2020-12-29 18:04:21,751 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -           [0.2039, 0.1961, 0.1843,  ..., 0.1647, 0.1451, 0.1647]]]])
2020-12-29 18:04:21,751 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Declaration: forward(__torch__.torchvision.models.detection.faster_rcnn.FasterRCNN self, Tensor[] images, Dict(str, Tensor)[]? targets=None) -> ((Dict(str, Tensor), Dict(str, Tensor)[]))
2020-12-29 18:04:21,751 [INFO ] W-9001-coco_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Cast error details: Unable to cast Python instance to C++ type (compile in debug mode for details)
2020-12-29 18:04:21,757 [INFO ] W-9001-coco_1.0 ACCESS_LOG - /127.0.0.1:54084 "PUT /predictions/coco HTTP/1.1" 503 55
2020-12-29 18:04:21,758 [INFO ] W-9001-coco_1.0 TS_METRICS - Requests5XX.Count:1|#Level:Host|#hostname:lgvaz-GA-78LMT-USB3,timestamp:null
2020-12-29 18:04:21,758 [DEBUG] W-9001-coco_1.0 org.pytorch.serve.job.Job - Waiting time ns: 214774, Inference time ns: 55646956
2020-12-29 18:04:21,758 [INFO ] W-9001-coco_1.0 TS_METRICS - WorkerThreadTime.ms:14|#Level:Host|#hostname:lgvaz-GA-78LMT-USB3,timestamp:null
dhaniram-kshirsagar commented 3 years ago

@lgvaz Will look into it. In the meantime, can you check this: https://discuss.pytorch.org/t/torch-jit-trace-is-not-working-with-mask-rcnn/83244

lgvaz commented 3 years ago

Hi hi, any updates on this?

In hindsight, I should've written a message instead of just giving a thumbs up on your comment, but yes, that's exactly the problem I'm facing, and the solution described there (wrapping the image in a list) does solve the issue.

Now, how can this be addressed in TorchServe?
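
For reference, one workaround I can imagine on the TorchServe side is a custom handler that does the list wrapping before calling the scripted model. This is only a sketch under assumptions: the class name and the unbind-based wrapping are mine, ObjectDetector and the inference() hook come from ts.torch_handler, and I'm assuming the scripted forward returns the (losses, detections) tuple shown in the declaration above (postprocess may need matching tweaks):

# custom_handler.py (hypothetical sketch)
import torch
from ts.torch_handler.object_detector import ObjectDetector

class ScriptedObjectDetector(ObjectDetector):
    def inference(self, data, *args, **kwargs):
        # The scripted forward() is declared with List[Tensor] images,
        # so unbind the batched NCHW tensor into a list of CHW tensors.
        images = list(data.unbind(0)) if isinstance(data, torch.Tensor) else data
        with torch.no_grad():
            losses, detections = self.model(images)
        return detections

The archive would then be built with --handler custom_handler.py instead of the built-in object_detector.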

chauhang commented 3 years ago

@lgvaz Please use eager-mode deployment for now. TorchScript support for detectron models is coming as part of the PyTorch 1.8 release next week
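
(For eager mode, the archiver takes the model definition plus a state-dict checkpoint instead of the TorchScript file; file names here are illustrative: torch-model-archiver --model-name coco50 --version 1.0 --model-file model.py --serialized-file fasterrcnn_resnet50_fpn.pth --handler object_detector)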

lgvaz commented 3 years ago

TorchScript support for detectron models

@chauhang just to be clear, I'm using torchvision and not detectron

ozancaglayan commented 2 years ago

What's the situation on this? Did you try with recent versions of the toolkits, @lgvaz?