pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

SSDLite Object Detection doesn't work with object_detector handler #2043

Open chandan-labelfuse opened 1 year ago

chandan-labelfuse commented 1 year ago

🐛 Describe the bug

I am trying to serve the ssdlite320_mobilenet_v3_large model with TorchServe. I created a custom model.py file, downloaded the weights, and ran torch-model-archiver. However, the worker dies at startup with an error while loading the state_dict into the model.

Error logs

2022-12-21T16:41:30,460 [INFO ] W-9000-ssdlitemobilenet_1.0-stdout MODEL_LOG - raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
2022-12-21T16:41:30,460 [INFO ] W-9000-ssdlitemobilenet_1.0-stdout MODEL_LOG - RuntimeError: Error(s) in loading state_dict for SSDLiteObjectDetector:
2022-12-21T16:41:30,459 [INFO ] epollEventLoopGroup-5-8 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2022-12-21T16:41:30,461 [DEBUG] W-9000-ssdlitemobilenet_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2022-12-21T16:41:30,461 [DEBUG] W-9000-ssdlitemobilenet_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:191) [model-server.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:829) [?:?]
2022-12-21T16:41:30,462 [WARN ] W-9000-ssdlitemobilenet_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: ssdlitemobilenet, error: Worker died.
2022-12-21T16:41:30,462 [DEBUG] W-9000-ssdlitemobilenet_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-ssdlitemobilenet_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2022-12-21T16:41:30,462 [WARN ] W-9000-ssdlitemobilenet_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-ssdlitemobilenet_1.0-stderr
2022-12-21T16:41:30,463 [WARN ] W-9000-ssdlitemobilenet_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-ssdlitemobilenet_1.0-stdout
2022-12-21T16:41:30,463 [INFO ] W-9000-ssdlitemobilenet_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 21 seconds.
2022-12-21T16:41:30,462 [INFO ] W-9000-ssdlitemobilenet_1.0-stdout MODEL_LOG - Missing key(s) in state_dict: "backbone.body.0.0.weight", "backbone.body.0.1.weight", "backbone.body.0.1.bias", "backbone.body.0.1.running_mean", "backbone.body.0.1.running_var", "backbone.body.1.block.0.0.weight", "backbone.body.1.block.0.1.weight", "backbone.body.1.block.0.1.bias", "backbone.body.1.block.0.1.running_mean", "backbone.body.1.block.0.1.running_var", "backbone.body.1.block.1.0.weight", "backbone.body.1.block.1.1.weight", "backbone.body.1.block.1.1.bias", "backbone.body.1.block.1.1.running_mean", "backbone.body.1.block.1.1.running_var", "backbone.body.2.block.0.0.weight", "backbone.body.2.block.0.1.weight", "backbone.body.2.block.0.1.bias", "backbone.body.2.block.0.1.running_mean", "backbone.body.2.block.0.1.running_var", "backbone.body.2.block.1.0.weight", "backbone.body.2.block.1.1.weight", "backbone.body.2.block.1.1.bias", "backbone.body.2.block.1.1.running_mean", "backbone.body.2.block.1.1.running_var", "backbone.body.2.block.2.0.weight", "backbone.body.2.block.2.1.weight", "backbone.body.2.block.2.1.bias", "backbone.body.2.block.2.1.running_mean", "backbone.body.2.block.2.1.running_var", "backbone.body.3.block.0.0.weight", "backbone.body.3.block.0.1.weight", "backbone.body.3.block.0.1.bias", "backbone.body.3.block.0.1.running_mean", 
"backbone.body.3.block.0.1.running_var", "backbone.body.3.block.1.0.weight", "backbone.body.3.block.1.1.weight", "backbone.body.3.block.1.1.bias", "backbone.body.3.block.1.1.running_mean", "backbone.body.3.block.1.1.running_var", "backbone.body.3.block.2.0.weight", "backbone.body.3.block.2.1.weight", "backbone.body.3.block.2.1.bias", "backbone.body.3.block.2.1.running_mean", "backbone.body.3.block.2.1.running_var", "backbone.body.4.block.0.0.weight", "backbone.body.4.block.0.1.weight", "backbone.body.4.block.0.1.bias", "backbone.body.4.block.0.1.running_mean", "backbone.body.4.block.0.1.running_var", "backbone.body.4.block.1.0.weight", "backbone.body.4.block.1.1.weight", "backbone.body.4.block.1.1.bias", "backbone.body.4.block.1.1.running_mean", "backbone.body.4.block.1.1.running_var", "backbone.body.4.block.2.fc1.weight", "backbone.body.4.block.2.fc1.bias", "backbone.body.4.block.2.fc2.weight", "backbone.body.4.block.2.fc2.bias", "backbone.body.4.block.3.0.weight", "backbone.body.4.block.3.1.weight", "backbone.body.4.block.3.1.bias", "backbone.body.4.block.3.1.running_mean", "backbone.body.4.block.3.1.running_var", "backbone.body.5.block.0.0.weight", "backbone.body.5.block.0.1.weight", "backbone.body.5.block.0.1.bias", "backbone.body.5.block.0.1.running_mean", "backbone.body.5.block.0.1.running_var", "backbone.body.5.block.1.0.weight", "backbone.body.5.block.1.1.weight", "backbone.body.5.block.1.1.bias", "backbone.body.5.block.1.1.running_mean", "backbone.body.5.block.1.1.running_var", "backbone.body.5.block.2.fc1.weight", "backbone.body.5.block.2.fc1.bias", "backbone.body.5.block.2.fc2.weight", "backbone.body.5.block.2.fc2.bias", "backbone.body.5.block.3.0.weight", "backbone.body.5.block.3.1.weight", "backbone.body.5.block.3.1.bias", "backbone.body.5.block.3.1.running_mean", "backbone.body.5.block.3.1.running_var", "backbone.body.6.block.0.0.weight", "backbone.body.6.block.0.1.weight", "backbone.body.6.block.0.1.bias", "backbone.body.6.block.0.1.running_mean", 
"backbone.body.6.block.0.1.running_var", "backbone.body.6.block.1.0.weight", "backbone.body.6.block.1.1.weight", "backbone.body.6.block.1.1.bias", "backbone.body.6.block.1.1.running_mean", "backbone.body.6.block.1.1.running_var", "backbone.body.6.block.2.fc1.weight", "backbone.body.6.block.2.fc1.bias", "backbone.body.6.block.2.fc2.weight", "backbone.body.6.block.2.fc2.bias", "backbone.body.6.block.3.0.weight", "backbone.body.6.block.3.1.weight", "backbone.body.6.block.3.1.bias", "backbone.body.6.block.3.1.running_mean", "backbone.body.6.block.3.1.running_var", "backbone.body.7.block.0.0.weight", "backbone.body.7.block.0.1.weight", "backbone.body.7.block.0.1.bias", "backbone.body.7.block.0.1.running_mean", "backbone.body.7.block.0.1.running_var", "backbone.body.7.block.1.0.weight", "backbone.body.7.block.1.1.weight", "backbone.body.7.block.1.1.bias", "backbone.body.7.block.1.1.running_mean", "backbone.body.7.block.1.1.running_var", "backbone.body.7.block.2.0.weight", "backbone.body.7.block.2.1.weight", "backbone.body.7.block.2.1.bias", "backbone.body.7.block.2.1.running_mean", "backbone.body.7.block.2.1.running_var", "backbone.body.8.block.0.0.weight", "backbone.body.8.block.0.1.weight", "backbone.body.8.block.0.1.bias", "backbone.body.8.block.0.1.running_mean", "backbone.body.8.block.0.1.running_var", "backbone.body.8.block.1.0.weight", "backbone.body.8.block.1.1.weight", "backbone.body.8.block.1.1.bias", "backbone.body.8.block.1.1.running_mean", "backbone.body.8.block.1.1.running_var", "backbone.body.8.block.2.0.weight", "backbone.body.8.block.2.1.weight", "backbone.body.8.block.2.1.bias", "backbone.body.8.block.2.1.running_mean", "backbone.body.8.block.2.1.running_var", "backbone.body.9.block.0.0.weight", "backbone.body.9.block.0.1.weight", "backbone.body.9.block.0.1.bias", "backbone.body.9.block.0.1.running_mean", "backbone.body.9.block.0.1.running_var", "backbone.body.9.block.1.0.weight", "backbone.body.9.block.1.1.weight", "backbone.body.9.block.1.1.bias", 
"backbone.body.9.block.1.1.running_mean", "backbone.body.9.block.1.1.running_var", "backbone.body.9.block.2.0.weight", "backbone.body.9.block.2.1.weight", "backbone.body.9.block.2.1.bias", "backbone.body.9.block.2.1.running_mean", "backbone.body.9.block.2.1.running_var", "backbone.body.10.block.0.0.weight", "backbone.body.10.block.0.1.weight", "backbone.body.10.block.0.1.bias", "backbone.body.10.block.0.1.running_mean", "backbone.body.10.block.0.1.running_var", "backbone.body.10.block.1.0.weight", "backbone.body.10.block.1.1.weight", "backbone.body.10.block.1.1.bias", "backbone.body.10.block.1.1.running_mean", "backbone.body.10.block.1.1.running_var", "backbone.body.10.block.2.0.weight", "backbone.body.10.block.2.1.weight", "backbone.body.10.block.2.1.bias", "backbone.body.10.block.2.1.running_mean", "backbone.body.10.block.2.1.running_var", "backbone.body.11.block.0.0.weight", "backbone.body.11.block.0.1.weight", "backbone.body.11.block.0.1.bias", "backbone.body.11.block.0.1.running_mean", "backbone.body.11.block.0.1.running_var", "backbone.body.11.block.1.0.weight", "backbone.body.11.block.1.1.weight", "backbone.body.11.block.1.1.bias", "backbone.body.11.block.1.1.running_mean", "backbone.body.11.block.1.1.running_var", "backbone.body.11.block.2.fc1.weight", "backbone.body.11.block.2.fc1.bias", "backbone.body.11.block.2.fc2.weight", "backbone.body.11.block.2.fc2.bias", "backbone.body.11.block.3.0.weight", "backbone.body.11.block.3.1.weight", "backbone.body.11.block.3.1.bias", "backbone.body.11.block.3.1.running_mean", "backbone.body.11.block.3.1.running_var", "backbone.body.12.block.0.0.weight", "backbone.body.12.block.0.1.weight", "backbone.body.12.block.0.1.bias", "backbone.body.12.block.0.1.running_mean", "backbone.body.12.block.0.1.running_var", "backbone.body.12.block.1.0.weight", "backbone.body.12.block.1.1.weight", "backbone.body.12.block.1.1.bias", "backbone.body.12.block.1.1.running_mean", "backbone.body.12.block.1.1.running_var", 
"backbone.body.12.block.2.fc1.weight", "backbone.body.12.block.2.fc1.bias", "backbone.body.12.block.2.fc2.weight", "backbone.body.12.block.2.fc2.bias", "backbone.body.12.block.3.0.weight", "backbone.body.12.block.3.1.weight", "backbone.body.12.block.3.1.bias", "backbone.body.12.block.3.1.running_mean", "backbone.body.12.block.3.1.running_var", "backbone.body.13.block.0.0.weight", "backbone.body.13.block.0.1.weight", "backbone.body.13.block.0.1.bias", "backbone.body.13.block.0.1.running_mean", "backbone.body.13.block.0.1.running_var", "backbone.body.13.block.1.0.weight", "backbone.body.13.block.1.1.weight", "backbone.body.13.block.1.1.bias", "backbone.body.13.block.1.1.running_mean", "backbone.body.13.block.1.1.running_var", "backbone.body.13.block.2.fc1.weight", "backbone.body.13.block.2.fc1.bias", "backbone.body.13.block.2.fc2.weight", "backbone.body.13.block.2.fc2.bias", "backbone.body.13.block.3.0.weight", "backbone.body.13.block.3.1.weight", "backbone.body.13.block.3.1.bias", "backbone.body.13.block.3.1.running_mean", "backbone.body.13.block.3.1.running_var", "backbone.body.14.block.0.0.weight", "backbone.body.14.block.0.1.weight", "backbone.body.14.block.0.1.bias", "backbone.body.14.block.0.1.running_mean", "backbone.body.14.block.0.1.running_var", "backbone.body.14.block.1.0.weight", "backbone.body.14.block.1.1.weight", "backbone.body.14.block.1.1.bias", "backbone.body.14.block.1.1.running_mean", "backbone.body.14.block.1.1.running_var", "backbone.body.14.block.2.fc1.weight", "backbone.body.14.block.2.fc1.bias", "backbone.body.14.block.2.fc2.weight", "backbone.body.14.block.2.fc2.bias", "backbone.body.14.block.3.0.weight", "backbone.body.14.block.3.1.weight", "backbone.body.14.block.3.1.bias", "backbone.body.14.block.3.1.running_mean", "backbone.body.14.block.3.1.running_var", "backbone.body.15.block.0.0.weight", "backbone.body.15.block.0.1.weight", "backbone.body.15.block.0.1.bias", "backbone.body.15.block.0.1.running_mean", 
"backbone.body.15.block.0.1.running_var", "backbone.body.15.block.1.0.weight", "backbone.body.15.block.1.1.weight", "backbone.body.15.block.1.1.bias", "backbone.body.15.block.1.1.running_mean", "backbone.body.15.block.1.1.running_var", "backbone.body.15.block.2.fc1.weight", "backbone.body.15.block.2.fc1.bias", "backbone.body.15.block.2.fc2.weight", "backbone.body.15.block.2.fc2.bias", "backbone.body.15.block.3.0.weight", "backbone.body.15.block.3.1.weight", "backbone.body.15.block.3.1.bias", "backbone.body.15.block.3.1.running_mean", "backbone.body.15.block.3.1.running_var", "backbone.body.16.0.weight", "backbone.body.16.1.weight", "backbone.body.16.1.bias", "backbone.body.16.1.running_mean", "backbone.body.16.1.running_var", "backbone.fpn.inner_blocks.0.0.weight", "backbone.fpn.inner_blocks.0.0.bias", "backbone.fpn.inner_blocks.1.0.weight", "backbone.fpn.inner_blocks.1.0.bias", "backbone.fpn.layer_blocks.0.0.weight", "backbone.fpn.layer_blocks.0.0.bias", "backbone.fpn.layer_blocks.1.0.weight", "backbone.fpn.layer_blocks.1.0.bias".

Installation instructions

Installed TorchServe using:

python ./ts_scripts/install_dependencies.py --cuda=cu114
pip install torchserve torch-model-archiver torch-workflow-archiver

Model Packaging

Built a model.py for SSDLite using https://github.com/pytorch/serve/blob/master/examples/object_detector/fast-rcnn/model.py as reference

from torchvision.models.detection.ssd import SSD
from torchvision.models.detection.ssdlite import SSDLiteHead
from torchvision.models.detection.backbone_utils import mobilenet_backbone
from torchvision.models.detection.anchor_utils import DefaultBoxGenerator
import torchvision.models.detection._utils as det_utils

from torch import nn

from functools import partial

class SSDLiteObjectDetector(SSD):
    def __init__(self, num_classes=91, **kwargs):
        backbone = mobilenet_backbone('mobilenet_v3_large', True, True)
        size = (320, 320)

        out_channels = det_utils.retrieve_out_channels(backbone, size)

        anchor_generator = DefaultBoxGenerator([[2, 3] for _ in range(6)], min_ratio=0.2, max_ratio=0.95)
        num_anchors = anchor_generator.num_anchors_per_location()

        norm_layer = partial(nn.BatchNorm2d, eps=0.001, momentum=0.03)
        head = SSDLiteHead(out_channels, num_anchors, num_classes, norm_layer)

        super(SSDLiteObjectDetector, self).__init__(backbone=backbone, anchor_generator=anchor_generator, num_classes=num_classes, size=size, head=head, **kwargs)

config.properties

None

Versions


Environment headers

Torchserve branch:

torchserve==0.7.0 torch-model-archiver==0.7.0

Python version: 3.9 (64-bit runtime)
Python executable: /home/chandan/anaconda3/envs/torch-stream/bin/python

Versions of relevant python libraries:
captum==0.5.0
future==0.18.2
numpy==1.24.0
nvgpu==0.9.0
psutil==5.9.4
requests==2.28.1
torch==1.13.1
torch-model-archiver==0.7.0
torch-workflow-archiver==0.2.6
torchaudio==0.13.1
torchserve==0.7.0
torchvision==0.14.1
wheel==0.38.4
torch==1.13.1
**Warning: torchtext not present ..
torchvision==0.14.1
torchaudio==0.13.1

Java Version:

OS: Ubuntu 18.04.6 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: N/A
CMake version: version 3.10.2

Is CUDA available: Yes
CUDA runtime version: N/A
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1050
Nvidia driver version: 470.161.03
cuDNN version: None

Repro instructions

Downloaded the weights of the model from https://download.pytorch.org/models/ssdlite320_mobilenet_v3_large_coco-a79551df.pth.

Created the .mar file using the model.py file given above, copied the created .mar file to model_store

torch-model-archiver --model-name ssdlitemobilenet --version 1.0  --serialized-file ssdlite/ssdlite320_mobilenet_v3_large_coco-a79551df.pth  --model-file ssdlite/model.py  --extra-files ssdlite/index_to_name.json --handler object_detector

Ran the server with

torchserve --start --model-store model_store --models ssdlitemobilenet=ssdlitemobilenet.mar

Possible Solution

I am not sure why the weights aren't compatible with the SSDLite architecture. My guess is that I am not building the model skeleton correctly in model.py. The docs don't describe how to write the model file for object detection models other than FastRCNN, which I used as a reference. Any help correcting this would be much appreciated.

agunapal commented 1 year ago

@chandan-labelfuse This looks like a mismatch between your model definition and the weights. Did you try loading the weights into your model with standalone PyTorch?
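One way to run that standalone check is to compare the parameter names the model defines with the names stored in the checkpoint. The sketch below is only an illustration: `diff_state_dict_keys` is a hypothetical helper, not part of PyTorch or TorchServe, and the example key names are made up. It works on plain strings so it runs without PyTorch; the comments show where the real key lists would come from.

```python
from typing import Iterable, List, Tuple


def diff_state_dict_keys(
    model_keys: Iterable[str], checkpoint_keys: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) key lists.

    missing:    names the model defines that the checkpoint lacks
    unexpected: names the checkpoint holds that the model does not define
    """
    model_set, ckpt_set = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_set - ckpt_set)
    unexpected = sorted(ckpt_set - model_set)
    return missing, unexpected


# With PyTorch installed, the inputs would come from (not executed here):
#   model_keys = SSDLiteObjectDetector().state_dict().keys()
#   checkpoint_keys = torch.load(path_to_pth, map_location="cpu").keys()

# Hypothetical key names, for illustration only.
example_model = ["backbone.body.0.0.weight", "head.example.bias"]
example_ckpt = ["backbone.other.0.0.weight", "head.example.bias"]

missing, unexpected = diff_state_dict_keys(example_model, example_ckpt)
print("missing:", missing)
print("unexpected:", unexpected)
```

If both lists come back empty for the real model and checkpoint, the definition in model.py and the weights agree on parameter names; any non-empty list points at the submodule that differs.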

chandan-labelfuse commented 1 year ago

@agunapal I loaded the weights with the script below, and downloaded the same weights to package with torch-model-archiver.

from torchvision.models.detection import ssdlite320_mobilenet_v3_large
from torchvision.models.detection import SSDLite320_MobileNet_V3_Large_Weights

weights = SSDLite320_MobileNet_V3_Large_Weights.DEFAULT
model = ssdlite320_mobilenet_v3_large(weights=weights)

I am skeptical about my definition of SSDLite in model.py. Could one of you verify the implementation? Compared to the Faster R-CNN example, SSD is a bit more complicated since it requires more parameters to be passed.
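To narrow down which part of the definition to verify, it may help to group the missing keys from the error log by their leading name components. In the log above, every missing key starts with `backbone.body` or `backbone.fpn`, which suggests looking at how the backbone is built in model.py (the `mobilenet_backbone(..., True, True)` call); that reading is an assumption, not a confirmed diagnosis. The helper below, `count_key_prefixes`, is a hypothetical name and uses only the standard library.

```python
import re
from collections import Counter
from typing import Dict


def count_key_prefixes(error_text: str, depth: int = 2) -> Dict[str, int]:
    """Count double-quoted parameter names by their first `depth` dotted parts."""
    keys = re.findall(r'"([^"]+)"', error_text)
    prefixes = (".".join(key.split(".")[:depth]) for key in keys)
    return dict(Counter(prefixes))


# A short excerpt in the same shape as the error log in this issue.
sample = (
    'Missing key(s) in state_dict: "backbone.body.0.0.weight", '
    '"backbone.body.0.1.weight", "backbone.fpn.inner_blocks.0.0.weight".'
)

summary = count_key_prefixes(sample)
print(summary)
```

Pasting the full "Missing key(s)" message into `sample` gives a per-submodule count, which is easier to scan than the raw list.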

agunapal commented 1 year ago

Will get back to you