Postprocess bottleneck in model_handler.py

sangkyuleeKOR commented 2 years ago

Please have a look at FAQ's and Troubleshooting guide, your query may be already addressed.

Your issue may already be reported! Please search on the issue tracker before creating one.

Context

torchserve version: 0.5.1
torch-model-archiver version: built from dockerfile
torch version: 1.10.0
torchvision version [if any]: 0.11.1
torchtext version [if any]: 0.11.0
torchaudio version [if any]: 0.10.0
java version: -
Operating System and version: ubuntu 18.04.5 LTS (Docker)

Your Environment

Installed using source? [yes/no]: no
Are you planning to deploy it using docker container? [yes/no]: yes
Is it a CPU or GPU environment?: GPU
Using a default/custom handler? [If possible upload/share custom handler/model]: custom handler
What kind of model is it e.g. vision, text, audio?: vision
Are you planning to use local models from model-store or public url being used e.g. from S3 bucket etc.? local [If public url then provide link.]:

Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs:

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/model-store
install_py_dep_per_model=True
load_models=resnet101.mar

Link to your project [if any]:

Expected Behavior

I'm trying to serve pytorch model using torchserve. I created custom model handler and hear is the code below.

# custom handler file

# model_handler.py

"""
ModelHandler defines a custom model handler.
"""
import os
import cv2
import time
import torch
import base64
import numpy as np
from PIL import Image
from model import Classifier
from ts.torch_handler.base_handler import BaseHandler

class ModelHandler(BaseHandler):
    """
    A custom model handler implementation.
    """

    def __init__(self):
        self._context = None
        self.initialized = False
        self.explain = False
        self.model = None
        self.target = 0

    def initialize(self, context):
        """
        Initialize model. This will be called during model loading time
        :param context: Initial context contains model server system properties.
        :return:
        """
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        self._context = context

        if not torch.cuda.is_available() or properties.get("gpu_id") is None :
            raise RuntimeError("This model is not supported on CPU machines.")

        self.device = torch.device("cuda:" + str(properties.get("gpu_id")))
        self.model = self._load_model(model_dir).to(self.device)
        torch.cuda.synchronize()
        self.initialized = True

    def _load_model(self, model_dir):
        self.model = Classifier(model_name='Unet')
        self.model = self.model.load_from_checkpoint(os.path.join(model_dir, 'dot_model.ckpt'))
        # self.model = torch.jit.trace(self.model, torch.rand(1, 3, 320, 320))
        return self.model.eval()

    def preprocess(self, data):
        """
        Transform raw input into model input data.
        :param batch: list of raw requests, should match batch size
        :return: list of preprocessed model input data
        """
        # Take the input data and make it inference ready
        bytes = data[0].get("data")
        if bytes is None:
            bytes = data[0].get("body")
        d_imgs = base64.decodebytes(bytes)
        img_array = np.frombuffer(d_imgs, dtype=np.uint8).reshape((6,80,320,3))
        images = np.empty((6,320,320,3))
        for i, img in enumerate(img_array):
            img = np.resize(img, (320,320,3)) / 255.
            images[i,:] = img
        images = torch.tensor(images, dtype=torch.float32, device='cuda')
        images = images.permute(0,3,1,2)
        return images

    def inference(self, model_input):
        """
        Internal inference methods
        :param model_input: transformed model input data
        :return: list of inference output in NDArray
        """
        # Do some inference call to engine here and return output
        with torch.no_grad():
            model_output = self.model.forward(model_input)

        return model_output

    def postprocess(self, inference_output):
        """
        Return inference result.
        :param inference_output: list of inference output
        :return: list of predict results
        """
        # Take output from network and post-process to desired format
        final = time.time()
        torch.cuda.synchronize()
        print("sync-time:", time.time()-final)

        s = time.time()
        masks = inference_output.squeeze()
        print("squeezing time:",time.time()-s)

        s = time.time()
        masks = masks.detach()
        print("detaching time:",time.time()-s)

        s = time.time()
        masks = masks.cpu()
        print("cpu time:",time.time()-s)

        s = time.time()
        masks = masks.numpy()
        print("numpying time:",time.time()-s)

        s = time.time()
        masks = np.where(masks >= 0.90, 0, 1)
        pixels = np.count_nonzero(masks, axis=(1,2))
        results = pixels < 2
        results = results.tolist()
        print("resulting-time:", time.time()-s)

        print("total-time:", time.time()-final)

        return results

    def handle(self, data, context):
        """
        Invoke by TorchServe for prediction request.
        Do pre-processing of data, prediction using model and postprocessing of prediciton output
        :param data: Input data for prediction
        :param context: Initial context contains model server system properties.
        :return: prediction output
        """
        s = time.time()
        model_input = self.preprocess(data)
        # print("preprocess time:",time.time()-s)
        s = time.time()
        model_output = self.inference(model_input)
        # print("inference time:",time.time()-s)
        s = time.time()
        return_value = self.postprocess(model_output)
        print("postprocess time:",time.time()-s)
        return [return_value]

Current Behavior

But, When I send a request to torchserve, Postprocess has huge amount of time. so i tried to figure out which part is problem, I tried "torch.cuda.synchronize()" it took a lot of time to do, so I delete it. after that, torch.tensor to "cpu" time increase. cpu sync

I don't know why postprocess took a lot of time. In my opinion something has wrong in my setting. Please let me now If i'm wrong.

Possible Solution

Steps to Reproduce

1. 2. ...

Failure Logs [if any]

braindotai commented 2 years ago

Can you try time.perf_counter() instead of time.time()?

braindotai commented 2 years ago

Moving to CPU is going to potentially require synchronization as most GPU operations are asynchronous. So it depends on whether the GPU result is actually there and the time will often be due to other operations that aren’t yet completed.

sangkyuleeKOR commented 2 years ago

@braindotai Thanks for replying!! I tried time.perf_counter() instead of time.time(). then I found GPU to CPU time takes 0.01sec. your second opinion seems to correct. Is there any way to solve this synchronization problem??? should I wait until another gpu operation finished?

arnavmehta7 commented 1 year ago

I am also having same issue, postprocess takes a lot of time, the only thing it does is just uploading :(

pytorch / serve