pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0
4.14k stars 835 forks source link

Postprocess bottleneck in model_handler.py #1382

Open sangkyuleeKOR opened 2 years ago

sangkyuleeKOR commented 2 years ago

Please have a look at FAQ's and Troubleshooting guide, your query may be already addressed.

Your issue may already be reported! Please search on the issue tracker before creating one.

Context

Your Environment

Expected Behavior

I'm trying to serve pytorch model using torchserve. I created custom model handler and hear is the code below.

# custom handler file

# model_handler.py

"""
ModelHandler defines a custom model handler.
"""
import os
import cv2
import time
import torch
import base64
import numpy as np
from PIL import Image
from model import Classifier
from ts.torch_handler.base_handler import BaseHandler

class ModelHandler(BaseHandler):
    """
    A custom model handler implementation.
    """

    def __init__(self):
        self._context = None
        self.initialized = False
        self.explain = False
        self.model = None
        self.target = 0

    def initialize(self, context):
        """
        Initialize model. This will be called during model loading time
        :param context: Initial context contains model server system properties.
        :return:
        """
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        self._context = context

        if not torch.cuda.is_available() or properties.get("gpu_id") is None :
            raise RuntimeError("This model is not supported on CPU machines.")

        self.device = torch.device("cuda:" + str(properties.get("gpu_id")))
        self.model = self._load_model(model_dir).to(self.device)
        torch.cuda.synchronize()
        self.initialized = True

    def _load_model(self, model_dir):
        self.model = Classifier(model_name='Unet')
        self.model = self.model.load_from_checkpoint(os.path.join(model_dir, 'dot_model.ckpt'))
        # self.model = torch.jit.trace(self.model, torch.rand(1, 3, 320, 320))
        return self.model.eval()

    def preprocess(self, data):
        """
        Transform raw input into model input data.
        :param batch: list of raw requests, should match batch size
        :return: list of preprocessed model input data
        """
        # Take the input data and make it inference ready
        bytes = data[0].get("data")
        if bytes is None:
            bytes = data[0].get("body")
        d_imgs = base64.decodebytes(bytes)
        img_array = np.frombuffer(d_imgs, dtype=np.uint8).reshape((6,80,320,3))
        images = np.empty((6,320,320,3))
        for i, img in enumerate(img_array):
            img = np.resize(img, (320,320,3)) / 255.
            images[i,:] = img
        images = torch.tensor(images, dtype=torch.float32, device='cuda')
        images = images.permute(0,3,1,2)
        return images

    def inference(self, model_input):
        """
        Internal inference methods
        :param model_input: transformed model input data
        :return: list of inference output in NDArray
        """
        # Do some inference call to engine here and return output
        with torch.no_grad():
            model_output = self.model.forward(model_input)

        return model_output

    def postprocess(self, inference_output):
        """
        Return inference result.
        :param inference_output: list of inference output
        :return: list of predict results
        """
        # Take output from network and post-process to desired format
        final = time.time()
        torch.cuda.synchronize()
        print("sync-time:", time.time()-final)

        s = time.time()
        masks = inference_output.squeeze()
        print("squeezing time:",time.time()-s)

        s = time.time()
        masks = masks.detach()
        print("detaching time:",time.time()-s)

        s = time.time()
        masks = masks.cpu()
        print("cpu time:",time.time()-s)

        s = time.time()
        masks = masks.numpy()
        print("numpying time:",time.time()-s)

        s = time.time()
        masks = np.where(masks >= 0.90, 0, 1)
        pixels = np.count_nonzero(masks, axis=(1,2))
        results = pixels < 2
        results = results.tolist()
        print("resulting-time:", time.time()-s)

        print("total-time:", time.time()-final)

        return results

    def handle(self, data, context):
        """
        Invoke by TorchServe for prediction request.
        Do pre-processing of data, prediction using model and postprocessing of prediciton output
        :param data: Input data for prediction
        :param context: Initial context contains model server system properties.
        :return: prediction output
        """
        s = time.time()
        model_input = self.preprocess(data)
        # print("preprocess time:",time.time()-s)
        s = time.time()
        model_output = self.inference(model_input)
        # print("inference time:",time.time()-s)
        s = time.time()
        return_value = self.postprocess(model_output)
        print("postprocess time:",time.time()-s)
        return [return_value]

Current Behavior

But, When I send a request to torchserve, Postprocess has huge amount of time. so i tried to figure out which part is problem, I tried "torch.cuda.synchronize()" it took a lot of time to do, so I delete it. after that, torch.tensor to "cpu" time increase. cpu sync

I don't know why postprocess took a lot of time. In my opinion something has wrong in my setting. Please let me now If i'm wrong.

Possible Solution

Steps to Reproduce

1. 2. ...

Failure Logs [if any]

braindotai commented 2 years ago

Can you try time.perf_counter() instead of time.time()?

braindotai commented 2 years ago

Moving to CPU is going to potentially require synchronization as most GPU operations are asynchronous. So it depends on whether the GPU result is actually there and the time will often be due to other operations that aren’t yet completed.

sangkyuleeKOR commented 2 years ago

@braindotai Thanks for replying!! I tried time.perf_counter() instead of time.time(). then I found GPU to CPU time takes 0.01sec. your second opinion seems to correct. Is there any way to solve this synchronization problem??? should I wait until another gpu operation finished?

arnavmehta7 commented 1 year ago

I am also having same issue, postprocess takes a lot of time, the only thing it does is just uploading :(