pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

TorchServe slower (and lower GPU utilization) than Eager inference #1120

Closed. mspinaci closed this issue 2 years ago.

mspinaci commented 3 years ago

While running a (rather) big model on GPU on high-res images, I can easily achieve 100% GPU utilization in Eager mode. Porting the same model to TorchServe (and removing all CPU pre/postprocessing) still fails to achieve that, and runtime is almost twice as long.

(I'm relatively new to Pytorch and just started with TorchServe, so this might well be a problem on my end and not a bug, if so I apologize in advance!)

Context

Your Environment

Expected Behavior

I'll paste some code below showing how I achieved the reported results in Eager mode and in TorchServe. Unfortunately I cannot paste a reproducible test, because it's a large model with a lot of code in it, and the code is non-public and proprietary to the company I work for...

So the assumption is that I have a model.py file that contains an EncoderDecoder class (a subclass of nn.Module), with its weights saved in weights.pth. The inference code simply loads images from the images folder, rescales them to the high-res format we use (about 8 megapixels), feeds them to the model (using multiple CPU workers for loading), and then applies some bounding box postprocessing:

import os
import torch
import numpy as np
from tqdm import tqdm
from PIL import Image
from torch.utils.data import Dataset, DataLoader

from model import EncoderDecoder
from handler import torch_extract_boxes, resize_and_pad

class ImageDataset(Dataset):
    def __init__(self, path='images/',
                 extension='.jpg',
                 input_res=(3296, 2544)):
        self.filenames = [f for f in os.listdir(path) if f.endswith(extension)]
        self.path = path
        self.input_res = input_res

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, item):
        filename = self.filenames[item]
        image = Image.open(os.path.join(self.path, filename))
        return torch.from_numpy(resize_and_pad(image.convert('L'), self.input_res)[0][None])

def main(output_res=(824, 636)):
    dataset = ImageDataset()
    data_loader = DataLoader(dataset, num_workers=4, pin_memory=True)
    model = EncoderDecoder()
    model.eval()
    model.cuda(0)
    checkpoint = torch.load('weights.pth')
    model.load_state_dict(checkpoint)
    device = torch.device('cuda')

    with torch.no_grad():
        for image in tqdm(data_loader):
            preds = model(image.to(device))
            result = torch_extract_boxes(preds, output_res).cpu()[None].detach().numpy()

if __name__ == '__main__':
    main()

This achieves the expected behaviour: GPU is virtually always at 100% utilization (see magenta line below plotted with nvtop), and the loop takes 18 seconds for 50 images (of which a couple of seconds go into warming up the GPU):

[Screenshots, 2021-06-09: nvtop traces showing near-constant ~100% GPU utilization during the Eager loop]

Current Behavior

This does not happen when using TorchServe. I had to customize a few things so that it runs (in theory) a pure GPU workload, i.e. the input is the already preprocessed image as a bytearray representing a numpy array, and the output is the raw bytes of the numpy output array. To achieve this, I archived the model by running

torch-model-archiver --model-name model --version 1.0 \
--model-file model.py \
--serialized-file weights.pth \
--handler handler.py

mkdir -p model_store
mv model.mar model_store/

where handler.py is:

import torch
from math import log
import numpy as np
from PIL import Image
import base64

from ts.torch_handler.vision_handler import BaseHandler

class ModelHandler(BaseHandler):
    input_res = (3296, 2544)
    output_res = (824, 636)

    def preprocess(self, data):
        images = []

        for row in data:
            # Compat layer: normally the envelope should just return the data
            # directly, but older versions of Torchserve didn't have envelope.
            image = row.get("data") or row.get("body")
            if isinstance(image, str):
                # if the image is a string of bytesarray.
                image = base64.b64decode(image)

            assert isinstance(image, (bytearray, bytes)), f'Type of image {type(image)} not understood'

            linearized_image = np.frombuffer(image, dtype=np.float32)

            # Resize to image format by also adding channel dimension
            images.append(linearized_image.reshape((1,) + self.input_res))

        return torch.stack(images).to(self.device)

    def postprocess(self, data):
        result = torch_extract_boxes(data, self.output_res).cpu()[None].detach().numpy()
        return [r.tobytes() for r in result]

def torch_extract_boxes(preds, output_res):
    # Omitted as not relevant
    pass

def resize_and_pad(image, input_res):
    ratio = min(input_res[0] / image.height, input_res[1] / image.width)
    new_size = (int(image.width * ratio), int(image.height * ratio))
    img_cv = np.asarray(image.resize(new_size, resample=Image.BILINEAR))
    img_cv = img_cv.astype(np.float32) / 255.0
    pad_y = input_res[0] - new_size[1]
    pad_x = input_res[1] - new_size[0]
    padded_image = np.pad(img_cv, [[0, pad_y], [0, pad_x]])
    return padded_image, new_size, ratio

I can then start TorchServe with:

CUDA_VISIBLE_DEVICES=7 torchserve --start \
--model-store model_store \
--models model.mar \
--ts-config config.properties > log.txt

where config.properties is edited to use only one GPU and to allow large request payloads:

number_of_gpu=1
max_request_size=50000000

Finally, I call the model on the images using the following script:

import requests
import os
from PIL import Image
import numpy as np
from tqdm import tqdm

from handler import resize_and_pad

def main(path='images/',
         extension='.jpg',
         input_res=(3296, 2544),
         url='http://127.0.0.1:8080/predictions/model'):
    files = [f for f in os.listdir(path) if f.endswith(extension)]
    numpy_images = []
    print('Loading and converting images...')
    for f in tqdm(files):
        image = Image.open(os.path.join(path, f))
        numpy_images.append(resize_and_pad(image.convert('L'), input_res)[0])

    print('Issuing requests')
    results = []
    for np_image in tqdm(numpy_images):
        results.append(requests.post(url, data=np_image.tobytes()))

    print('Converting results')
    numpy_results = []
    for res in tqdm(results):
        x = np.frombuffer(res.content, dtype=np.float32).reshape(-1, 15)
        print(x.shape)
        numpy_results.append(x)

if __name__ == '__main__':
    main()

This results in alternating GPU utilization, and inference takes 30 s (almost twice as slow as above) for the GPU part alone (I do all the CPU pre/postprocessing in separate loops):

[Screenshots, 2021-06-09: nvtop traces showing GPU utilization alternating between busy and idle during the TorchServe run]

Final remarks

In earlier versions I tried some "less custom" handlers, namely transforming the numpy output into a list of floats (to be serialized by TorchServe), and passing the raw image bytes as input by subclassing VisionHandler. But:

1) For the output, that turned out to be a bad idea and I changed it quickly: the numpy -> list -> string conversions were extremely slow for the large output I have (around 10k rows and 15 columns).

2) For the input, subclassing VisionHandler hands the JPEG decoding and the creation of the numpy array over to TorchServe, and given that everything there happens sequentially, that also seemed like a bad idea.

In the version above, instead, everything is passed as low-level as possible. That does mean about 32 MB (= 3296 * 2544 * 4 bytes) of raw data sent via POST per image, but since it's all on localhost there is no network issue. The output is also rather large (around 1 MB, varying from image to image), but again, with the low-level encoding over localhost that shouldn't be an issue.
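Just to quantify point 1), here is a rough standalone sketch of the comparison (the shape is merely representative of my ~10k x 15 output, json.dumps stands in for TorchServe's own serialization, and the timings obviously depend on the machine):

import json
import time
import numpy as np

# Compare raw-bytes serialization with the numpy -> list -> string path
# for an output of roughly the size I have (~10k rows x 15 columns, float32).
result = np.random.randn(10000, 15).astype(np.float32)

start = time.perf_counter()
for _ in range(100):
    payload = result.tobytes()
print(f'tobytes:             {(time.perf_counter() - start) * 10:.2f} ms per call')

start = time.perf_counter()
for _ in range(100):
    payload = json.dumps(result.tolist())
print(f'tolist + json.dumps: {(time.perf_counter() - start) * 10:.2f} ms per call')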

Summing up

Is TorchServe just slower (e.g. due to the large file size), or have I done something stupid above? Thanks in advance for any hint!

juyunsang commented 3 years ago

@mspinaci Did you solve this?

mspinaci commented 3 years ago

Hello @juyunsang ,

Not really, but at least I have some theories. In my project I ended up using Eager mode behind a Flask server, which worked fine and caused a lot fewer headaches than TorchServe's export/import flow (e.g. it doesn't ask me to define a model class that takes no constructor arguments in a file containing no other nn.Module subclasses, which was really, really annoying). Plus it keeps production and training code closer together, which I also prefer.

Anyway, with some more digging, I think at least part of the slowdown might come from TorchServe being slow to ingest large request bodies. It's somewhat funny, because I had the very same problem with the Flask implementation when I put it behind Gunicorn or Waitress (for those interested, see https://stackoverflow.com/questions/67938278/waitress-and-gunicorn-large-data-input-is-much-slower-than-flask-development-ser/68079761).
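For what it's worth, the Flask setup I ended up with is essentially the minimal sketch below (simplified, no error handling; it reuses EncoderDecoder, torch_extract_boxes and the raw float32 bytes in / raw float32 bytes out protocol from the TorchServe handler above, while the route name and the lack of batching are just illustrative choices):

import numpy as np
import torch
from flask import Flask, request

from model import EncoderDecoder
from handler import torch_extract_boxes

INPUT_RES = (3296, 2544)
OUTPUT_RES = (824, 636)

app = Flask(__name__)
device = torch.device('cuda')
model = EncoderDecoder()
model.load_state_dict(torch.load('weights.pth'))
model.eval()
model.to(device)

@app.route('/predict', methods=['POST'])
def predict():
    # Raw float32 bytes in (already resized/padded on the client side),
    # raw float32 bytes of the box predictions out.
    x = np.frombuffer(request.get_data(), dtype=np.float32).copy()
    x = x.reshape((1, 1) + INPUT_RES)
    with torch.no_grad():
        preds = model(torch.from_numpy(x).to(device))
        boxes = torch_extract_boxes(preds, OUTPUT_RES).cpu().numpy()
    return boxes.tobytes()

if __name__ == '__main__':
    app.run()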

Here is a fully reproducible test that you can run to check the same:

model.py:

import torch
from torch import nn

# If True, the client sends a tiny 10x10 input and the model upsamples it to
# INPUT_SIZE; if False, the client sends the full INPUT_SIZE input directly.
SMALL_SIZE_INPUT = True
INPUT_SIZE = (3000, 3000)

class Model(nn.Module):
    def __init__(self):
        super().__init__()

        if SMALL_SIZE_INPUT:
            self.upsample = nn.Upsample(size=INPUT_SIZE)
        else:
            self.upsample = nn.Identity()

        self.conv = nn.Conv2d(
            in_channels=1,
            out_channels=1,
            kernel_size=3,
            padding=1
        )

    def forward(self, x):
        x = self.upsample(x)
        x = self.conv(x)
        return torch.mean(x)

time_eager.py:

import torch
import numpy as np
from tqdm import trange

from model import Model, SMALL_SIZE_INPUT, INPUT_SIZE

model = Model()
model.eval()
model.cuda()
device = torch.device('cuda')
# device = torch.device('cpu')
if not SMALL_SIZE_INPUT:
    x = np.random.randn(1, 1, INPUT_SIZE[0], INPUT_SIZE[1]).astype(np.float32)
else:
    x = np.random.randn(1, 1, 10, 10).astype(np.float32)
# Warm up
model(torch.from_numpy(x).to(device))

torch.cuda.synchronize()
for _ in trange(100):
    model(torch.from_numpy(x).to(device))
    torch.cuda.synchronize()

torch.save(model.state_dict(), 'weights.pth')

Running the above code in Eager mode I got the following timings:

The difference between the small and large input comes from converting a big numpy array to a torch tensor and sending it to the GPU, which is expected (if I do that just once outside of the loop, it ends up being even faster, at 670 it/s, I guess because it also saves the time of the Upsample step).
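(For reference, by "just once outside of the loop" I mean this variant of the timing loop, reusing model, x and device from time_eager.py above:)

# Move the input to the GPU once, outside the loop; this is the ~670 it/s
# case mentioned above.
x_gpu = torch.from_numpy(x).to(device)
torch.cuda.synchronize()
for _ in trange(100):
    model(x_gpu)
    torch.cuda.synchronize()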

Now, I try to do the same on TorchServe. To do so, I create handler.py:

import torch
from math import log
import numpy as np
from PIL import Image
import base64

from ts.torch_handler.vision_handler import BaseHandler

class ModelHandler(BaseHandler):
    input_res = (3000, 3000)

    def preprocess(self, data):
        images = []

        for row in data:
            # Compat layer: normally the envelope should just return the data
            # directly, but older versions of Torchserve didn't have envelope.
            image = row.get("data") or row.get("body")
            if isinstance(image, str):
                # if the image is a string of bytesarray.
                image = base64.b64decode(image)

            assert isinstance(image, (bytearray, bytes)), f'Type of image {type(image)} not understood'

            linearized_image = np.frombuffer(image, dtype=np.float32)

            # Resize to image format by also adding channel dimension
            images.append(linearized_image.reshape((1,) + self.input_res))

        return torch.stack(images).to(self.device)

and export the model with

torch-model-archiver --model-name model --version 1.0 \
--model-file model.py \
--serialized-file weights.pth \
--handler handler.py

mkdir model_store
mv *.mar model_store

Like above, I create config.properties including:

number_of_gpu=1
max_request_size=50000000

and start the server via:

CUDA_VISIBLE_DEVICES=0 torchserve --start \
--model-store model_store \
--models model.mar \
--ts-config config.properties > log.txt

When I then query this with the following request.py script:

import requests
import os
import numpy as np
from tqdm import trange

from model import INPUT_SIZE, SMALL_SIZE_INPUT

def main(url='http://127.0.0.1:8080/predictions/model'):
    if not SMALL_SIZE_INPUT:
        # Cast to float32: the handler decodes the payload with dtype=np.float32
        x = np.random.randn(1, 1, INPUT_SIZE[0], INPUT_SIZE[1]).astype(np.float32)
    else:
        x = np.random.randn(1, 1, 10, 10).astype(np.float32)

    # Warm-up request
    requests.post(url, data=x.tobytes())

    print('Issuing requests')
    results = []
    for _ in trange(100):
        results.append(requests.post(url, data=x.tobytes()))

if __name__ == '__main__':
    main()

what I get are the following measurements:

Needless to say, when running with the large input at a meagre ~8 it/s, GPU utilization is basically 0%. I suppose all the time is spent just reading the incoming data in whatever server TorchServe uses to wait for requests (it doesn't seem to be the numpy array -> torch tensor -> GPU conversion, because that also happens in Eager mode). For Gunicorn, the solution proposed in the Stack Overflow answer linked above was to increase the amount of data copied into the buffer in one go (by default Gunicorn copies 1 kB of incoming data per step of a while loop, so in this case that becomes a big while loop with about 30k steps); for TorchServe I don't know :-)
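To double-check that it really is payload ingestion and not the model, one quick experiment (just a sketch; the sink endpoint and port are made up) would be to point the request.py loop above at a do-nothing endpoint and compare the it/s, e.g. a trivial Flask dev server:

# sink_server.py: a do-nothing endpoint that only reads the request body,
# to measure payload ingestion time in isolation (no model involved).
from flask import Flask, request

app = Flask(__name__)

@app.route('/sink', methods=['POST'])
def sink():
    return str(len(request.get_data()))

if __name__ == '__main__':
    app.run(port=5000)

Then change the url argument in request.py to http://127.0.0.1:5000/sink and compare the iteration rate with the TorchServe numbers.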

chauhang commented 3 years ago

@mspinaci Can you please share details of the model and input sizes:

HamidShojanazeri commented 3 years ago

@mspinaci Thanks for opening this ticket and sorry for the delay, I am looking into this issue and was wondering if you could share some more details.

mspinaci commented 3 years ago

Hello @chauhang and @HamidShojanazeri, sorry for the late answer, I just came back from vacation. Please find below the answers to your questions (for details, see the fully reproducible example I posted a couple of comments up...)

HamidShojanazeri commented 3 years ago

@mspinaci Thanks for the explanations. I agree with you that input size matters the most in this case; unfortunately, I couldn't repro using your example. When the bigger input from your example is used, the request actually does not go through (even though tqdm reports ~8 iter/s); you can check this in access.log, model_log.log and ts_log.log. When the small input is selected, it hits an error in the preprocessing of the base handler, and the error is reflected in model_log.log. I am looking into this issue and trying to repro; it would be great if you could share what you get as the result of running the above example.

-- update: if you replace the requests.post call with a curl command as below:

for _ in trange(10):
    os.system("curl http://127.0.0.1:8080/predictions/model --data-binary x")

although it still fails at the preprocessing step ("linearized_image = np.frombuffer(image, dtype=np.float32)"), the iterations/s go up to 72 in our tests.
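Side note: the curl command above posts the literal string "x" rather than the array, which is likely why preprocessing still fails. To send the actual array bytes with curl, something along these lines should work ("payload.bin" is just an illustrative filename):

import os
import numpy as np
from tqdm import trange

# Dump the float32 array to a file once, then let curl read it via the
# @file syntax so the real bytes are sent as the request body.
np.random.randn(1, 1, 3000, 3000).astype(np.float32).tofile('payload.bin')
for _ in trange(10):
    os.system('curl -s -o /dev/null http://127.0.0.1:8080/predictions/model '
              '--data-binary @payload.bin')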

Logs https://gist.github.com/HamidShojanazeri/ea0320fc3e4be3cb93ca580ebf64d58e

Can you please run some tests (with a working preprocessing step) to make sure that data is actually reaching the backend (by checking the access/model/ts logs), and let us know what numbers you are seeing.

HamidShojanazeri commented 3 years ago

@mspinaci There is also another workaround: instead of sending the large payloads over HTTP, you can read them in the handler from a shared location (e.g. an S3 bucket). Please let us know if that could work for your use case.
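A rough sketch of what such a handler could look like (illustrative only: the JSON body format, the "path" field and the file locations are placeholders, not an existing TorchServe feature):

# Handler sketch: the request body is a small JSON payload pointing at a
# pre-dumped float32 array on shared storage, instead of the raw array bytes.
import json
import numpy as np
import torch

from ts.torch_handler.base_handler import BaseHandler

class PathModelHandler(BaseHandler):
    input_res = (3000, 3000)

    def preprocess(self, data):
        images = []
        for row in data:
            body = row.get("data") or row.get("body")
            if isinstance(body, (bytes, bytearray)):
                body = body.decode("utf-8")
            if not isinstance(body, dict):
                body = json.loads(body)  # e.g. {"path": "/shared/inputs/img_0001.bin"}
            array = np.fromfile(body["path"], dtype=np.float32)
            images.append(torch.from_numpy(array.reshape((1,) + self.input_res)))
        return torch.stack(images).to(self.device)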