TorchServe slower (and lower GPU utilization) than Eager inference

While running a (rather) big model on GPU on high res images I can easily achieve 100% GPU utilization in Eager mode. Porting the same model to TorchServe (and removing all CPU pre/post processing) still fails to achieve it, and runtime is almost twice as much.

(I'm relatively new to Pytorch and just started with TorchServe, so this might well be a problem on my end and not a bug, if so I apologize in advance!)

Context

torchserve version: 0.4.0
torch-model-archiver version: 0.4.0
torch version: 1.8.1+cu101
torchvision version [if any]: 0.9.1+cu101 (not used?)
java version: 11.0.11
Operating System and version: Ubuntu 18.04.3 LTS (in a custom Docker container)

Your Environment

Installed using source? [yes/no]: no
Are you planning to deploy it using docker container? [yes/no]: yes
Is it a CPU or GPU environment?: GPU
Using a default/custom handler? [If possible upload/share custom handler/model]: Custom, see below
What kind of model is it e.g. vision, text, audio?: vision
Are you planning to use local models from model-store or public url being used e.g. from S3 bucket etc.? [If public url then provide link.]: local model
Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs: see below
Link to your project [if any]:

Expected Behavior

I'll paste some code below of how I achieved the reported results in Eager mode and in TorchServe. Unfortunately I cannot paste a replicable test because it's a large model, with a lot of code in it, and the code is non-public and proprietary for the company I work in...

So the assumption is I have a model.py file that contains an EncoderDecoder class (subclass of nn.Module). Its weights are saved in weights.pth. The inference code simply loads images from the images folder, rescales them to the high res format we use (about 8 megapixel), feeds to the model (using multiple CPU workers), and then some bounding box postprocessing is applied:

import os
import torch
import numpy as np
from tqdm import tqdm
from PIL import Image
from torch.utils.data import Dataset, DataLoader

from model import EncoderDecoder
from handler import torch_extract_boxes, resize_and_pad

class ImageDataset(Dataset):
    def __init__(self, path='images/',
                 extension='.jpg',
                 input_res=(3296, 2544)):
        self.filenames = [f for f in os.listdir(path) if f.endswith(extension)]
        self.path = path
        self.input_res = input_res

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, item):
        filename = self.filenames[item]
        image = Image.open(os.path.join(self.path, filename))
        return torch.from_numpy(resize_and_pad(image.convert('L'), self.input_res)[0][None])

def main(output_res=(824, 636)):
    dataset = ImageDataset()
    data_loader = DataLoader(dataset, num_workers=4, pin_memory=True)
    model = EncoderDecoder()
    model.eval()
    model.cuda(0)
    checkpoint = torch.load('weights.pth')
    model.load_state_dict(checkpoint)
    device = torch.device('cuda')

    with torch.no_grad():
        for image in tqdm(data_loader):
            preds = model(image.to(device))
            result = torch_extract_boxes(preds, output_res).cpu()[None].detach().numpy()

if __name__ == '__main__':
    main()

This achieves the expected behaviour: GPU is virtually always at 100% utilization (see magenta line below plotted with nvtop), and the loop takes 18 seconds for 50 images (of which a couple of seconds go into warming up the GPU):

Current Behavior

This does not happen when using TorchServe. I had to customize a few things to allow it to run (in theory) pure GPU processes, i.e. the input is the already processed image as a (bytearray representing) a numpy array, and the output is the raw bytes of the numpy array output. To achieve this, I archived the model running

torch-model-archiver --model-name model --version 1.0 \
--model-file model.py \
--serialized-file weights.pth \
--handler  handler.py

mkdir -p model_store
mv model.mar model_store/

where handler.py is:

import torch
from math import log
import numpy as np
from PIL import Image
import base64

from ts.torch_handler.vision_handler import BaseHandler

class ModelHandler(BaseHandler):
    input_res = (3296, 2544)
    output_res = (824, 636)

    def preprocess(self, data):
        images = []

        for row in data:
            # Compat layer: normally the envelope should just return the data
            # directly, but older versions of Torchserve didn't have envelope.
            image = row.get("data") or row.get("body")
            if isinstance(image, str):
                # if the image is a string of bytesarray.
                image = base64.b64decode(image)

            assert isinstance(image, (bytearray, bytes)), f'Type of image {type(image)} not understood'

            linearized_image = np.frombuffer(image, dtype=np.float32)

            # Resize to image format by also adding channel dimension
            images.append(linearized_image.reshape((1,) + self.input_res))

        return torch.stack(images).to(self.device)

    def postprocess(self, data):
        result = torch_extract_boxes(data, self.output_res).cpu()[None].detach().numpy()
        return [r.tobytes() for r in result]

def torch_extract_boxes(preds, output_res):
    pass
    # Omitted as not relevant

def resize_and_pad(image, input_res):
    ratio = min(input_res[0] / image.height, input_res[1] / image.width)
    new_size = (int(image.width * ratio), int(image.height * ratio))
    img_cv = np.asarray(image.resize(new_size, resample=Image.BILINEAR))
    img_cv = img_cv.astype(np.float32) / 255.0
    pad_y = input_res[0] - new_size[1]
    pad_x = input_res[1] - new_size[0]
    padded_image = np.pad(img_cv, [[0, pad_y], [0, pad_x]])
    return padded_image, new_size, ratio

I can then start TorchServe with:

CUDA_VISIBLE_DEVICES=7 torchserve --start \
--model-store model_store \
--models model.mar \
--ts-config config.properties > log.txt

where config.properties is edited to specify only one GPU and large input files:

number_of_gpu=1
max_request_size=50000000

Finally, I call the model on the images using the following script:

import requests
import os
from PIL import Image
import numpy as np
from tqdm import tqdm

from .handler import resize_and_pad

def main(path='images/',
         extension='.jpg',
         input_res=(3296, 2544),
         url='http://127.0.0.1:8080/predictions/model'):
    files = [f for f in os.listdir(path) if f.endswith(extension)]
    numpy_images = []
    print('Loading and converting images...')
    for f in tqdm(files):
        image = Image.open(os.path.join(path, f))
        numpy_images.append(resize_and_pad(image.convert('L'), input_res)[0])

    print('Issuing requests')
    results = []
    for np_image in tqdm(numpy_images):
        results.append(requests.post(url, data=np_image.tobytes()))

    print('Converting results')
    numpy_results = []
    for res in tqdm(results):
        x = np.frombuffer(res.content, dtype=np.float32).reshape(-1, 15)
        print(x.shape)
        numpy_results.append(x)

if __name__ == '__main__':
    main()

This results in alternated utilization of GPU and inference taking 30s (almost twice as slow as above) just for the GPU part (I do all the CPU pre/postprocessing on different loops):

Final remarks

In earlier versions I tried some "less custom" handlers, namely to transform the numpy output to a list of float, which is then handled by TorchServe, and to pass the image bytes as input subclassing VisionHandler. But: 1) For the output that was a bad idea and I changed soon, because the transformations numpy -> list -> string were extremely slow for the large output I had (around 10k rows and 15 columns). 2) For Image, that handed the JPEG decoding + creation of numpy array to TorchServe, and given that everything happens sequentially that might be a bad idea. In this version, instead, everything is passed as low-level as possible, which does result in a 32MB (= 329625444 bytes) of string being sent via POST, but given it's all localhost there's no network issue. Output is also rather large (around 1MB, varies from image to image), but also with the low-level encoding over localhost that shouldn't be an issue.

Summing up

Is TorchServe just slower (e.g. due to the large file size), or have I done something stupid above? Thanks in advance for any hint!

@mspinaci Did you solve this?

Hello @juyunsang ,

Not really, but at least I have some theories. In my project I ended up using Eager mode behind a Flask server, which worked fine and had also a lot less headaches than TorchServe to export / import (e.g. not asking me to create a model without parameters inside a file with no other instances of nn.Module, that was really really annoying). Plus it makes production and training closer, which I also prefer.

Anyway, with some more digging, I think at least part of the slowdown might come from torchserve being slow to process large input files. It's a somehow funny because I had the very same problem with the Flask implementation if I used GUnicorn or Waitress (for those interested, see here: https://stackoverflow.com/questions/67938278/waitress-and-gunicorn-large-data-input-is-much-slower-than-flask-development-ser/68079761)

Here is a fully reproducible test that you can run to check the same:

model.py:

import torch
from torch import nn

SMALL_SIZE_INPUT = True
INPUT_SIZE = (3000, 3000)

class Model(nn.Module):
    def __init__(self):
        super().__init__()

        if SMALL_SIZE_INPUT:
            self.upsample = nn.Upsample(size=INPUT_SIZE)
        else:
            self.upsample = nn.Identity()

        self.conv = nn.Conv2d(
            in_channels=1,
            out_channels=1,
            kernel_size=3,
            padding=1
        )

    def forward(self, x):
        x = self.upsample(x)
        x = self.conv(x)
        return torch.mean(x)

time_eager.py:

import torch
import numpy as np
from tqdm import trange

from model import Model, SMALL_SIZE_INPUT, INPUT_SIZE

model = Model()
model.eval()
model.cuda()
device = torch.device('cuda')
# device = torch.device('cpu')
if not SMALL_SIZE_INPUT:
    x = np.random.randn(1, 1, INPUT_SIZE[0], INPUT_SIZE[1]).astype(np.float32)
else:
    x = np.random.randn(1, 1, 10, 10).astype(np.float32)
# Warm up
model(torch.from_numpy(x).to(device))

torch.cuda.synchronize()
for _ in trange(100):
    model(torch.from_numpy(x).to(device));
    torch.cuda.synchronize()

torch.save(model.state_dict(), 'weights.pth')

Running the above code in Eager mode I got the following timings:

Running on V100 GPU and SMALL_SIZE_INPUT = True, around 570 it/s
... and SMALL_SIZE_INPUT = False, around 170 it/s
For reference, running on CPU I got around 12 it/s with either setting of the SMALL_SIZE_INPUT flag (I guess convolution becomes the bottleneck).

The difference between small and large size comes from transforming a big numpy array to torch tensor and sending it to GPU, which is expected (if I do that just once outside of the loop, it ends up being even faster at 670 it/s, I guess because it saves time not doing the Upsample step)

Now, I try to do the same on TorchServe. To do so, I create handler.py:

import torch
from math import log
import numpy as np
from PIL import Image
import base64

from ts.torch_handler.vision_handler import BaseHandler

class ModelHandler(BaseHandler):
    input_res = (3000, 3000)

    def preprocess(self, data):
        images = []

        for row in data:
            # Compat layer: normally the envelope should just return the data
            # directly, but older versions of Torchserve didn't have envelope.
            image = row.get("data") or row.get("body")
            if isinstance(image, str):
                # if the image is a string of bytesarray.
                image = base64.b64decode(image)

            assert isinstance(image, (bytearray, bytes)), f'Type of image {type(image)} not understood'

            linearized_image = np.frombuffer(image, dtype=np.float32)

            # Resize to image format by also adding channel dimension
            images.append(linearized_image.reshape((1,) + self.input_res))

        return torch.stack(images).to(self.device)

and export the model with

torch-model-archiver --model-name model --version 1.0 \
--model-file model.py \
--serialized-file weights.pth \
--handler handler.py

mkdir model_store
mv *.mar model_store

Like above, I create config.properties including:

number_of_gpu=1
max_request_size=50000000

and start the server via:

CUDA_VISIBLE_DEVICES=0 torchserve --start \
--model-store model_store \
--models model.mar \
--ts-config config.properties > log.txt

When I then query this with the following request.py script:

import requests
import os
import numpy as np
from tqdm import trange

from model import INPUT_SIZE, SMALL_SIZE_INPUT

def main(url='http://127.0.0.1:8080/predictions/model'):
    if not SMALL_SIZE_INPUT:
        x = np.random.randn(1, 1, INPUT_SIZE[0], INPUT_SIZE[1])
    else:
        x = np.random.randn(1, 1, 10, 10)

    # Warm-up request
    requests.post(url, data=x.tobytes())

    print('Issuing requests')
    results = []
    for _ in trange(100):
        results.append(requests.post(url, data=x.tobytes()))

if __name__ == '__main__':
    main()

what I get are the following measurements:

Running on V100 GPU and SMALL_SIZE_INPUT = True, around 300 it/s
... and SMALL_SIZE_INPUT = False it drops to just 8.17 it/s!!

Needless to say, when running with large data at a meagre ~ 8 it/s, GPU utilization is basically 0%. I suppose all the time is spent just reading the incoming data by whatever server TorchServe is using to wait for data (it doesn't seem to be the conversion numpy array -> torch tensor -> GPU, because that also happens in Eager mode). For GUnicorn the solution proposed by the person who helped out on the stackoverflow above was to increase the amount of data copied to buffer in one go (by default GUnicorn copies 1kb of incoming data per step in a while loop, so in this case it becomes a big while loop with about 30k steps), for TorchServe I don't know :-)

@mspinaci Can you please share details of the model and inputs sizes:

How large is your model (rough size eg ~100G)?
Size of your typical input image data file that you are processing
Sample image file (if you can share that)
Are you using batch processing or single inference?
Have you run the benchmark / snakeviz profiler on your model? If yes please share the details for it
Did you try out gRPC API already for large image files?

@mspinaci Thanks for opening this ticket and sorry for the delay, I am looking into this issue and was wondering if you could share some more details.

What is the model size?
Have you tried scaling the number of workers?
Are you using batch inference -- usually the better gpu utilization come from higher batch size
Please share your observation if you have monitored the handler time, inference time, queue time from ts_log.log -- these will add more insights.
Do you have any target throughput/ latency.

Hello @chauhang and @HamidShojanazeri , sorry for the late answer, I just came back from vacation. Please find below answer to your questions (for details, see the fully reproducible example I posted a couple of comments up...)

Model size: it doesn't really matter that much. If the model is big, the problem is harder to see because the model time becomes the bottleneck, so in the example I used as tiny a model as I could (1 single 3x3 convolutional layer with 1 input channel and 1 output channel, so I guess 9 parameters, i.e. 36 bytes in total?)
Size of the input image matters much more, as, I think, it's indeed an issue with the data pipeline (i.e. loading input data). If I pass a tiny image (10x10 image size, i.e. if I set SMALL_INPUT_SIZE=True in the code) there is no problem (300 it/s), with a large image (3000x3000 image size, i.e. about 36MB, with the flag = False in the code) then it plummets (about 8 it/s)
I don't have a sample image, I create it on the fly with random stuff, see the code
No batch, single inference. GPU usage is not even barely the throughput, as it is all about handling data and GPU is basically not in use...
For timing, I just used tqdm. I haven't tried using profiling, if I did that in the client requesting images I'd just see that it's waiting for an answer, I guess one would need to use profiling inside the call to torchserve but I haven't tried that.
I haven't tried gRPC API, it might well be that it solves the issue, but I decided I had spent enough time with a technology I ended up not using... (since Eager mode solved my problem well, and was easier to use than TorchServe)
I haven't tried changing number of workers either, but I'd be a bit surprised if that mattered, because I am querying the model in a synchronous way
I don't have a target throughput in mind, but given that the setup is doing essentially nothing except moving some data from the client to the server, I'd expect it to be almost as fast as just allocating that amount of memory. Since one image is around 36MB, I'd expect at least 20-30 iterations per seconds? (i.e. at least one GB / second).

@mspinaci Thanks for the explanations, I agree with you that input size matter the most in this case, unfortunately, I couldn't repro using your example. In case of using the bigger input as provided in your example, it actually is not going through (it shows 8 iter/s as a result of tqdm), you can check this in the access.log, model_log.log and ts_log.log, when the small input is selected, it gets some error in preprocessing of the base-handler and the error get reflected in the model_log.log, . I am looking into this issue/ try to repro, it would be great if you could share what you get as the result of running the above example?

-- update if you replace the request.post with curl command as below:

for _ in trange(10):
        os.system("curl http://127.0.0.1:8080/predictions/model --data-binary x")

although it still fails at the preprocessing "linearized_image = np.frombuffer(image, dtype=np.float32)" but the iteration/s goes up to 72 in our tests.

Logs https://gist.github.com/HamidShojanazeri/ea0320fc3e4be3cb93ca580ebf64d58e

Can you please run some tests (with the working preprocessing step) to make sure that data is passed to the backend (through checking the logs / access/model/ts logs) and let us know what number you are seeing.

@mspinaci there is also another workaround, that instead of sending the large payloads over http, you can read it in the handler from a shared location (e.g s3 bucket), please let us know if that can work for your use-case.

pytorch / serve