@mspinaci Did you solve this?
Hello @juyunsang ,
Not really, but at least I have some theories. In my project I ended up using Eager mode behind a Flask server, which worked fine and also caused a lot fewer headaches than TorchServe for exporting / importing (e.g. it doesn't ask me to create a model without parameters inside a file with no other instances of `nn.Module`, which was really annoying). Plus it keeps production and training closer, which I also prefer.
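For reference, a minimal sketch of what I mean by "Eager mode behind a Flask server" could look like the following (reusing the toy `Model` from the reproducible example below; the endpoint name and the raw-bytes protocol are just illustrative assumptions, not my actual production code):

```python
# flask_server.py -- minimal sketch of serving the eager model behind Flask.
# Endpoint name, byte protocol and weight path are illustrative assumptions.
import numpy as np
import torch
from flask import Flask, request

from model import Model, INPUT_SIZE

app = Flask(__name__)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Model().eval().to(device)
model.load_state_dict(torch.load('weights.pth', map_location=device))


@app.route('/predict', methods=['POST'])
def predict():
    # The client sends the raw float32 bytes of a (1, 1, H, W) input array.
    x = np.frombuffer(request.get_data(), dtype=np.float32).reshape((1, 1) + INPUT_SIZE)
    with torch.no_grad():
        y = model(torch.from_numpy(x.copy()).to(device))
    # Return the raw bytes of the output, mirroring the input protocol.
    return y.cpu().numpy().tobytes()


if __name__ == '__main__':
    app.run(host='127.0.0.1', port=5000)
```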
Anyway, with some more digging, I think at least part of the slowdown might come from TorchServe being slow to process large input files. It's somewhat funny because I had the very same problem with the Flask implementation if I used GUnicorn or Waitress (for those interested, see here: https://stackoverflow.com/questions/67938278/waitress-and-gunicorn-large-data-input-is-much-slower-than-flask-development-ser/68079761).
Here is a fully reproducible test that you can run to check the same:
`model.py`:

```python
import torch
from torch import nn

SMALL_SIZE_INPUT = True
INPUT_SIZE = (3000, 3000)


class Model(nn.Module):
    def __init__(self):
        super().__init__()
        if SMALL_SIZE_INPUT:
            self.upsample = nn.Upsample(size=INPUT_SIZE)
        else:
            self.upsample = nn.Identity()
        self.conv = nn.Conv2d(
            in_channels=1,
            out_channels=1,
            kernel_size=3,
            padding=1
        )

    def forward(self, x):
        x = self.upsample(x)
        x = self.conv(x)
        return torch.mean(x)
```
`time_eager.py`:

```python
import torch
import numpy as np
from tqdm import trange

from model import Model, SMALL_SIZE_INPUT, INPUT_SIZE

model = Model()
model.eval()
model.cuda()
device = torch.device('cuda')
# device = torch.device('cpu')

if not SMALL_SIZE_INPUT:
    x = np.random.randn(1, 1, INPUT_SIZE[0], INPUT_SIZE[1]).astype(np.float32)
else:
    x = np.random.randn(1, 1, 10, 10).astype(np.float32)

# Warm up
model(torch.from_numpy(x).to(device))
torch.cuda.synchronize()

for _ in trange(100):
    model(torch.from_numpy(x).to(device))
torch.cuda.synchronize()

torch.save(model.state_dict(), 'weights.pth')
```
Running the above code in Eager mode I got the following timings:
- `SMALL_SIZE_INPUT = True`: around 570 it/s
- `SMALL_SIZE_INPUT = False`: around 170 it/s
- On CPU the timings barely depend on the `SMALL_SIZE_INPUT` flag (I guess convolution becomes the bottleneck).

The difference between small and large size comes from transforming a big numpy array into a torch tensor and sending it to the GPU, which is expected (if I do that just once outside of the loop, it ends up being even faster at 670 it/s, I guess because it saves time by not doing the Upsample step).
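For reference, the 670 it/s figure comes from hoisting the numpy -> tensor -> GPU copy out of the timed loop (presumably with `SMALL_SIZE_INPUT = False` in `model.py`); a sketch of that variant of `time_eager.py`:

```python
# Variant of time_eager.py where the host-to-device copy happens once, outside
# the timed loop, so the loop only measures the forward pass.
import numpy as np
import torch
from tqdm import trange

from model import Model, INPUT_SIZE

device = torch.device('cuda')
model = Model().eval().to(device)

x = np.random.randn(1, 1, INPUT_SIZE[0], INPUT_SIZE[1]).astype(np.float32)
x_gpu = torch.from_numpy(x).to(device)  # conversion + copy done once

model(x_gpu)  # warm up
torch.cuda.synchronize()
for _ in trange(100):
    model(x_gpu)
torch.cuda.synchronize()
```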
Now, I try to do the same on TorchServe. To do so, I create `handler.py`:
```python
import base64

import numpy as np
import torch

from ts.torch_handler.base_handler import BaseHandler


class ModelHandler(BaseHandler):
    # NB: assumes every request carries a full-resolution (3000, 3000) float32 image.
    input_res = (3000, 3000)

    def preprocess(self, data):
        images = []
        for row in data:
            # Compat layer: normally the envelope should just return the data
            # directly, but older versions of TorchServe didn't have envelope.
            image = row.get("data") or row.get("body")
            if isinstance(image, str):
                # If the image is a base64-encoded string of the byte array.
                image = base64.b64decode(image)
            assert isinstance(image, (bytearray, bytes)), f'Type of image {type(image)} not understood'
            linearized_image = np.frombuffer(image, dtype=np.float32)
            # Reshape to image format (adding the channel dimension) and convert
            # to a tensor so that torch.stack below works.
            images.append(torch.from_numpy(linearized_image.reshape((1,) + self.input_res).copy()))
        return torch.stack(images).to(self.device)
```
and export the model with
```bash
torch-model-archiver --model-name model --version 1.0 \
    --model-file model.py \
    --serialized-file weights.pth \
    --handler handler.py
mkdir model_store
mv *.mar model_store
```
Like above, I create `config.properties` including:

```
number_of_gpu=1
max_request_size=50000000
```
and start the server via:
```bash
CUDA_VISIBLE_DEVICES=0 torchserve --start \
    --model-store model_store \
    --models model.mar \
    --ts-config config.properties > log.txt
```
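Before timing anything, it can be useful to check that the server actually came up, e.g. with a quick call against the default inference port (assuming the standard `/ping` health endpoint):

```python
import requests

# TorchServe's inference API exposes a health endpoint on port 8080 by default.
r = requests.get('http://127.0.0.1:8080/ping')
print(r.status_code, r.text)  # expect 200 and a "Healthy" status
```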
When I then query this with the following `request.py` script:
```python
import requests
import os
import numpy as np
from tqdm import trange

from model import INPUT_SIZE, SMALL_SIZE_INPUT


def main(url='http://127.0.0.1:8080/predictions/model'):
    # The handler reads the payload as float32, so cast explicitly
    # (np.random.randn returns float64 by default).
    if not SMALL_SIZE_INPUT:
        x = np.random.randn(1, 1, INPUT_SIZE[0], INPUT_SIZE[1]).astype(np.float32)
    else:
        x = np.random.randn(1, 1, 10, 10).astype(np.float32)

    # Warm-up request
    requests.post(url, data=x.tobytes())

    print('Issuing requests')
    results = []
    for _ in trange(100):
        results.append(requests.post(url, data=x.tobytes()))


if __name__ == '__main__':
    main()
```
what I get are the following measurements:
- `SMALL_SIZE_INPUT = True`: around 300 it/s
- `SMALL_SIZE_INPUT = False`: it drops to just 8.17 it/s!!

Needless to say, when running with large data at a meagre ~8 it/s, GPU utilization is basically 0%. I suppose all the time is spent just reading the incoming data by whatever server TorchServe uses to wait for data (it doesn't seem to be the numpy array -> torch tensor -> GPU conversion, because that also happens in Eager mode). For GUnicorn, the solution proposed by the person who helped out on the Stack Overflow question above was to increase the amount of data copied to the buffer in one go (by default GUnicorn copies 1kB of incoming data per step in a while loop, so in this case it becomes a big while loop with about 30k steps); for TorchServe I don't know :-)
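Just to illustrate why the read chunk size matters so much, here is a toy comparison in plain Python (this is not GUnicorn's actual code, just copying a ~36MB body out of an in-memory stream in 1kB vs 1MB chunks):

```python
import io
import time

import numpy as np

# Toy illustration (not Gunicorn internals): copy a large "request body" out of
# a stream in small vs. large chunks and compare the time per request.
payload = np.random.randn(1, 1, 3000, 3000).astype(np.float32).tobytes()  # ~36 MB


def read_all(stream, chunk_size):
    buf = io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf.write(chunk)
    return buf.getvalue()


for chunk_size in (1024, 1024 * 1024):
    start = time.perf_counter()
    for _ in range(10):
        read_all(io.BytesIO(payload), chunk_size)
    elapsed = (time.perf_counter() - start) / 10
    print(f'chunk_size={chunk_size:>8}: {elapsed * 1000:.1f} ms per ~36MB body')
```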
@mspinaci Can you please share details of the model and input sizes?
@mspinaci Thanks for opening this ticket and sorry for the delay, I am looking into this issue and was wondering if you could share some more details.
Hello @chauhang and @HamidShojanazeri, sorry for the late answer, I just came back from vacation. Please find below the answers to your questions (for details, see the fully reproducible example I posted a couple of comments up...)
`torchserve`, but I haven't tried that.

@mspinaci Thanks for the explanations. I agree with you that the input size matters the most in this case; unfortunately, I couldn't repro using your example. When using the bigger input as provided in your example, it actually does not go through (the 8 iter/s is just what tqdm reports); you can check this in access.log, model_log.log and ts_log.log. When the small input is selected, it gets an error in the preprocessing of the base handler, and the error is reflected in model_log.log. I am looking into this issue / trying to repro; it would be great if you could share what you get as the result of running the above example.
-- Update: if you replace the requests.post call with a curl command as below:
```python
for _ in trange(10):
    os.system("curl http://127.0.0.1:8080/predictions/model --data-binary x")
```
although it still fails at the preprocessing step (`linearized_image = np.frombuffer(image, dtype=np.float32)`), the iterations/s go up to 72 in our tests.

Can you please run some tests (with a working preprocessing step) to make sure that data is passed to the backend (by checking the access/model/ts logs) and let us know what numbers you are seeing.
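Something along these lines on the client side (a sketch built on top of the request.py script above, nothing TorchServe-specific) would also make it obvious whether the requests actually succeed rather than fail fast:

```python
# Quick check that the predictions actually go through: count status codes and
# print the first failing response body, instead of only looking at iterations/s.
from collections import Counter

import numpy as np
import requests

url = 'http://127.0.0.1:8080/predictions/model'
x = np.random.randn(1, 1, 3000, 3000).astype(np.float32)  # the large-input case

codes = Counter()
first_error = None
for _ in range(10):
    r = requests.post(url, data=x.tobytes())
    codes[r.status_code] += 1
    if r.status_code != 200 and first_error is None:
        first_error = r.text

print(codes)
if first_error:
    print(first_error)
```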
@mspinaci There is also another workaround: instead of sending the large payloads over HTTP, you can read them in the handler from a shared location (e.g. an S3 bucket). Please let us know if that could work for your use-case.
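A rough sketch of that idea, using a shared local directory instead of S3 and a made-up `path` field in a small JSON request body (both are assumptions, not an official API), could be:

```python
# Sketch of the "shared location" workaround: the request only carries a small
# reference (here a file path on a shared volume), and the handler loads the
# actual array itself, so the large payload never travels over HTTP.
import json

import numpy as np
import torch

from ts.torch_handler.base_handler import BaseHandler


class SharedLocationHandler(BaseHandler):
    def preprocess(self, data):
        tensors = []
        for row in data:
            body = row.get('data') or row.get('body')
            if isinstance(body, (bytes, bytearray)):
                body = json.loads(body.decode('utf-8'))
            elif isinstance(body, str):
                body = json.loads(body)
            # Hypothetical convention: the client writes a (1, H, W) .npy file
            # to the shared volume and sends its path in the request.
            array = np.load(body['path']).astype(np.float32)
            tensors.append(torch.from_numpy(array))
        return torch.stack(tensors).to(self.device)
```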
While running a (rather) big model on GPU on high-res images I can easily achieve 100% GPU utilization in Eager mode. Porting the same model to TorchServe (and removing all CPU pre/postprocessing) still fails to achieve that, and the runtime is almost twice as long.
(I'm relatively new to Pytorch and just started with TorchServe, so this might well be a problem on my end and not a bug, if so I apologize in advance!)
Context
Your Environment
Expected Behavior
I'll paste some code below of how I achieved the reported results in Eager mode and in TorchServe. Unfortunately I cannot paste a replicable test because it's a large model with a lot of code in it, and the code is non-public and proprietary to the company I work for...
So the assumption is I have a `model.py` file that contains an `EncoderDecoder` class (a subclass of `nn.Module`). Its weights are saved in `weights.pth`. The inference code simply loads images from the `images` folder, rescales them to the high-res format we use (about 8 megapixels), feeds them to the model (using multiple CPU workers), and then applies some bounding box postprocessing:

This achieves the expected behaviour: the GPU is virtually always at 100% utilization (see the magenta line below, plotted with nvtop), and the loop takes 18 seconds for 50 images (of which a couple of seconds go into warming up the GPU):
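The actual inference script isn't included here, but a rough sketch of the kind of loop described above (all names, sizes and transforms are illustrative, and `EncoderDecoder` stands in for the proprietary model) would be:

```python
# Rough sketch of the eager inference loop described above (not the real code):
# CPU workers load and rescale images, the GPU runs the EncoderDecoder, and the
# bounding-box postprocessing happens on the outputs. All names are illustrative.
import glob

import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms

from model import EncoderDecoder  # hypothetical: the proprietary model class

HIGH_RES = (2448, 3264)  # ~8 megapixels; the exact resolution is an assumption


class ImageFolderDataset(Dataset):
    def __init__(self, folder='images'):
        self.paths = sorted(glob.glob(f'{folder}/*.jpg'))
        self.to_tensor = transforms.Compose([
            transforms.Resize(HIGH_RES),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        return self.to_tensor(Image.open(self.paths[i]).convert('RGB'))


device = torch.device('cuda')
model = EncoderDecoder().eval().to(device)
model.load_state_dict(torch.load('weights.pth', map_location=device))

loader = DataLoader(ImageFolderDataset(), batch_size=1, num_workers=4, pin_memory=True)
with torch.no_grad():
    for batch in loader:
        output = model(batch.to(device, non_blocking=True))
        # boxes = postprocess(output)  # bounding-box postprocessing, not shown
```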
Current Behavior
This does not happen when using TorchServe. I had to customize a few things to allow it to run (in theory) pure GPU processes, i.e. the input is the already-processed image as a bytearray representing a numpy array, and the output is the raw bytes of the numpy array output. To achieve this, I archived the model by running:
where `handler.py` is:

I can then start TorchServe with:
where `config.properties` is edited to specify only one GPU and large input files:

Finally, I call the model on the images using the following script:
This results in alternating GPU utilization, and the inference takes 30s (almost twice as slow as above) just for the GPU part (I do all the CPU pre/postprocessing in different loops):
Final remarks
In earlier versions I tried some "less custom" handlers, namely transforming the numpy output into a list of floats, which is then handled by TorchServe, and passing the image bytes as input by subclassing VisionHandler. But:

1) For the output that was a bad idea and I changed it soon, because the numpy -> list -> string transformations were extremely slow for the large output I had (around 10k rows and 15 columns); see the rough comparison below.
2) For the image input, that handed the JPEG decoding + creation of the numpy array to TorchServe, and given that everything happens sequentially, that might be a bad idea.

In this version, instead, everything is passed as low-level as possible, which does result in 32MB (= 329625444 bytes) of string being sent via POST, but given it's all localhost there's no network issue. The output is also rather large (around 1MB, varying from image to image), but with the low-level encoding over localhost that shouldn't be an issue either.
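To make point 1) concrete, here is a rough comparison of the two ways of returning a large prediction (the sizes are illustrative, matching the ~10k x 15 output mentioned above; the raw-bytes route is the one used in this version):

```python
# Compare the two ways of returning a large prediction discussed above:
# numpy -> list -> JSON string vs. raw numpy bytes. Sizes are illustrative.
import json
import time

import numpy as np

output = np.random.randn(10000, 15).astype(np.float32)

start = time.perf_counter()
as_json = json.dumps(output.tolist())
print(f'tolist + json.dumps: {time.perf_counter() - start:.3f}s, {len(as_json)} bytes')

start = time.perf_counter()
as_bytes = output.tobytes()
print(f'tobytes:             {time.perf_counter() - start:.6f}s, {len(as_bytes)} bytes')
```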
Summing up
Is TorchServe just slower (e.g. due to the large file size), or have I done something stupid above? Thanks in advance for any hint!