pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

TorchServe How to Curl Multiple Images Properly #1692

Open Hegelim opened 2 years ago

Hegelim commented 2 years ago

I am using TorchServe to potentially serve a model from MMOCR (https://github.com/open-mmlab/mmocr), and I have several questions:

  1. I tried to run inference on hundreds of images at once in batch mode by chaining curl commands with &, as suggested here https://github.com/pytorch/serve/issues/1235#issuecomment-938231201. However, this isn't a neat solution when hundreds of curls are chained. I can of course have a super long command that looks like

    curl -X POST http://localhost:8080/predictions/ABINet -T image1.png & curl -X POST http://localhost:8080/predictions/ABINet -T image2.png & curl -X POST http://localhost:8080/predictions/ABINet -T image3.png & curl -X POST http://localhost:8080/predictions/ABINet -T image4.png &... 

    But I don't think this is the right way to go. My questions are: is using & really parallel? What is a good/suggested way to do inference on hundreds of images? What is a Pythonic way to do this (maybe using requests/subprocess)?

  2. I used a config.properties file that looks like the one below

    inference_address=http://127.0.0.1:8080
    management_address=http://127.0.0.1:8081
    metrics_address=http://127.0.0.1:8082
    load_models=ABINet.mar
    models={\
    "ABINet": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "ABINet.mar",\
        "runtime": "python",\
        "minWorkers": 1,\
        "maxWorkers": 8,\
        "batchSize": 200,\
        "maxBatchDelay": 50,\
        "responseTimeout": 120,\
        "max_request_size": 65535000\
    }\
    }\
    }

    I noticed that each time I do inference (using curl -X POST http://localhost:8080/predictions/ABINet -T image1.png & curl -X POST http://localhost:8080/predictions/ABINet -T image2.png &... hundreds of times concatenated), the GPU usage increases and the memory is not released after the inference is done.

For example, if I want to do inference on 300 images with a config.properties that looks like

inference_address=http://127.0.0.1:8080
management_address=http://127.0.0.1:8081
metrics_address=http://127.0.0.1:8082
load_models=ABINet.mar
models={\
  "ABINet": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "ABINet.mar",\
        "runtime": "python",\
        "minWorkers": 4,\
        "maxWorkers": 8,\
        "batchSize": 600,\
        "maxBatchDelay": 50,\
        "responseTimeout": 120,\
        "max_request_size": 65535000\
    }\
  }\
}

Using gpustat: after I start TorchServe, before I run the first inference, the GPU usage looks like

[screenshot: gpustat output before the first inference]

After running the inference the 1st time, the GPU usage looks like

[screenshot: gpustat output after the first inference]

After running the inference the 2nd time,

[screenshot: gpustat output after the second inference]

So if I run this inference on hundreds of images 3 times in a row, it breaks with an error like

{
  "code": 503,
  "type": "ServiceUnavailableException",
  "message": "Model \"ABINet\" has no worker to serve inference request. Please use scale workers API to add workers."
}

Now, I tried registering the model with initial_workers as suggested here https://github.com/pytorch/serve/issues/29, but with no luck. My questions are:

1. How to set this config.properties properly to handle this situation? How would I know what to set for batchSize and maxBatchDelay?
2. How to allow TorchServe to release memory after one inference? Is there something similar to gc.collect() or torch.cuda.reset_peak_memory_stats(device=None)?
3. How does TorchServe work under the hood? If I send a request with hundreds of images, say 600, will TorchServe take them all in, or only whatever portion it can handle? Or will it automatically partition the request (say, take 300 the first time, then take the remaining 300)?
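
For reference, the registration with initial_workers that I tried, plus the scale-workers call the 503 message points to, look roughly like this (a sketch against the management API on port 8081; the values are illustrative):

import requests

MANAGEMENT = "http://127.0.0.1:8081"

# Register the model with initial workers and server-side batching parameters
# (ABINet.mar must already be in the model store).
r = requests.post(
    f"{MANAGEMENT}/models",
    params={
        "url": "ABINet.mar",
        "initial_workers": 4,
        "batch_size": 8,
        "max_batch_delay": 50,
        "synchronous": "true",
    },
)
print(r.status_code, r.text)

# Or scale workers for an already-registered model, as the 503 message suggests.
r = requests.put(f"{MANAGEMENT}/models/ABINet", params={"min_worker": 4, "synchronous": "true"})
print(r.status_code, r.text)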

I am attaching the MMOCR custom handler for reference

# Imports for the handler (module paths may vary with the MMOCR/TorchServe versions in use)
import base64
import os

import mmcv
import torch
from mmocr.apis import init_detector, model_inference
from ts.torch_handler.base_handler import BaseHandler


class MMOCRHandler(BaseHandler):
    threshold = 0.5

    def initialize(self, context):
        properties = context.system_properties
        self.map_location = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.device = torch.device(
            self.map_location + ':' + str(properties.get('gpu_id'))
            if torch.cuda.is_available() else self.map_location)
        self.manifest = context.manifest

        model_dir = properties.get('model_dir')
        serialized_file = self.manifest['model']['serializedFile']
        checkpoint = os.path.join(model_dir, serialized_file)
        self.config_file = os.path.join(model_dir, 'config.py')

        self.model = init_detector(self.config_file, checkpoint, self.device)
        self.initialized = True

    def preprocess(self, data):
        # `data` holds one entry per request in the server-side batch
        images = []
        for row in data:
            image = row.get('data') or row.get('body')
            if isinstance(image, str):
                image = base64.b64decode(image)
            image = mmcv.imfrombytes(image)
            images.append(image)

        return images

    def inference(self, data, *args, **kwargs):
        # Run the whole batch through MMOCR in a single call
        results = model_inference(self.model, data, batch_mode=True)
        return results

    def postprocess(self, data):
        # Format output following the example OCRHandler format
        return data

This is driving me nuts. Any help is appreciated.

jack-gits commented 2 years ago

great questions. follow up.

msaroufim commented 2 years ago

Unfortunately, requests is a synchronous library. There are some alternatives like grequests or asyncio, with the latter growing in popularity, that should resolve this issue: see #1489. I can take a look at producing something in examples as a tutorial for others, since I've seen people get bitten by this a few times.
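
For illustration, here is a minimal sketch of parallelizing plain requests calls with a thread pool (endpoint and image names follow your curl example; this is not an official example):

import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/predictions/ABINet"

def predict(path):
    # One image per request, like curl -T; TorchServe batches concurrent requests server-side
    with open(path, "rb") as f:
        return requests.post(URL, data=f).text

paths = [f"image{i}.png" for i in range(1, 301)]  # placeholder file names
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(predict, paths))

print(len(results), "responses")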

> How to set this config.properties properly to handle this situation? How would I know what to set for batchSize and maxBatchDelay?

I don't think this is the root cause of the issue you're seeing. General suggestions for how to set these are now here: #1699
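
As a quick sanity check (separate from #1699), the describe-model endpoint on the management port reports the batching settings a running model actually picked up. A minimal sketch:

import requests

# The response includes batchSize, maxBatchDelay, min/max workers and the worker list,
# so you can confirm which values from config.properties were applied.
print(requests.get("http://127.0.0.1:8081/models/ABINet").json())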

> How to allow TorchServe to release memory after one inference? Is there something similar to gc.collect() or torch.cuda.reset_peak_memory_stats(device=None)?

This hasn't been needed for any of the models we support so far, so my suggestion is to first see whether the issue goes away with an asyncio client instead of curl. The model you're using could be doing some strange allocations, and it would take more time to debug that.
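
If you want to experiment with that anyway, something like the hypothetical helper below, called at the end of the handler's postprocess(), can help rule out allocator caching. This is a debugging sketch, not something TorchServe requires or ships:

import gc
import torch

def release_cached_gpu_memory():
    # Debugging only: empty_cache() returns cached-but-unused CUDA blocks to the
    # driver so gpustat/nvidia-smi show a lower number; it does not fix a real
    # leak and adds overhead on every request.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()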

> How does TorchServe work under the hood? If I send a request with hundreds of images, say 600, will TorchServe take them all in, or only whatever portion it can handle? Or will it automatically partition the request (say, take 300 the first time, then take the remaining 300)?

TorchServe has a number of workers, each of which can take some number of requests depending on the batch size and max batch delay. When you make a request to TorchServe, it gets added to a queue that is then popped by the next available worker in a round-robin fashion.
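
Conceptually it behaves something like the sketch below; this is Python pseudocode for illustration only (the real frontend is Java, and handle/reply here are hypothetical stand-ins):

import queue
import time

def worker_loop(request_queue, handler, batch_size, max_batch_delay_ms):
    # Illustration of the queueing behaviour described above, not TorchServe's actual code.
    while True:
        batch = [request_queue.get()]            # block until at least one request arrives
        deadline = time.time() + max_batch_delay_ms / 1000.0
        while len(batch) < batch_size:           # fill the batch up to batchSize...
            remaining = deadline - time.time()
            if remaining <= 0:                   # ...or until maxBatchDelay expires
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        responses = handler.handle(batch)        # the handler sees the whole batch at once
        for request, response in zip(batch, responses):
            request.reply(response)              # e.g. 600 requests with batchSize=200 -> several batches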

Hegelim commented 2 years ago

Thank you so much for the comment! I tried rewriting my code using asyncio and aiohttp; the Python file looks like the one below

import aiohttp
import asyncio
import time

start_time = time.time()

async def get_res(session, url, image):
    with open(image, "rb") as f:
        async with session.post(url, data={"data": f}) as resp:
            res = await resp.text()
            return res

async def main():

    connector = aiohttp.TCPConnector(limit=1000)

    async with aiohttp.ClientSession(connector=connector) as session:

        tasks = []
        url = 'http://localhost:8080/predictions/ABINet'
        for i in range(611):
            image = f"images/forms/{i}.png"
            tasks.append(asyncio.ensure_future(get_res(session, url, image)))

        # gather returns the response bodies (prediction texts) in request order
        responses = await asyncio.gather(*tasks)
        for res in responses:
            print(res)

asyncio.run(main())
print(f"Time: {time.time() - start_time}")

However, the GPU usage issue remains: each run of this Python file increases GPU memory usage, and by the end of the 2nd run my GPU is already full. Am I doing this the right way?

msaroufim commented 2 years ago

I need to take a closer look at how asyncio works (I haven't used it much); I suspect a request is not freeing the resources it grabs. Typically, for these quick experiments the team uses Postman: https://github.com/pytorch/serve/tree/master/test#adding-tests

That said, I've been meaning to come up with a good asyncio example for a while now, so I will let you know when we prioritize it.

anishchhaparwal commented 2 years ago

Facing the same issue as @Hegelim: GPU memory usage keeps increasing after each inference batch.

TzeSing commented 1 year ago

I encountered the same problem; the memory is not released after the requests complete.

mmeendez8 commented 1 year ago

Same here. I created my own script with asyncio and observed a similar effect.

The GPU memory remains occupied even after processing all the requests, posing a challenge in accurately measuring GPU utilization and determining the optimal batch size and timeout configuration.

I am going to try Apache Bench now, as pointed out in the Model Server Benchmarking docs, since it provides some good plots that might be useful.

PS: Before seeing this thread I assumed the problem was related to how torch handles its caching allocator: https://pytorch.org/docs/stable/notes/cuda.html#memory-management
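
On that note, these per-process counters can help tell memory held by live tensors apart from memory PyTorch is merely caching; a small sketch (it has to run inside the worker process, e.g. from the handler):

import torch

def log_cuda_memory(tag=""):
    # memory_allocated: bytes currently held by live tensors
    # memory_reserved:  bytes held by the caching allocator, roughly what
    #                   gpustat/nvidia-smi attribute to the worker process
    #                   (plus the fixed CUDA context overhead)
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**2
        reserved = torch.cuda.memory_reserved() / 1024**2
        print(f"[{tag}] allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")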

I attach the script I was using in case it can be useful. It uses asyncio and httpx.

import asyncio

from httpx import AsyncClient

# IMAGE (the image payload, e.g. a base64-encoded string) is assumed to be
# defined elsewhere; its definition was not included in the post.

async def inference_frames(batch_size: int, url: str):

    async with AsyncClient() as client:

        tasks = []

        for _ in range(batch_size):
            payload = dict(data=IMAGE)
            tasks.append(client.post(url=url, json=payload))

        return await asyncio.gather(*tasks)

# batch_size = number of requests we send concurrently
results = asyncio.run(inference_frames(batch_size=10, url='http://localhost:8080/predictions/ABINet'))

OmniaZayed commented 4 months ago

Hi, I wonder if there are any updates on this. I am facing the same problem when sending a large request to the served model: the GPU memory is not released after the request(s) complete, leading to a CUDA out-of-memory error (code 507) for subsequent requests.