TorchServe no prediction when input data gets bigger (Backend worker did not respond in given time)

tednaseri commented 2 years ago

🐛 Describe the bug

I am passing JSON data to python-requests. For simplification you can assume the following input:

dic1 = {"main": "this is a main", "categories": "this is a categories"}
count = 2
input_data = [dic1 for i in range(count)]
response = requests.post(url, json=input_data)

Issue: When the count<= 8 --> it is working well. As soon as count>8 --> it stuck and never returns.

as you see the input is just a simple python dictionary and if I set input_data = [dic1 for i in range(10)] the final size of the input would be very small.

I am using:

custom handler
ML trained model is based on simpletransformer
Ubuntu 22.04 and 20.04
GPU (local: RTX 3060, Kubernetes: T4)
I have tested on local machines and Kubernetes. The issue is the same

When the issue shows itself: TorchServe: on the GPU, it is critically dependent on the input data size.

When it works well: TorchServe: On the CPU, it is working well regardless of the size of input. PyTorch without TorchServe: I have tested it on PyTorch it is working well even when I pass input_data = [dic1 for i in range(1000)]

Error logs

2022-08-23 16:34:14,800 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-hardnews_1.0 State change null -> WORKER_STARTED
2022-08-23 16:34:14,804 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2022-08-23 16:34:22,821 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 7931
2022-08-23 16:34:22,821 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-hardnews_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2022-08-23 16:34:48,850 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 1093
2022-08-23 16:34:48,852 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.job.Job - Waiting time ns: 239159, Backend time ns: 1094847821
2022-08-23 16:34:53,317 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 156
2022-08-23 16:34:53,318 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.job.Job - Waiting time ns: 144077, Backend time ns: 157878185
2022-08-23 16:35:01,126 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 224
2022-08-23 16:35:01,127 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.job.Job - Waiting time ns: 140180, Backend time ns: 225271057
2022-08-23 16:35:38,326 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 30000
2022-08-23 16:35:38,326 [ERROR] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Number or consecutive unsuccessful inference 1
2022-08-23 16:35:38,327 [ERROR] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Backend worker did not respond in given time
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:198)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
2022-08-23 16:35:38,328 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED
2022-08-23 16:35:38,335 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.job.Job - Waiting time ns: 64717, Inference time ns: 30009467849
2022-08-23 16:35:38,335 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-hardnews_1.0 State change WORKER_MODEL_LOADED -> WORKER_STOPPED
2022-08-23 16:35:38,335 [WARN ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-hardnews_1.0-stderr
2022-08-23 16:35:38,335 [WARN ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-hardnews_1.0-stdout
2022-08-23 16:35:38,336 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.

Installation instructions

I don't use docker for installation

Model Packaing

# Running archiver
torch-model-archiver -f --model-name model \
--version 1.0 \
--serialized-file model_folder/pytorch_model.bin \
--export-path model-store \
--requirements-file requirements.txt \
--extra-files "model_folder/config.json,model_folder/merges.txt,model_folder/model_args.json,model_folder/special_tokens_map.json,model_folder/tokenizer.json,model_folder/tokenizer_config.json,model_folder/training_args.bin,model_folder/vocab.json" \
--handler handler.py

config.properties

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
install_py_dep_per_model=true
NUM_WORKERS=1
number_of_gpu=1
number_of_netty_threads=4
netty_client_threads=1
MKL_NUM_THREADS=1
batch_size=1
max_batch_delay=10
job_queue_size=1000
model_store=/home/model-server/shared/model-store
model_snapshot={"name": "startup.cfg","modelCount": 1,"models": {"news": {"1.0": {"defaultVersion": true,"marName": "news.mar","minWorkers": 1,"maxWorkers": 1,"batchSize": 1,"maxBatchDelay": 10,"responseTimeout": 120}}}}

- I have tried different values for threading, but not helpful

Versions

Environment headers
------------------------------------------------------------------------------------------
Torchserve branch: 

torchserve==0.4.0b20210521
torch-model-archiver==0.4.0b20210521

Python version: 3.8 (64-bit runtime)
Python executable: /home/ted/anaconda3/envs/myland/bin/python3

Versions of relevant python libraries:
captum==0.5.0
future==0.18.2
numpy==1.23.1
psutil==5.9.1
pytest==4.6.11
pytest-forked==1.4.0
pytest-timeout==1.4.2
pytest-xdist==1.34.0
requests==2.28.1
requests-mock==1.9.3
requests-oauthlib==1.3.1
sentencepiece==0.1.95
simpletransformers==0.62.0
torch==1.12.1
torch-model-archiver==0.4.0b20210521
torch-workflow-archiver==0.1.0b20210521
torchaudio==0.12.1
torchserve==0.4.0b20210521
torchvision==0.13.1
transformers==4.20.1
wheel==0.37.1
torch==1.12.1
**Warning: torchtext not present ..
torchvision==0.13.1
torchaudio==0.12.1

Java Version:

OS: Ubuntu 22.04 LTS
GCC version: (Ubuntu 11.2.0-19ubuntu1) 11.2.0
Clang version: N/A
CMake version: N/A

Is CUDA available: Yes
CUDA runtime version: N/A
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU
Nvidia driver version: 515.65.01
cuDNN version: None

Repro instructions

# Running archiver
torch-model-archiver -f --model-name model \
--version 1.0 \
--serialized-file model_folder/pytorch_model.bin \
--export-path model-store \
--requirements-file requirements.txt \
--extra-files "model_folder/config.json,model_folder/merges.txt,model_folder/model_args.json,model_folder/special_tokens_map.json,model_folder/tokenizer.json,model_folder/tokenizer_config.json,model_folder/training_args.bin,model_folder/vocab.json" \
--handler handler.py

torchserve --start --model-store model-store --models model=hardnews --ncs --ts-config config.properties

running prediction

Possible Solution

No response

tednaseri commented 2 years ago

The reason behind sending a list of input data instead of single input: The computational performance
I have increased responseTimeout, but it does not help

agunapal commented 2 years ago

Taking a look. Will get back to you

agunapal commented 2 years ago

@tednaseri Can you please share how you are pre-processing and running inference on the input data. In another usecase, I have tried sending batch of images (10) as json data and processing them in a single batch and this works. So, I would need more details on your implementation to repro this. For example, it would be great if you can use the HuggingFace transformer example given in the README to modify the custom handler and see if you are able to repro the problem. Thats the example I am going to try.

tednaseri commented 2 years ago

@agunapal Thank you so much for the response. For an easier communication, I have tried to simplify the custom handler while it repros the problem. For this purpose, I imagine that the input data is just a digit, then the handler makes a dummy input as follows:

handler(input_number):
    data = ["sample text" for i in range(input_number)]
    model.predict(data)

Using this handler, it still faces the issue. Here is the prepared handler:

from abc import ABC
import logging
import torch
import transformers
from simpletransformers.classification import ClassificationModel
from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)
logger.info("Transformers version %s", transformers.__version__)

class TransformersCustomHandler(BaseHandler, ABC):

    def __init__(self):
        super(TransformersCustomHandler, self).__init__()
        self.initialized = False

    def initialize(self, context):

        self.context = context
        self.manifest = context.manifest
        properties = context.system_properties
        self.model_folder = properties.get("model_dir")

        if torch.cuda.is_available() and properties.get("gpu_id") is not None:
            self.device = torch.device("cuda:" + str(properties.get("gpu_id")))
            self.use_cuda = True
        else:
            self.device = torch.device("cpu")
            self.use_cuda = False

        self.predictions = []
        self.labels = ['no', 'yes']
        self.model = self.load_model()

        # The following line does not work for simple transformer
        # self.model.to(self.device)
        # self.model.eval()
        self.initialized = True

    def load_model(self):
        model = ClassificationModel('roberta', self.model_folder, use_cuda=self.use_cuda)
        return model

    def predict(self, param):
        self.predictions = []
        count = param[0]["count"].decode("utf-8")
        count = int(count)

        input_text = "sample text"
        data = [input_text for i in range(count)]

        preds, out_results = self.model.predict(data)
        label_lst = [self.labels[i] for i in preds]

        for i in range(len(label_lst)):
            prediction = {"label": label_lst[i]}
            self.predictions.append(prediction)

    def get_predictions(self):
        return self.predictions

    # _service = TransformersCustomHandler()
    def handle(self, data, context):
        try:
            # if not _service.initialized:
            #     _service.initialize(context)
            #
            # if data is None:
            #     return None
            self.predict(data)
            result = [self.get_predictions()]
            return result

        except Exception as e:
            raise e

agunapal commented 2 years ago

@tednaseri I used the below handler and I tried with json payloads of length 1000 with T4 GPU. It works

`from abc import ABC import logging import torch import transformers from transformers import ( AutoModelForSequenceClassification, AutoTokenizer, )

from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(name) logger.info("Transformers version %s", transformers.version)

class TransformersHandler(BaseHandler, ABC): """ Transformers handler class for sequence classification. """

def __init__(self):
    super(TransformersHandler, self).__init__()
    self.initialized = False

def initialize(self, ctx):
    """In this initialize function, the BERT model is loaded
    Args:
        ctx (context): It is a JSON Object containing information
        pertaining to the model artefacts parameters.
    """
    self.manifest = ctx.manifest
    properties = ctx.system_properties
    model_dir = properties.get("model_dir")

    self.device = torch.device(
        "cuda:" + str(properties.get("gpu_id"))
        if torch.cuda.is_available() and properties.get("gpu_id") is not None
        else "cpu"
    )

    self.model = AutoModelForSequenceClassification.from_pretrained(
                model_dir
            )
    self.model.to(self.device)

    self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased",
            do_lower_case=True
        )

    self.model.eval()
    logger.info("Transformer model from path %s loaded successfully", model_dir)

def preprocess(self, requests):
    """Basic text preprocessing, based on the user's chocie of application mode.
    Args:
        requests (str): The Input data in the form of text is passed on to the preprocess
        function.
    Returns:
        list : The preprocess function returns a list of Tensor for the size of the word tokens.
    """
    inputs = None
    for idx, data in enumerate(requests):
        input_text = data.get("data") or data.get("body")
        input_text = input_text["text"]

        inputs = self.tokenizer(input_text, return_tensors="pt")

    return inputs

def inference(self, data, *args, **kwargs):
    """
    The Inference Function is used to make a prediction call on the given input request.
    The user needs to override the inference function to customize it.

    Args:
        data (Torch Tensor): A Torch Tensor is passed to make the Inference Request.
        The shape should match the model input shape.

    Returns:
        Torch Tensor : The Predicted Torch Tensor is returned in this function.
    """

    mask = data['attention_mask'].to(self.device)
    input_id = data['input_ids'].squeeze(1).to(self.device)
    with torch.no_grad():
        results = self.model(input_id, mask)
    return results

def postprocess(self, data):
    result = data.logits.argmax(dim=1)
    result = result.tolist()
    return [result]

`

agunapal commented 2 years ago

Here is the client part ` import requests import json

api = "http://127.0.0.1:8080/predictions/my_tc" headers = {'Content-type': 'application/json', 'Accept': 'text/plain'}

payload = {"text":["Bloomberg has decided to publish a new report on the global economy." for i in range(1000)]}

payload = json.dumps(payload) response = requests.post(api, data=payload, headers=headers)

print(response.content.decode("UTF-8")) `

tednaseri commented 2 years ago

@agunapal Thank you so much for the response. Your test shows that the input sample of 1000 works.

By the way, there are some differences that I cannot get that much from the response.

I am using SimpleTransformer while the test case is based on AutoModelForSequenceClassification.
AutoModelForSequenceClassification accept model.eval() while the SimpleTransformer does not.
I am passing the full text, while the test case passes the tokens.
Could you test any SimpleTransformer model on text classifier? I am wondering if there is full compatibility between the GPU processing of SimpleTransformer and TorchServe. Because:
My sample code works well on CPU
The model is able to process 1000 of text on PyTorch but not in TorchServe.

Maybe I need to switch to fastAPI and manual serving.

tednaseri commented 2 years ago

Hi @agunapal, I have tested another transformer model with the same custom handler, there is no issue there. I think that the issue is an incompatibility between SimpleTransformer and PyTorch. I am wondering, have you ever tested any SimpleTransformer model with PyTorch for text classification?

agunapal commented 2 years ago

@tednaseri I am not sure if this has been tested. If you think SimpleTransformers add value and want to create an example showing the integration, please feel free to create a PR and get feedback.

bennykins commented 8 months ago

@tednaseri I am dealing the same issue right now. Can you share with me what was the solution?

pytorch / serve