pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

The handler only gets the correct value from cuda:0. #1763

Closed YongWookHa closed 2 years ago

YongWookHa commented 2 years ago

🐛 Describe the bug

I am using 2 GPUs. TorchServe inference returns correct values only for predictions run on cuda:0.

Error logs

x = "text to embed"

url = f"http://localhost:9080/predictions/my-model"
x_emb_1 = requests.post(url, data = x).json()
x_emb_2 = requests.post(url, data = x).json()

x_emb_1 != x_emb_2 # True

Installation instructions

docker pull pytorch/torchserve:latest-gpu

Model Packaging

My handler looks like this:

import torch
from pathlib import Path
from ts.torch_handler.base_handler import BaseHandler

from model import MyModel

class MyModel_Handler(BaseHandler):
    def __init__(self):
        pass

    def initialize(self, context):
        # load the model
        self.manifest = context.manifest
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")
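        # NOTE: TorchServe assigns gpu_id per worker via system_properties,
        # so with multiple workers each worker may bind a different GPU.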

        serialized_file = self.manifest['model']['serializedFile']
        model_pt_path = Path(model_dir) / serialized_file
        if not model_pt_path.exists():
            raise RuntimeError("Missing the model.pt file")
        hparams = {
            "load_ckpt": model_pt_path,
            "seq_len": 10,
            "window_size": 128
        }
        self.model = MyModel(hparams).to(self.device)  # from transformers import AutoModel
        self.model = self.model.eval()

        self.initialized = True

    def preprocess(self, data):
        inp = [d.get("body").decode('utf-8') for d in data]
        return inp

    def inference(self, data):
        with torch.no_grad():
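            # `data` is the list of raw strings from preprocess(); MyModel is
            # assumed to tokenize and move tensors to self.device internally.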
            results = self.model(data)
        return results

    def postprocess(self, inference_output):
        return inference_output.tolist()
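As a debugging aid, it may help to log which process and device each worker binds during initialize(); if the alternating responses line up with alternating pids, the divergence is per-worker. A minimal sketch (the logging lines are my addition, not part of the original handler):

import logging
import os

logger = logging.getLogger(__name__)

# Inside initialize(), after self.device is set:
logger.info("worker pid=%d device=%s gpu_id=%s",
            os.getpid(), self.device, properties.get("gpu_id"))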

config.properties

batch_size: 512
max_batch_delay: 100
min_worker: 2
max_worker: 2

Everything else uses the default settings.
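As far as the TorchServe configuration docs describe, config.properties does not take bare batch_size/min_worker keys; per-model settings go through the models JSON property. A sketch of the equivalent entry, assuming a model archive named my-model.mar:

models={\
  "my-model": {\
    "1.0": {\
      "marName": "my-model.mar",\
      "minWorkers": 2,\
      "maxWorkers": 2,\
      "batchSize": 512,\
      "maxBatchDelay": 100\
    }\
  }\
}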

Versions

docker-hub: pytorch/torchserve:0.6.0-gpu

Repro instructions

.

Possible Solution

.

YongWookHa commented 2 years ago

Updated the issue. I thought the handler returned the correct value only on cuda:0, but it turns out that alternating requests return the correct value, in a cycle of two, even though I run the TorchServe container with a single GPU. Here's an example.

import requests

x = 'test text'
url = "http://localhost:9080/predictions/my-model"

emb_x = requests.post(url, data=x).json()

emb_x_1 = requests.post(url, data=x).json()
emb_x == emb_x_1  # False

emb_x_2 = requests.post(url, data=x).json()
emb_x == emb_x_2  # True

emb_x_3 = requests.post(url, data=x).json()
emb_x == emb_x_3  # False

emb_x_4 = requests.post(url, data=x).json()
emb_x == emb_x_4  # True
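To confirm whether the cycle of two maps to two workers, one option is to query the management API (assuming the default management port 8081) and then scale the model to a single worker; if the mismatch disappears, the two workers were serving different model states:

import requests

# List the model's workers and the device each one is bound to
print(requests.get("http://localhost:8081/models/my-model").json())

# Scale down to one worker and repeat the comparison above
requests.put("http://localhost:8081/models/my-model",
             params={"min_worker": 1, "max_worker": 1})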

agunapal commented 2 years ago

@YongWookHa Could you please clarify whether your setup is single-GPU or multi-GPU? Also, could you please share some details on the model (e.g., an example open-source model) so I can reproduce it?

agunapal commented 2 years ago

Closing since there is no follow-up. Please re-open when you get a chance.