triton-inference-server / pytriton

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.
https://triton-inference-server.github.io/pytriton/
Apache License 2.0

Error deploying model on Vertex AI #45

Closed. sricke closed this issue 3 months ago

sricke commented 7 months ago

Description

Hi! I'm trying to deploy a Stable Diffusion model on GCP Vertex AI using the PyTriton backend. My code works on a local machine, and I've been able to send requests and receive inference responses.
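For reference, a local request looks roughly like this (a sketch using PyTriton's ModelClient; the port and base64 image encoding match the server code below, while the prompt and image path are placeholders):

# hypothetical local client test
import base64

import numpy as np
from pytriton.client import ModelClient

# batch of one prompt and one base64-encoded init image, shape (1, 1)
prompt = np.char.encode(np.array([["a photo of an astronaut"]]), "utf-8")
with open("init.png", "rb") as f:
    init_image = np.char.encode(
        np.array([[base64.b64encode(f.read()).decode("utf-8")]]), "utf-8"
    )

with ModelClient("localhost:8015", "StableDiffusion_Img2Img") as client:
    result = client.infer_batch(prompt=prompt, init_image=init_image)
    print(result["image"].shape)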

The problem appears when I try to create an endpoint in Vertex AI. The server fails to start with the error:

WARNING - pytriton.server.triton_server: Triton Inference Server exited with failure. Please wait.

And then:

failed to start Vertex AI service: Invalid argument - Expect the model repository contains only a single model if default model is not specified
...
raise PyTritonClientTimeoutError("Waiting for server to be ready timed out.")

I don't know whether the Vertex AI service error is caused by the server crashing first, or vice versa.

To reproduce

Attaching my server code

# server
import argparse
import logging
import os
import time
from urllib.parse import urlparse

import numpy as np
import torch
from model import ModelWrapper
from pytriton.decorators import batch
from pytriton.model_config import DynamicBatcher, ModelConfig, Tensor
from pytriton.triton import Triton, TritonConfig

# NOTE: LOGGER, DEVICE, DTYPE, PORT and the helpers parse_path, download_blob
# and _decode_img are defined elsewhere in the original script (see the sketch
# after this snippet).

class _InferFuncWrapper:
    """
    Class wrapper of inference func for triton. Used to also store the model variable
    """

    def __init__(self, model: torch.nn.Module):
        self._model = model

    @batch
    def __call__(self, **inputs) -> dict:
        """
        Main inference function for the Triton backend, called for each batch of requests.
        Decodes the inputs, calls the model and returns the outputs.

        Args:
            prompt: Batch of strings with the user prompts
            init_image: Batch of base64-encoded initial images to run the diffusion on

        Returns:
            image: Batch of generated images
        """
        prompts, init_images = inputs["prompt"], inputs["init_image"]
        # decode prompts and images
        prompts = [np.char.decode(p.astype("bytes"), "utf-8").item() for p in prompts]
        init_images = [
            np.char.decode(enc_img.astype("bytes"), "utf-8").item()
            for enc_img in init_images
        ]
        init_images = [_decode_img(enc_img) for enc_img in init_images]
        # transform image arrays to tensors and adjust dims to torch usage
        images_tensors = torch.tensor(init_images, dtype=torch.float32).permute(
            0, 3, 1, 2
        )
        LOGGER.debug(f"Prompts: {prompts}")
        LOGGER.debug(f"{len(init_images)} images size: {init_images[0].shape}")
        LOGGER.info("Generating images...")
        # call diffusion model
        outputs = self._model.run(prompts, images_tensors)
        LOGGER.debug(f"Prepared batch response of size: {len(outputs)}")
        return {"image": np.array(outputs)}

def _parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--verbose",
        "-v",
        action="store_true",
        help="Enable verbose logging in debug mode.",
        default=True,
    )

    parser.add_argument(
        "--vertex",
        "-s",
        action="store_true",
        help="Enable copying model files from storage for vertex deployment",
        default=False,
    )

    return parser.parse_args()

def main():
    """Initialize server with model."""
    args = _parse_args()

    # initialize logging
    log_level = logging.DEBUG if args.verbose else logging.INFO
    logging.basicConfig(
        level=log_level, format="%(asctime)s - %(levelname)s - %(name)s: %(message)s"
    )

    if args.vertex:
        LOGGER.debug("Vertex: Loading pipeline from Vertex Storage")
        storage_path = os.environ["AIP_STORAGE_URI"]
    else:
        LOGGER.debug("Loading pipeline locally")
        storage_path = ("") # Path to local files

    bucket_name, subdirectory = parse_path(storage_path)
    LOGGER.debug(f"Downloading files... Started at: {time.strftime('%X')}")
    download_blob(bucket_name, subdirectory)
    LOGGER.debug(f"Files downloaded! Finished at: {time.strftime('%X')}")
    folder_path = os.path.join("src", subdirectory)

    LOGGER.debug(f"Running on device: {DEVICE}, dtype: {DTYPE}, triton_port:{PORT}")
    LOGGER.info("Loading pipeline...")
    model = ModelWrapper(logger=LOGGER, folder_path=folder_path)
    LOGGER.info("Pipeline loaded!")

    log_verbose = 1 if args.verbose else 0

    config = TritonConfig(http_port=8015, exit_on_error=True, log_verbose=log_verbose)

    with Triton(config=config) as triton:
        # bind the model with its inference call and configuration
        triton.bind(
            model_name="StableDiffusion_Img2Img",
            infer_func=_InferFuncWrapper(model=model),
            inputs=[
                Tensor(name="prompt", dtype=np.bytes_, shape=(1,)),
                Tensor(name="init_image", dtype=np.bytes_, shape=(1,)),
            ],
            outputs=[
                Tensor(name="image", dtype=np.bytes_, shape=(1,)),
            ],
            config=ModelConfig(
                max_batch_size=4,
                batcher=DynamicBatcher(
                    max_queue_delay_microseconds=100,
                ),
            ),
            strict=True,
        )
        # serve the model for inference
        triton.serve()

if __name__ == "__main__":
    main()
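The helpers parse_path, download_blob and _decode_img referenced above are not shown in the snippet. A minimal sketch of what they might look like, assuming a gs://bucket/subdir storage path, blobs downloaded under ./src, and base64-encoded PNG images (the actual implementations may differ):

# hypothetical sketch of the elided helpers
import base64
import io
import os
from urllib.parse import urlparse

import numpy as np
from google.cloud import storage
from PIL import Image

def parse_path(storage_path: str):
    """Split a gs://bucket/subdir path into (bucket_name, subdirectory)."""
    parsed = urlparse(storage_path)
    return parsed.netloc, parsed.path.lstrip("/")

def download_blob(bucket_name: str, subdirectory: str, dst_root: str = "src"):
    """Download every blob under the subdirectory into dst_root/<blob name>."""
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=subdirectory):
        if blob.name.endswith("/"):  # skip folder placeholders
            continue
        destination = os.path.join(dst_root, blob.name)
        os.makedirs(os.path.dirname(destination), exist_ok=True)
        blob.download_to_filename(destination)

def _decode_img(encoded_image: str) -> np.ndarray:
    """Decode a base64 string into an HWC float32 image array in [0, 1]."""
    image = Image.open(io.BytesIO(base64.b64decode(encoded_image))).convert("RGB")
    return np.asarray(image, dtype=np.float32) / 255.0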

When creating the Vertex AI endpoint, the server predict route is configured as /v2/models/StableDiffusion_Img2Img/infer and the health route as /v2/health/live.

The Vertex AI port is 8015, the same as the HTTP port set in the Triton configuration.
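A quick way to sanity-check these two routes against the locally running server (a sketch using the standard KServe v2 HTTP protocol; the prompt and image path are placeholders):

# hypothetical route check against the local server
import base64

import requests

BASE = "http://localhost:8015"

# health route configured in Vertex AI
assert requests.get(f"{BASE}/v2/health/live").status_code == 200

# predict route configured in Vertex AI
with open("init.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "inputs": [
        {"name": "prompt", "shape": [1, 1], "datatype": "BYTES", "data": ["a photo of an astronaut"]},
        {"name": "init_image", "shape": [1, 1], "datatype": "BYTES", "data": [image_b64]},
    ]
}
response = requests.post(f"{BASE}/v2/models/StableDiffusion_Img2Img/infer", json=payload)
print(response.status_code, response.json()["outputs"][0]["shape"])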

Observed results and expected behavior

As stated, the server runs on a local machine but fails to initialize the endpoint in Vertex AI. During the Vertex AI build, the files are downloaded correctly and the model pipeline is loaded, so the error probably occurs around the triton.bind() call. Attaching the complete log output:

DEBUG - StableDiffusion_Img2Img.server: Files downloaded! Finished at: 18:28:01
DEBUG - StableDiffusion_Img2Img.server: Running on device: cuda, dtype: torch.float16, triton_port:8015
INFO - StableDiffusion_Img2Img.server: Loading pipeline..
INFO - StableDiffusion_Img2Img.server: Pipeline loaded!

...
2023-11-23 18:29:10,322 - DEBUG - pytriton.triton: Triton Inference Server binaries ready in /root/.cache/pytriton/workspace_y7vpgv3x/tritonserver
2023-11-23 18:29:10,322 - DEBUG - pytriton.utils.distribution: Obtained pytriton module path: /usr/local/lib/python3.10/dist-packages/pytriton
 2023-11-23 18:29:10,323 - DEBUG - pytriton.utils.distribution: Obtained nvidia_pytriton.libs path: /usr/local/lib/python3.10/dist-packages/nvidia_pytriton.libs
2023-11-23 18:29:10,323 - DEBUG - pytriton.client.client: Creating InferenceServerClient for http://127.0.0.1:8015 with {'network_timeout': 60.0, 'connection_timeout': 60.0}
2023-11-23 18:29:10,323 - DEBUG - pytriton.client.client: Creating InferenceServerClient for http://127.0.0.1:8015 with {'network_timeout': 60.0, 'connection_timeout': 60.0}
2023-11-23 18:29:10,323 - DEBUG - pytriton.triton: Starting Triton Inference
2023-11-23 18:29:10,324 - DEBUG - pytriton.server.triton_server: Triton Server binary /root/.cache/pytriton/workspace_y7vpgv3x/tritonserver/bin/tritonserver. Environment:
{
...
}
2023-11-23 18:29:10,449 - DEBUG - pytriton.client.utils: Waiting for server to be ready (timeout=119.99996042251587)
2023-11-23 18:29:12,954 - WARNING - pytriton.server.triton_server: Triton Inference Server exited with failure. Please wait
2023-11-23 18:29:12,954 - DEBUG - pytriton.server.triton_server: Triton Inference Server exit code 1
2023-11-23 18:29:12,954 - DEBUG - pytriton.triton: Got callback that tritonserver process finished
2023-11-23 15:31:10.655 Traceback (most recent call last):
2023-11-23 15:31:10.655 File "/home/app/src/server.py", line 200, in <module>
2023-11-23 18:31:10,655 - DEBUG - pytriton.triton: Cleaning model manager, tensor store and workspace.
failed to start Vertex AI service: Invalid argument - Expect the model repository contains only a single model if default model is not specified
2023-11-23 18:31:10,655 - DEBUG - pytriton.utils.workspace: Cleaning workspace dir /root/.cache/pytriton/workspace_y7vpgv3x
raise PyTritonClientTimeoutError("Waiting for server to be ready timed out.")
pytriton.client.exceptions.PyTritonClientTimeoutError: Waiting for server to be ready timed out.

Additional steps taken

Based on the timeout error raised by PyTriton, we've tried increasing the timeout by setting monitoring_period_s in server.run() to an arbitrarily high value.

We've also tried adapting the server configuration to Vertex AI with:

TritonConfig(http_port=8015, exit_on_error=True, log_verbose=log_verbose, allow_vertex_ai=True, vertex_ai_port=8080)

But we get the same error.

Environment

Docker base image: nvcr.io/nvidia/pytorch:23.10-py3

Requirements:

torch @ https://download.pytorch.org/whl/cu116/torch-1.12.1%2Bcu116-cp310-cp310-linux_x86_64.whl
diffusers==0.7.2
transformers==4.21.3
ftfy==6.1.1
importlib-metadata==4.13.0
nvidia-pytriton==0.4.1
Pillow==9.5
google-cloud-storage==2.10.0

Any help is appreciated!!

jkosek commented 7 months ago

Hi @sricke, could you share a guide on how the container is deployed inside Vertex AI?

sricke commented 7 months ago

Sure. Hope this helps.

This guide looks into uploading a custom docker container to create a Vertex AI Model instance

This guide looks into serving previously created model to a Vertex AI Endpoint

This guide looks into Serving Predictions with NVIDIA Triton

So the steps we're taking are:

1. Create the PyTriton Docker image and push it to Artifact Registry:

Dockerfile:

FROM nvcr.io/nvidia/pytorch:23.10-py3

# Requirements are installed here to ensure they will be cached.
COPY requirements.txt requirements.txt 

# Set environment variables
ENV PYTHONUNBUFFERED=0
ENV MODEL_NAME=${MODEL_NAME}
ENV TRITON_PORT=${TRITON_PORT}

RUN pip install -r requirements.txt

COPY model.py /home/app/src/model.py
COPY server.py /home/app/src/server.py

WORKDIR /home/app/

CMD python3 src/server.py --vertex

# build docker image
docker build -t pytriton <path to Dockerfile folder>

# configure authentication to the Artifact Registry repo
gcloud auth configure-docker $REGION-docker.pkg.dev --quiet

IMAGE_URI=$REGION-docker.pkg.dev/$PROJECT_ID/$DOCKER_ARTIFACT_REPO/pytriton

# Tag and upload model docker image
docker tag pytriton $IMAGE_URI
docker push $IMAGE_URI

2. Create a Vertex AI Model instance using the previously created Artifact Registry image (screenshots omitted).

3. Create a Vertex AI Endpoint and deploy the Model instance to it (screenshots omitted). A Python SDK sketch of these two steps follows below.
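For reference, steps 2 and 3 can also be scripted with the Vertex AI Python SDK instead of the console. A rough sketch (project, region, display names and machine/GPU types are assumptions):

# hypothetical equivalent of steps 2 and 3 with google-cloud-aiplatform
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Step 2: create the Model resource from the Artifact Registry image
model = aiplatform.Model.upload(
    display_name="stable-diffusion-img2img",
    serving_container_image_uri="us-central1-docker.pkg.dev/my-project/my-repo/pytriton",
    serving_container_predict_route="/v2/models/StableDiffusion_Img2Img/infer",
    serving_container_health_route="/v2/health/live",
    serving_container_ports=[8015],
)

# Step 3: create an Endpoint and deploy the Model to it
endpoint = aiplatform.Endpoint.create(display_name="stable-diffusion-endpoint")
model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)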

jkosek commented 7 months ago

Thanks @sricke! Let us review that and get back to you.

jkosek commented 7 months ago

Hi @sricke. The PyTriton path for deploying a model would be to use a custom container, as described here: https://cloud.google.com/vertex-ai/docs/predictions/use-custom-container

Could you also remove any flags related to Vertex AI from TritonConfig and provide the execution log? They should not be necessary and may even cause problems like this error:

failed to start Vertex AI service: Invalid argument - Expect the model repository contains only a single model if default model is not specified

I believe the model files you are providing from Cloud Storage are read inside model.py? The model repository required for a pure Triton-based deployment is not needed here.

sricke commented 7 months ago

@jkosek the route I'm using for the Vertex AI predict is /v2/models/StableDiffusion_Img2Img/infer and for the health check is /v2/health/live.

Yes, I have a model.py file that basically loads a Stable Diffusion pipeline from the model files and weights downloaded from Cloud Storage. We've also tried a standard YOLO model and ended up with the same errors.

If I remove the Vertex AI flags from TritonConfig, the model loads correctly, but Vertex AI sends a series of health checks that return error 400 and eventually shuts down the server. Attaching logs:

I1117 19:51:48.680199 138 vertex_ai_server.cc:350] Started Vertex AI HTTPService at 0.0.0.0:8015
I1117 19:51:48.721858 138 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
I1117 19:51:49.489155 138 vertex_ai_server.cc:108] Vertex AI request: 0 /v2/health/ready
I1117 19:51:49.489196 138 vertex_ai_server.cc:227] Vertex AI error: 0 /v2/health/ready - 400
I1117 19:51:49.489566 138 vertex_ai_server.cc:108] Vertex AI request: 0 /v2/health/ready
I1117 19:51:49.489584 138 vertex_ai_server.cc:227] Vertex AI error: 0 /v2/health/ready - 400
...
This is repeated for 2 minutes 
...
I1117 19:53:44.693555 138 vertex_ai_server.cc:108] Vertex AI request: 0 /v2/health/ready
I1117 19:53:44.693587 138 vertex_ai_server.cc:227] Vertex AI error: 0 /v2/health/ready - 400
2023-11-17 19:53:46,696 - DEBUG - pytriton.triton: Stopping Triton Inference server and proxy backends
2023-11-17 19:53:46,696 - DEBUG - pytriton.server.triton_server: Stopping Triton Inference server - sending SIGINT signal and wait 30s
2023-11-17 19:53:46,696 - DEBUG - pytriton.server.triton_server: Waiting for process to stop
2023-11-17 19:53:49,766 - DEBUG - pytriton.server.triton_server: Triton Inference Server stopped
...
2023-11-17 16:53:50.036 raise PyTritonClientTimeoutError("Waiting for server to be ready timed out.")
2023-11-17 16:53:50.036 pytriton.client.exceptions.PyTritonClientTimeoutError: Waiting for server to be ready timed out.

jkosek commented 7 months ago

@sricke thanks for the information and your patience. I was able to reproduce the first reported error locally:

failed to start Vertex AI service: Invalid argument - Expect the model repository contains only a single model if default model is not specified

The issue we are seeing might be caused by some internal behavior of PyTriton in version >=0.4.0.

Could you please try using PyTriton 0.3.1:

pip install "nvidia-pytriton==0.3.1"

And the following TritonConfig:

TritonConfig(exit_on_error=True, log_verbose=log_verbose, allow_vertex_ai=True, vertex_ai_port=8015)

Let me know if that helped.

jkosek commented 7 months ago

@sricke I was wondering if the suggestions you received were helpful?

sricke commented 7 months ago

@jkosek I tried the configuration you mentioned, and while the model loads correctly, it eventually raises an error.

Attaching logs:

I1130 19:14:31.664455 143 vertex_ai_server.cc:108] Vertex AI request: 0 /v2/health/live
I1130 19:14:34.753330 143 vertex_ai_server.cc:108] Vertex AI request: 0 /v2/health/live
I1130 19:14:44.753448 143 vertex_ai_server.cc:108] Vertex AI request: 0 /v2/health/live
2023-11-30 16:14:50.023 Signal (2) received.
I1130 19:14:49.013554 143 server.cc:305] Waiting for in-flight requests to complete.
I1130 19:14:49.013582 143 server.cc:321] Timeout 30: Found 0 model versions that have in-flight inferences
I1130 19:14:49.013727 143 server.cc:336] All models are stopped, unloading models
I1130 19:14:49.013742 143 server.cc:343] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
I1130 19:14:49.013748 143 server.cc:350] StableDiffusion_Img2Img v1: UNLOADING
I1130 19:14:49.013830 143 backend_model_instance.cc:828] Stopping backend thread for StableDiffusion_Img2Img_0...
I1130 19:14:49.013914 143 python_be.cc:2248] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I1130 19:14:50.013939 143 server.cc:343] Timeout 29: Found 1 live models and 0 in-flight non-inference requests
I1130 19:14:50.013980 143 server.cc:350] StableDiffusion_Img2Img v1: UNLOADING
I1130 19:14:50.015469 143 model.py:218] Finalizing backend instance.
I1130 19:14:50.015632 143 model.py:219] Cleaning socket and context.
I1130 19:14:50.016192 143 model.py:228] Removing allocated shared memory.
2023-11-30 19:14:52,119 - DEBUG - pytriton.server.triton_server: Triton Inference Server stopped
2023-11-30 19:14:52,119 - DEBUG - pytriton.models.manager: Clean model ('stablediffusion_img2img', 1).
...
pytriton.client.exceptions.PyTritonClientTimeoutError: Waiting for server to be ready timed out.

It seems like it's receiving an interrupt signal. Before that, the liveness checks on /v2/health/live seem to be returning 200, so I don't know why this is happening.

jkosek commented 7 months ago

@sricke would you be able to share the full log in debug mode? That would help me see the whole process from start to failure. Thanks!

jkosek commented 7 months ago

@sricke I was able to reproduce the problem with a simple model. I will keep looking for the root cause. Thanks for your patience.

sricke commented 7 months ago

@jkosek perfect thanks! Let me know if you find the cause.

jkosek commented 7 months ago

For now, what I am seeing is an error while PyTriton queries the Triton server for the model status. This might be caused by the HTTP endpoint being unavailable while Vertex AI support is enabled.

Could you modify the TritonConfig as follows:

TritonConfig(exit_on_error=True, log_verbose=log_verbose, allow_http=True, allow_vertex_ai=True, vertex_ai_port=8015)

Please let me know if that helped. We will also work on a long-term fix related to model loading.

jkosek commented 7 months ago

@sricke any update on running the solution with the allow_http=True flag passed in TritonConfig? This solved the problem in the minimal example I've tested on Vertex AI.

sricke commented 7 months ago

@jkosek sorry for the delayed response. I tried this specific configuration and now it works! Thanks a lot!

jkosek commented 7 months ago

Perfect!

Will keep the issue open until we fix model loading in a future release.

Just to repeat the workaround (WAR): use PyTriton 0.3.1 and pass allow_http=True alongside the Vertex AI flags, i.e. TritonConfig(exit_on_error=True, log_verbose=log_verbose, allow_http=True, allow_vertex_ai=True, vertex_ai_port=8015).

piotrm-nvidia commented 3 months ago

PyTriton 0.5.2 introduced support for Vertex AI. See the example for more details.