triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Interaction of timeouts, ensemble scheduler and oldest sequence scheduler causes state leakage #7117

Open jamied157 opened 7 months ago

jamied157 commented 7 months ago

Description: We're still in the process of diagnosing this issue, but we're seeing some strange behaviour and would like to get some feedback/ideas from others.

We're using a setup that relies on two separate models using the oldest sequence scheduler combined with the ensemble model. Essentially

Input -> Model_1 -> Model_2 -> Output

Where Model_1 and Model_2 use implicit state management with the PyTorch backend: they have an initial state, but then use the output from the model to update the state.

Our issue tends to arise when Triton is under load. If we use up all the candidate sequence slots, don't send END flags on any of them, and instead let Triton time them out, we then see that Model_2 ends up in a bad state. In this state, if we run requests against Model_2, it's possible to leak information across requests and corrupt the outputs of the model. It tends to look something like:

[Image: plot of the model outputs, showing NaN values from one request leaking into the next request.]

Here we're displaying the outputs of the model. We've corrupted one batch with NaN values and used random inputs for the others, yet the NaN values leak into the next request sent to the server (the batches are run sequentially here, so the leaking happens across executions rather than within one).

Given all this, it feels like something isn't quite working correctly with the interaction between the ensemble, the sequence timeout and implicit state management.

Triton Information: What version of Triton are you using?

We've reproduced this on NGC versions 22.12, 23.10, and 24.02.

Are you using the Triton container or did you build it yourself?

We've reproduced it both on containers based on the Triton container and on versions built by us.

To Reproduce: We're working on this; currently we can only reproduce with our own models and configuration, which we can't share. Hopefully we can find a repro with some simpler models. The basic setup is two sequential models using implicit state management with the PyTorch backend, connected together with the ensemble scheduler. Then you need to time out some/all of the candidate sequences to get Model_2 into a bad state. Once Model_2 is in this state, you can fairly reliably see the state corruption reported above.

However, we still can't reliably reproduce this state, so we can't provide much more information.

Expected behavior: We shouldn't be able to affect the outputs of the model with the inputs from a different client.

jamied157 commented 7 months ago

Okay, I have managed to create a reproduction independent of our setup. First, you need to define a torch model like so:

import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, inputs, state):
        print("Executing with; ", inputs, state)
        inputs = inputs + state
        state = inputs + state
        return inputs, state

def main():
    scripted_model = torch.jit.script(Model())
    torch.jit.save(scripted_model, "simple_model.pt")

if __name__ == "__main__":
    main()
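
Because the returned state feeds back into the next step (state = inputs + state), any NaN that gets into a sequence's state persists for the rest of that sequence. As a quick illustrative check (not part of the repro itself, just assuming the simple_model.pt saved by the script above):

import torch

# Load the TorchScript model saved above and run two "sequence" steps.
model = torch.jit.load("simple_model.pt")
out1, state1 = model(torch.full((1, 5), float("nan")), torch.zeros(1, 5))  # NaN input, clean state
out2, state2 = model(torch.zeros(1, 5), state1)                            # clean input, contaminated state
print(out2)  # still all NaN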

Then create two models, simple_model and simple_model_2, with identical model configs:

name: "simple_model"
max_batch_size: 1
input {
  name: "INPUT__0"
  data_type: TYPE_FP16
  dims: 5
}
output {
  name: "OUTPUT__0"
  data_type: TYPE_FP16
  dims: 5
}
sequence_batching {
  state {
    input_name: "INPUT__1"
    output_name: "OUTPUT__0"
    data_type: TYPE_FP16
    dims: 5
    initial_state: {
      data_type: TYPE_FP16
      dims: 5
      zero_data: true
      name: "initial state"
    }
  }
  max_sequence_idle_microseconds: 10000000
  oldest {
    max_queue_delay_microseconds: 50000000
    max_candidate_sequences: 2
  }
}
backend: "pytorch"

And join them together with this ensemble model:

name: "ensemble_model"
platform: "ensemble"
max_batch_size: 1
input {
  name: "INPUT__0"
  data_type: TYPE_FP16
  dims: 5
}
output {
  name: "OUTPUT__0"
  data_type: TYPE_FP16
  dims: 5
}
ensemble_scheduling {
  step {
    model_name: "simple_model"
    model_version: -1
    input_map {
      key: "INPUT__0"
      value: "INPUT__0"
    }
    output_map {
      key: "OUTPUT__0"
      value: "SHARED_INPUT"
    }
  }
  step {
    model_name: "simple_model_2"
    model_version: -1
    input_map {
      key: "INPUT__0"
      value: "SHARED_INPUT"
    }
    output_map {
      key: "OUTPUT__0"
      value: "OUTPUT__0"
    }
  }
}
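
The resulting model repository then looks something like this (a sketch; directory names are assumed, and the TorchScript file produced by the script above goes into each version directory as model.pt, the default filename expected by the PyTorch backend):

model_repository/
├── simple_model/
│   ├── 1/
│   │   └── model.pt
│   └── config.pbtxt
├── simple_model_2/
│   ├── 1/
│   │   └── model.pt
│   └── config.pbtxt        # identical to simple_model except the name
└── ensemble_model/
    ├── 1/                  # empty version directory for the ensemble
    └── config.pbtxt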

Launch a Triton server and target it with this script: https://gist.github.com/jamied157/a047d6314d83499c9811a642a895ad2a

This launches three clients against the server; two of them send zeros and one sends NaN inputs.

Then:

  1. In one terminal, run the script until the responses for two of the clients have all been returned (it will print something like Sequence: 208590 hit output_count: 6 twice).
  2. Run the script in another terminal.
  3. Cancel the original script and rerun it.

In the original terminal you should eventually get a message like

Value 0, sequence 654216, returned a different output, got [[nan nan nan nan nan]]

In the logs you should also be able to see a line like

Executing with;   0  0  0  0  0
[ CUDAHalfType{1,5} ] nan nan nan nan nan
[ CUDAHalfType{1,5} ]

This shows that the state for one of the zero sequences has been contaminated with NaNs.
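
For reference, the client side of the gist is roughly of this shape (a simplified, hypothetical sketch rather than the actual script; the real script runs the three sequences concurrently and compares outputs across runs, but the request pattern is the same):

import threading
import numpy as np
import tritonclient.grpc as grpcclient

TRITON_URL = "localhost:8001"      # assumed default gRPC port
REQUESTS_PER_SEQUENCE = 6

def run_sequence(seq_id: int, value: float):
    client = grpcclient.InferenceServerClient(TRITON_URL)
    for i in range(REQUESTS_PER_SEQUENCE):
        inp = grpcclient.InferInput("INPUT__0", [1, 5], "FP16")
        inp.set_data_from_numpy(np.full((1, 5), value, dtype=np.float16))
        result = client.infer(
            "ensemble_model",
            [inp],
            sequence_id=seq_id,
            sequence_start=(i == 0),
            sequence_end=False,        # never send END; let the idle timeout reap the slot
        )
        out = result.as_numpy("OUTPUT__0")
        if value == 0.0 and np.isnan(out).any():
            print(f"Value 0, sequence {seq_id}, returned a different output, got {out}")
    print(f"Sequence: {seq_id} hit output_count: {REQUESTS_PER_SEQUENCE}")

threads = [
    threading.Thread(target=run_sequence, args=(seq_id, value))
    for seq_id, value in [(101, 0.0), (102, 0.0), (103, float("nan"))]
]
for t in threads:
    t.start()
for t in threads:
    t.join()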

jamied157 commented 7 months ago

I also have logs from log_verbose=1, which I've attached: simple_triton_output.log

jamied157 commented 7 months ago

Quick update on this: we think we've found the issue. We're seeing multiple calls to SequenceBatchScheduler::ReleaseSequenceSlot (https://github.com/triton-inference-server/core/blob/main/src/sequence_batch_scheduler/sequence_batch_scheduler.cc#L880) for the same sequence slot before the slot can be cleaned up correctly.

This results in duplicate sequence slots being added to ready_batcher_seq_slots_ here: https://github.com/triton-inference-server/core/blob/main/src/sequence_batch_scheduler/sequence_batch_scheduler.cc#L958
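
To illustrate the effect (a standalone Python analogy, not Triton's actual code): if the same slot is pushed back onto the ready list twice, two later sequences end up being assigned the same slot, and therefore share implicit state.

from collections import deque

free_slots = deque([0, 1])   # stand-in for ready_batcher_seq_slots_
assigned = {}                # sequence id -> slot

def acquire(seq_id):
    slot = free_slots.popleft()
    assigned[seq_id] = slot
    return slot

def release(seq_id):
    # Bug analogue: nothing prevents releasing a slot that was already freed.
    free_slots.append(assigned[seq_id])

acquire("A"); acquire("B")
release("A")   # released once (e.g. on cancellation)
release("A")   # released again (e.g. by the reaper thread)
print(acquire("C"), acquire("D"))   # prints "0 0": C and D now share the same slot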

HennerM commented 7 months ago

@Tabrizian I have found the root cause, and it isn't just affecting ensemble models but potentially all sequence models that use implicit state and are accessed with the gRPC client: https://github.com/triton-inference-server/core/pull/341. TL;DR: gRPC request cancellation triggers a slot release in the sequence batcher, but the very same slot will still be released by the sequence batcher's reaper thread, potentially leading to new clients sharing the same slot.

Tabrizian commented 7 months ago

Thanks for proposing a fix @HennerM and filing a detailed GitHub issue @jamied157. We'll take a look at this and get back to you.

jamied157 commented 7 months ago

Just to add a bit more detail:

  1. @HennerM's fix above covers the case where we free a sequence slot twice, once via a cancellation and once via the reaper thread.
  2. My repro above can cause a slightly different issue, where cancelled requests from the ensemble can free a sequence slot multiple times.
  3. We're still seeing some issues with versions of Triton before 23.10, where cancellation wasn't implemented (although we can't repro those on a minimal setup yet), so if you have any ideas we'd be keen to hear them!

rmccorm4 commented 7 months ago

Hi @jamied157,

Thanks for such detailed repro steps and investigation!

I had a quick follow-up question from your description so far. You mentioned that you can reproduce this without request cancellation:

We're still seeing some issues with versions of triton before 23.10 where cancellation wasn't implemented

But it looks like your repro steps involve triggering a request cancellation by closing the client connection from the original gRPC client script:

  1. Then run the script in another terminal
  2. Cancel the original script and rerun

Just to clarify - is the reproducible behavior the same prior to 23.10? Or are there different symptoms and triggers?

rmccorm4 commented 7 months ago

Another question @jamied157 @HennerM: given Henner's proposed fix for the sequence batcher issue with a standalone sequence model (no ensemble), and:

My Repro above can cause a slightly different issue where cancelled requests from the ensemble can free a sequence multiple times

Do you have any kind of repro with an ensemble of just one model, rather than a pipeline of two, for the sake of something more minimal?

HennerM commented 7 months ago

I have a repro Python script which should work if you have pytorch and tritonserver installed: https://gist.github.com/HennerM/6ae28bc0360aae96bc5feb422aa1375c

Or, if you already have the model repository setup that Jamie mentioned above, then you should be able to run a variation of this:

import asyncio
import functools
import gc
import os
from typing import Optional

import numpy as np
from tritonclient.grpc import InferenceServerClient as SyncInferenceServerClient
from tritonclient.grpc.aio import InferenceServerClient, InferInput, InferResult

def create_input(seq_id: int):
    input = [InferInput("INPUT__0", [1, FIXED_LAST_DIM], "INT64")]
    input[0].set_data_from_numpy( np.full([1, FIXED_LAST_DIM], fill_value=seq_id, dtype=np.int64))
    return input

def check_result(seq_id: int, res: asyncio.Future[Optional[InferResult]]):
    try:
        r = res.result()
        assert r is not None
        np_res = r.as_numpy("OUTPUT__0")
        assert np_res is not None
        if not np.all(np_res % seq_id == 0):
            print(f"====CROSSOVER==== {np_res=} for {seq_id=}, triton_result", flush=True)
            sys.exit(1)
    except asyncio.CancelledError:
        pass

async def continuous_inference(
    client: InferenceServerClient,
    model,
    offset: int,
    candidates=1,
    max_requests=10,
    time_between_requests=MAX_IDLE_SECONDS * 0.5,
    end_on_cancel: bool = True,
):
    for seq_id in range(offset, offset + candidates):
        print(f"Starting stream {seq_id}")
        t = asyncio.create_task(
            client.infer(
                model,
                create_input(seq_id),
                sequence_start=True,
                sequence_id=seq_id,
                request_id=f"{seq_id}_init",
            )
        )
        t.add_done_callback(functools.partial(check_result, seq_id))
    await asyncio.sleep(time_between_requests)
    for req_id in range(max_requests):
        for seq_id in range(offset, offset + candidates):
            print(f"Sending request {req_id} for {seq_id}")
            t = asyncio.create_task(
                client.infer(
                    model,
                    create_input(seq_id),
                    sequence_id=seq_id,
                    request_id=f"{seq_id}_{req_id}",
                    sequence_end=req_id==max_requests-1,
                )
            )
            t.add_done_callback(functools.partial(check_result, seq_id))
        await asyncio.sleep(time_between_requests)

TARGET_MODEL = "simple_model"

async def simple_cancellation():
    print("Wait for Triton to be ready")
    await asyncio.sleep(2)

    client = InferenceServerClient(TRITON_URL)
    streams_1 = asyncio.create_task(
        continuous_inference(client, TARGET_MODEL, 100, candidates=1, max_requests=5, time_between_requests=0.1)
    )
    await asyncio.sleep(0.3)
    print("Cancelling stream")
    streams_1.cancel()
    await client.close()
    del client
    gc.collect()

    print("Wait for timeout")
    await asyncio.sleep(delay=MAX_IDLE_SECONDS + 2)
    bob = InferenceServerClient(TRITON_URL)
    print("Starting new streams")
    await asyncio.create_task(
        continuous_inference(bob, TARGET_MODEL, offset=200, candidates=4, max_requests=5, time_between_requests=1)
    )

HennerM commented 7 months ago

Just to clarify - is the reproducible behavior the same prior to 23.10? Or are there different symptoms and triggers?

We have only reproduced similar behaviour (that is, multiple sequences getting assigned the same slot and state) in our internal production Triton setup, and haven't yet managed to get this into a minimal repro. We will get back to you if we manage to reproduce it without cancellation.

rmccorm4 commented 7 months ago

Thanks @HennerM, appreciate the diligence here!

So the current status is:

  1. You have a proposed fix for the sequence cancellation scenario which we can help review, test, and get in: https://github.com/triton-inference-server/core/pull/341
  2. You still encounter a similar issue even with this cancellation fix, possibly related to ensembles, but don't yet have a minimal reproducer.

Is that right?

In the meantime, I can help take a look into (2) to see if I can find anything that looks logically related, but a reproducer would greatly help expedite that.

HennerM commented 7 months ago

Thanks a lot, definitely appreciate the help here. On (1), the review and test help will be good, as I struggled in the past to get the test setup working on my machine.

On (2), I need to clarify that we actually haven't noticed the wrong behaviour with the patch above yet; it's just that the patch deals with a problem with cancellation, so we're not certain it fully fixes the problem. For us, the sequence batcher code, potentially coupled with ensembles, is quite complex, and there are a lot of edge cases that can happen in that combination. We just want to be doubly sure that no edge cases are missed.

rmccorm4 commented 7 months ago

Hi @HennerM, we're looking into this single-model reproducer you shared above, but have a few questions.

  1. The code doesn't run as-is; it required a few best-guess modifications to try it out. Can you upload a complete script that is runnable as-is, just to make sure we're running the same thing? My local edits to get something running are below for reference (it also required modifying the config.pbtxt from TYPE_BF16 to TYPE_INT64, or alternatively changing the script to use np.float16, and I otherwise used the simple_model.pt as-is):

    < import sys
    12,15d10
    < MAX_IDLE_SECONDS=10
    < TRITON_URL="localhost:8001"
    < FIXED_LAST_DIM=5
    < 
    95,97d89
    < 
    < if __name__ == "__main__":
    <     asyncio.run(simple_cancellation())
  2. I'm assuming this check is supposed to get triggered for the reproduction, but I ran the script quite a few times on the 24.03 release and never hit this case. Am I missing some steps to reproduce?

            print(f"====CROSSOVER==== {np_res=} for {seq_id=}, triton_result", flush=True)

Having this simpler single-model reproducer that does cancellation in-line would be really helpful for further debugging, so any extra details you can share, or step-by-step instructions with complete scripts to get it working, would be much appreciated!

HennerM commented 7 months ago

Sorry, I'm on UK hours. Here is an updated script: https://gist.github.com/HennerM/cde5037442a2efa4238246cc5ccb7b51

I think I didn't leave enough time for the Triton server to start. Let me know if you have more questions.

rmccorm4 commented 7 months ago

Thanks @HennerM! I was able to reproduce the crossover with your updated gist. Will reach out if we have more questions.