triton-inference-server / dali_backend

The Triton backend that allows running GPU-accelerated data pre-processing pipelines implemented in DALI's Python API.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
MIT License

Prefeed multiple input batches to the inference pipeline #219

Closed: fversaci closed this issue 5 months ago

fversaci commented 9 months ago

Hi,

I'm currently working on optimizing Triton inference performance for our custom C++ CassandraDali plugin.

1) As a first step, we would like to allow the batch size to vary freely (up to max_batch_size), similar to what the standard external_source operator does. To achieve this, we have overridden the Operator::Run methods; you can see our implementation here. It might be helpful to have a predefined way to achieve this, without the need to redefine Run.

2) Our module uses internal multibuffering, similar to DALI's FileReader. This lets us hide high network latency while maintaining high throughput. However, for it to work effectively, multiple input batches need to be fed to the module as soon as inference starts (i.e., similarly to calling pipeline.feed_input multiple times). Currently this is not the case: batches are fed one by one during inference, which prevents the internal prefetching from ever starting, even though we expect it to improve throughput dramatically (by roughly a factor of 20). Is it possible to configure Triton to feed multiple input batches to the inference pipeline whenever they are available?
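To illustrate what we mean, outside Triton the desired prefeeding looks roughly like the following standalone sketch (a toy external_source stands in for our plugin; names, dtypes and batch sizes are just placeholders):

import numpy as np
from nvidia.dali import fn, pipeline_def
import nvidia.dali.types as types

@pipeline_def(batch_size=256, num_threads=4, device_id=0, prefetch_queue_depth=4)
def toy_pipe():
    # a plain external_source standing in for the CassandraDali input
    return fn.external_source(name="UUIDS", dtype=types.UINT8)

pipe = toy_pipe()
pipe.build()
# feed several batches up front, so internal prefetching can start right away
for _ in range(4):
    pipe.feed_input("UUIDS", [np.zeros(16, dtype=np.uint8)] * 256)
out, = pipe.run()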

3) In this regard, I have attempted to adopt the sequence_batching policy, hoping that it would feed multiple input batches at once to the pipeline, as it is designed to work with stateful models. However, I have been unable to configure it for our specific case. When I uncomment this line and attempt to restart the Triton server, this error occurs:

E1019 12:12:27.175930 1607 model_lifecycle.cc:621] failed to load 'dali_cassandra' version 1: Invalid argument: : invalid value oneof field 'scheduling_choice' is already set. Cannot set 'dynamic_batching' for type oneof

Could you provide some guidance to configure sequence_batching for our inference pipeline?
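My understanding of the error above is that dynamic_batching and sequence_batching live in the same scheduling_choice oneof, so presumably only one of them can appear in config.pbtxt at a time, e.g. something like:

# config.pbtxt sketch: only one scheduler may be present at a time
# dynamic_batching { preferred_batch_size: [ 256 ] }   # would have to be removed
sequence_batching {
  max_sequence_idle_microseconds: 5000000
}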

For testing our current code:

git clone https://github.com/fversaci/cassandra-dali-plugin.git -b triton
cd cassandra-dali-plugin
docker build -t cassandra-dali-plugin -f Dockerfile.triton .   # this might take some time
docker run --cap-add=sys_admin --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --name cass-dali cassandra-dali-plugin
# within the container
./start-and-fill-db.sh
./start-triton.sh   # don't close the container
# new shell within the host
docker exec -ti cass-dali fish
# within the container
perf_analyzer -m dali_cassandra --input-data uuids.json -b 512 --concurrency-range 2

Thanks!

banasraf commented 9 months ago

Hey @fversaci. Unfortunately, there's currently no way of prefeeding data to inputs in the DALI backend. Internally, we work under the assumption that we don't process upcoming requests until we have sent responses for all the previous ones.

We could lift that limitation, and we would like to, because it might improve performance in various scenarios. However, it would require a significant rework of the backend, so it's hard to predict when we will be able to tackle it.

If you haven't experimented with this already, you might want to check the performance when you increase the number of model instances (docs). Maybe higher parallelism would help to hide the cost of fetching the data.
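For example, something along these lines in config.pbtxt (a sketch; tune the count to your setup):

instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]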

fversaci commented 9 months ago

Hi @banasraf,

Thank you for the information (and for your responsiveness in general). I will definitely try increasing the number of model instances to see how much it improves the throughput.

Regarding prefeeding Triton-DALI pipelines, I have been considering a temporary workaround for as long as prefeeding is not possible: we could provide a mega-batch (e.g., 1024 UUIDs) to the pipeline, and our module could then split it into mini-batches (e.g., 8 mini-batches of size 128) and handle the prefetching internally.

However, our current code implementing this approach does not work, since Triton expects to receive a single output batch of the same size as the input batch:

E1026 13:09:54.096342 959 dali_model_instance.cc:40] Cannot split a shape list with 128 samples to list shapes of total 1024 samples.

Do you think this issue would be easier to fix than the general prefeeding problem? In other words, can Triton-DALI return multi-part responses to a single request?

To see or test our code:

git clone https://github.com/fversaci/cassandra-dali-plugin.git -b triton
cd cassandra-dali-plugin
docker build -t cassandra-dali-plugin -f Dockerfile.triton .   # this might take some time
docker run --cap-add=sys_admin --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --name cass-dali cassandra-dali-plugin
# within the container
./start-and-fill-db.sh
./start-triton.sh   # don't close the container
# new shell within the host
docker exec -ti cass-dali fish
# within the container
python3 client-triton.py

banasraf commented 9 months ago

Hey @fversaci. This approach should be easier to achieve. We support a similar scenario with the video input (a single input file results in multiple output batches). It will require using a decoupled model (docs). Let me experiment a bit to see what needs to be adjusted to make it work in this case.
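For reference, a decoupled model is selected in config.pbtxt via the transaction policy, along these lines:

model_transaction_policy {
  decoupled: true
}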

fversaci commented 8 months ago

Hi @banasraf,

Do you have any updates on adapting the decoupled model to our specific use case?

Meanwhile I have modified our code so that:

  1. It now has three client implementations to play with: client-http-triton.py, client-grpc-triton.py, client-grpc-stream-triton.py
  2. The model produces a reduced output instead of the full tensors. This means that the bottleneck during testing is no longer on the Python clients, but rather in the Triton server pipeline. As a result, the throughput is much higher than before.
  3. I set the default max_batch_size in models/dali_cassandra/config.pbtxt to 256, which matches the batch size sent by the clients (see the excerpt after this list). When changing max_batch_size to, e.g., 512, the CassandraTriton plugin automatically splits the large batches into smaller ones, which produces this error:
    Cannot split a shape list with 256 samples to list shapes of total 512 samples.
  4. The plugin now logs the input size of each batch it receives and the current status of its internal prefetching mechanism.
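
For reference, the relevant excerpt of the working 256 configuration mentioned in point 3 looks roughly like this:

# models/dali_cassandra/config.pbtxt (excerpt)
max_batch_size: 256   # matches the batch size sent by the test clients, so batches are never split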

Thanks!

fversaci commented 6 months ago

Hi all, I wanted to provide an update on our use case. Since there is currently no general prefeeding available for the DALI backend in Triton, we have implemented internal prefetching in our plugin. We take the original batch we receive from Triton (e.g., bs=4096), split it into mini-batches (e.g., bs=256), and apply prefetching to these mini-batches.
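In case it helps others, the split-and-prefetch scheme can be sketched in a few lines of Python (this is just an illustration; the actual plugin implements it in C++ inside the DALI operator, and fetch_minibatch below is a stand-in for the asynchronous Cassandra read and decode):

from collections import deque
from concurrent.futures import ThreadPoolExecutor

def fetch_minibatch(uuid_chunk):
    # stand-in for the real work: read the samples for these UUIDs from Cassandra and decode them
    return uuid_chunk

def run_megabatch(uuids, mini_bs=256, prefetch_depth=4):
    # split the mega-batch into mini-batches and keep up to `prefetch_depth` of them in flight
    chunks = [uuids[i:i + mini_bs] for i in range(0, len(uuids), mini_bs)]
    results, in_flight = [], deque()
    with ThreadPoolExecutor(max_workers=prefetch_depth) as pool:
        for chunk in chunks:
            in_flight.append(pool.submit(fetch_minibatch, chunk))
            if len(in_flight) >= prefetch_depth:
                results.extend(in_flight.popleft().result())
        while in_flight:
            results.extend(in_flight.popleft().result())
    return results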

If anyone is experiencing a similar issue, our code is available in this repository.

JanuszL commented 5 months ago

@fversaci - very good work. Thank you for sharing the results.

fversaci commented 5 months ago

The code is now in the dev branch, along with some (minimal) documentation: https://github.com/fversaci/cassandra-dali-plugin/tree/dev/examples/triton