Closed: fversaci closed this issue 5 months ago
Hey @fversaci. Unfortunately, there's currently no way of prefeeding data to inputs in the DALI backend. Internally we work under the assumption that we don't process upcoming requests until we have sent responses for all the previous ones.
We could lift that limitation, and we would like to, because it might improve performance in various scenarios. However, this will require a significant rework of the backend, so it's hard to predict when we will be able to tackle it.
If you haven't experimented with this already, you might want to check the performance when you increase the number of model instances (docs). Maybe higher parallelism would help to hide the cost of fetching the data.
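For reference, the number of model instances is set in the `instance_group` section of the model's `config.pbtxt`; a minimal fragment (the `count` value here is just an example to experiment with):

```
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
```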
Hi @banasraf,
Thank you for the information (and your availability in general). I will definitely try increasing the number of model instances to see how it improves the throughput.
Regarding the issue with prefeeding Triton-DALI pipelines, I have been considering a temporary solution, while it's still not possible to prefeed them. We could provide a mega-batch (e.g., 1024 UUIDs) to the pipeline and our module could then split it into mini-batches (e.g., 8 mini-batches of size 128), and handle the prefeeding internally.
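The splitting step of this workaround can be sketched in plain Python (hypothetical names, independent of the actual plugin code):

```python
def split_batch(mega_batch, mini_batch_size):
    """Split a mega-batch (e.g., 1024 UUIDs) into mini-batches (e.g., 8 x 128)."""
    return [mega_batch[i:i + mini_batch_size]
            for i in range(0, len(mega_batch), mini_batch_size)]

mega_batch = list(range(1024))        # stand-in for 1024 UUIDs
minis = split_batch(mega_batch, 128)  # -> 8 mini-batches of 128 samples each
```

The plugin would then prefeed these mini-batches internally instead of processing the mega-batch in one shot.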
However, our current code implementing this approach is not functioning properly, since Triton expects to receive a single batch of the same size as the input batch:
E1026 13:09:54.096342 959 dali_model_instance.cc:40] Cannot split a shape list with 128 samples to list shapes of total 1024 samples.
Do you think this issue is easier to fix compared to the general prefeeding problem? In other words, can Triton-DALI handle multi-part answers to queries?
To see or test our code:
git clone https://github.com/fversaci/cassandra-dali-plugin.git -b triton
cd cassandra-dali-plugin
docker build -t cassandra-dali-plugin -f Dockerfile.triton . # this might take some time
docker run --cap-add=sys_admin --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --name cass-dali cassandra-dali-plugin
# within the container
./start-and-fill-db.sh
./start-triton.sh # don't close the container
# new shell within the host
docker exec -ti cass-dali fish
# within the container
python3 client-triton.py
@fversaci Hey. This approach should be easier to achieve. We support a similar scenario with the video input (a single input file results in multiple output batches). This will require using a decoupled model (docs). Let me experiment a bit to see what needs to be adjusted to make it work in this case.
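For reference, decoupled mode is enabled per model in `config.pbtxt` via the transaction policy:

```
model_transaction_policy {
  decoupled: true
}
```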
Hi @banasraf,
Do you have any updates on adapting the decoupled model to our specific use case?
Meanwhile I have modified our code so that:
- there are now three clients: client-http-triton.py, client-grpc-triton.py and client-grpc-stream-triton.py;
- max_batch_size in models/dali_cassandra/config.pbtxt is set to 256, which matches the batch size offered by the clients.

When changing max_batch_size to, e.g., 512, the CassandraTriton plugin automatically splits the large batches into smaller ones, which causes this error to be produced:
Cannot split a shape list with 256 samples to list shapes of total 512 samples.
Thanks!
Hi all, I wanted to provide an update on our use case. Since there is currently no general prefeeding available for the DALI backend in Triton, we have implemented internal prefetching in our plugin. We take the original batch we receive from Triton (e.g., bs=4096), split it into mini-batches (e.g., bs=256), and apply prefetching to these mini-batches.
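In spirit, the internal prefetching works like the following simplified, pure-Python sketch (hypothetical names and a thread pool standing in for the actual C++ plugin's machinery):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(mini_batch):
    # stand-in for the network round trip to Cassandra
    return [uuid * 2 for uuid in mini_batch]

def prefetching_pipeline(batch, mini_bs=256, depth=4):
    """Split `batch` into mini-batches and keep up to `depth` fetches in flight."""
    minis = [batch[i:i + mini_bs] for i in range(0, len(batch), mini_bs)]
    results = []
    with ThreadPoolExecutor(max_workers=depth) as pool:
        # submitting all mini-batches up front lets several requests overlap,
        # hiding network latency instead of paying it once per mini-batch
        futures = [pool.submit(fetch, m) for m in minis]
        for fut in futures:
            results.extend(fut.result())
    return results
```

The key point is that the fetches for later mini-batches are already in flight while earlier ones are being consumed.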
If anyone is experiencing a similar issue, our code is available in this repository.
@fversaci - very good work. Thank you for sharing its results.
The code is now in the dev branch, along with some (minimal) documentation:
https://github.com/fversaci/cassandra-dali-plugin/tree/dev/examples/triton
Hi,
I'm currently working on optimizing Triton inference performance for our custom C++ CassandraDali plugin.
1) As a first step, we would like to allow the batch size to vary freely (up to max_batch_size), similar to what the standard external_source operator does. To achieve this, we have overridden the Operator::Run methods. You can see our implementation here. It might be helpful to have a predefined way of achieving this, without the need to redefine Run.

2) Our module utilizes internal multibuffering, similar to DALI's FileReader. This enables us to hide high network latency while maintaining a high throughput. However, for it to work effectively, multiple input batches need to be fed to the module at once when the inference process starts (i.e., similarly to calling pipeline.feed_input multiple times). Currently, this is not the case, as the batches are fed one by one during the inference process. This prevents the internal prefetch process from starting, which we expect would dramatically improve the throughput (by about a factor of 20). Is it possible to configure Triton to feed many input batches to the inference pipeline whenever they are available?

3) In this regard, I have attempted to adopt the sequence_batching policy, hoping that it would feed multiple input batches at once to the pipeline, as it is designed to work with stateful models. However, I have been unable to configure it for our specific case. When I uncomment this line and attempt to restart the Triton server, this error occurs:

Could you provide some guidance on configuring sequence_batching for our inference pipeline?

For testing our current code:
Thanks!