edmuthiah opened this issue 3 years ago (Open)
Hi @ed-muthiah do you feel like your use case would be covered by increasing the batch size of your model and taking care of using multiple outputs in the inference function of a handler? https://github.com/aws-samples/amazon-sagemaker-endpoint-deployment-of-fastai-model-with-torchserve/blob/main/deployment/handler.py#L85
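To make the batching suggestion concrete, here is a minimal sketch of the idea with a stand-in `fake_model` function (a real handler would call the loaded PyTorch model instead). With batch size 2, two consecutive frames arrive together, and the handler must return exactly one response per request, in order:

```python
def fake_model(batch):
    # Stand-in for a real detector: returns one result per input frame.
    return [{"frame_sum": sum(frame)} for frame in batch]

def inference(batch):
    # TorchServe hands the handler up to `batch_size` requests at once;
    # the handler must return one response per request, in the same order,
    # so each client gets the result for the frame it sent.
    outputs = fake_model(batch)
    assert len(outputs) == len(batch)
    return outputs

# Two "consecutive frames" batched together:
responses = inference([[1, 2, 3], [4, 5, 6]])
```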
If that doesn't work you can also write your inferences in a file in the inference function and load them in subsequent calls to the handler
Let me know if you'd like more detail since I think it'd be cool to have an example like this in the repo
@msaroufim Hey Mark, I think the second method of writing every inference to file and reloading it back might significantly impact inference times compared to having the previous state available in memory.
With regards to batch processing, could you please provide a bit more detail on how this could accommodate sequential information? If I set the batch size to 2 to use two consecutive frames in a video, but there are 4 workers, how would we ensure that the correct worker receives the correct sequential frame?
I had a look through the handler you linked; it seems to me like a standard handler without batch processing or any writing/loading to file (apart from the .pth).
I'd be happy to contribute to this as it's probably going to be a common request as people try out any model that uses series data.
@ed-muthiah this is an interesting use-case. TorchServe currently does not support linking data to a specific worker. Finer control for users is on our roadmap for future versions, however the plan so far is about finer control at the device level (selecting the gpu:id).
I wonder if, as a workaround for your current use-case, there is a way to sign your input (similar to this sample) and return the corresponding response (similar to here), so that post-processing/aggregating inference can be done on the client-side.
We welcome your proposal and happy to discuss it further.
Hi @HamidShojanazeri and @msaroufim thanks for taking an interest in this issue. Personally I'm working with sequential images from a video. Would you have any idea how we could sign an image?
Furthermore, I'm not sure I understand what signing achieves. Consider three different streams of images; let's say some images are signed "Stream = A", some "Stream = B" and some "Stream = C".
Personally, I also think aggregation of inference should stay on the server-side. Consider the scenario where you are working with a CPU-only machine and sending sequential video frames to a remote GPU-enabled TorchServe server. Ideally you would simply want images to be sent to the server and receive back JSON inference results. For example, in object tracking you would only want to receive back what objects are in the image, their identity, and where they are now located compared to the previous frame. Managing this on the TorchServe side greatly reduces the compute requirements on the client.
This issue is applicable to any sequential data but I'd like to focus on tracking by detection to narrow the scope of what we're discussing:
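To anchor the tracking-by-detection scope, here is a hedged, minimal sketch of the kind of per-frame aggregation being discussed: a greedy IoU matcher that assigns each new detection the ID of the best-overlapping track from the previous frame. The box format, threshold, and `associate` function are all illustrative assumptions, not TorchServe APIs:

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2); returns intersection-over-union.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def associate(prev_tracks, detections, threshold=0.3):
    # Greedy matching: each detection inherits the ID of the best
    # overlapping previous track, otherwise it starts a new track.
    next_id = max(prev_tracks, default=-1) + 1
    tracks, used = {}, set()
    for det in detections:
        best_id, best_iou = None, threshold
        for tid, box in prev_tracks.items():
            if tid in used:
                continue
            score = iou(box, det)
            if score > best_iou:
                best_id, best_iou = tid, score
        if best_id is None:
            best_id, next_id = next_id, next_id + 1
        used.add(best_id)
        tracks[best_id] = det
    return tracks
```

This is exactly the step that needs the previous frame's results in memory, which is why handler statefulness matters here.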
@ed-muthiah I see, I think I missed the part that you need the previous frames' predictions contributing to the next frame. My suggestion was to find a workaround to associate the responses from stream A being served on different workers.
As all the workers serve the same model, the idea of the signature is just to make sure that the inference produced by any worker can be associated back to the data stream it originated from. For example, if you have two video frames from stream A arriving in different batches (or in a mixed batch of data from streams A and B), one frame from stream A goes to worker1 and the other goes to worker2; the returned responses are then identifiable with respect to the original stream.
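The signing idea might look something like the following sketch, where the client attaches a `stream_id` field to each request body and the handler echoes it back with the prediction. The field name, the JSON request shape, and the dummy prediction are all assumptions for illustration:

```python
import json

def handle_batch(requests):
    # Each request is assumed to carry a JSON body with a "stream_id"
    # the client attached (the "signature") plus the frame payload.
    responses = []
    for req in requests:
        body = json.loads(req["body"])
        prediction = {"boxes": []}  # stand-in for real inference output
        # Echo the signature so the response can be matched back to its
        # stream no matter which worker produced it.
        responses.append({"stream_id": body["stream_id"], **prediction})
    return responses

reqs = [
    {"body": json.dumps({"stream_id": "A", "frame": 7})},
    {"body": json.dumps({"stream_id": "B", "frame": 3})},
]
out = handle_batch(reqs)
```

The client can then group responses by `stream_id` and reorder them, even when frames from the same stream were served by different workers.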
For sure, it's ideal to persist it on the server side, though in this setting that would affect the handler time. By post-processing on the client side, what I meant was aggregating predictions related to each frame of a video, which I think in this case might not help.
CC @maaquib
Gotcha, now I get what you mean. But yes, as you said, the sequential nature of this makes it a bit tougher.
A few comments on what I think we need to set up a working example: the handler would need to wait until the next frame is received. This would essentially keep the handler running in a loop, processing frames as they come in and outputting inference results. There might be a few flaws in my understanding of TorchServe, as I'm not sure whether the handler script must run fully to completion before TorchServe actually outputs anything, so please let me know :)
@HamidShojanazeri @msaroufim @maaquib
What about an in memory database like redis or memcached? Should work in principle similarly to writing to a file without the slowdown from reading and writing to disk.
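The shape of that suggestion might look like the sketch below. A plain dict-backed class stands in for the external store so the pattern is runnable here; in a real deployment the `get`/`set` calls would go through a client such as redis-py or pymemcache against a shared server, so state written by one worker is visible to the others without touching disk. All names here are illustrative assumptions:

```python
class KVStore:
    """Stand-in for an external in-memory store (e.g. Redis/memcached).

    Every worker would talk to the same store server, so per-stream
    state survives across workers without disk I/O.
    """
    _data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

store = KVStore()

def infer_frame(stream_id, frame):
    # Load the previous result for this stream, use it, then update it.
    prev = store.get(f"prev:{stream_id}")
    result = {"frame": frame, "prev_frame": prev["frame"] if prev else None}
    store.set(f"prev:{stream_id}", result)
    return result
```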
This could work, I'll investigate a bit more.
@msaroufim Looks like Redis cannot be used in a stream processing or ML engine, as per the license agreement: https://redislabs.com/wp-content/uploads/2019/09/redis-source-available-license.pdf
Could someone please confirm if TorchServe is considered a "(f) machine learning or deep learning or artificial intelligence serving engine;"
Is your feature request related to a problem? Please describe.
I'm looking to pass multiple feeds of sequential data from local instances to torchserve running on a remote instance. The current handlers such as the object detector seem to distribute the incoming requests to different workers and run the handler but consider the incoming data to be independent of previous data. I'm looking to understand how I can create a persistent handler.
Describe the solution you'd like
For example, say I have sequential data streams A, B and C posting data to the TorchServe server. I would like to persist information from the results of the previous handler call and update it at every timestep as new data comes in sequentially. Each worker must only receive information from the stream (A, B or C) that it had previously received a request from.
A use case for this is speech-to-text or object tracking, where new input words or images relate to prior inputs.
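One detail worth noting for this feature request: a TorchServe handler instance lives as long as its worker process, so instance attributes do persist across requests handled by that worker. A minimal sketch of per-stream state kept that way is below; the caveat from the discussion still applies, since TorchServe does not pin a stream to a worker, so with multiple workers each one only sees the frames routed to it. The class and method names are illustrative, not a TorchServe API:

```python
class StatefulHandler:
    """Sketch of a handler keeping per-stream state in worker memory.

    Attributes set on the instance persist across requests served by
    the same worker process, but NOT across different workers.
    """

    def __init__(self):
        self.prev = {}  # stream_id -> last frame seen for that stream

    def handle(self, stream_id, frame):
        # Use the previous frame from this stream (if this worker saw one),
        # then record the current frame for the next call.
        result = {"frame": frame, "prev": self.prev.get(stream_id)}
        self.prev[stream_id] = frame
        return result
```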
@harshbafna @HamidShojanazeri