google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://ai.google.dev/edge/mediapipe
Apache License 2.0

Using TimeDistributed input layer with Mediapipe #1898

Closed atiselsts closed 3 years ago

atiselsts commented 3 years ago

Hello,

I'm using mediapipe with a custom model. Currently the model takes a single image as input, but I want to change the model by adding a TimeDistributed layer to the network. The new network will take 5 images as input. Is it possible to pass the correct input data to the model with Mediapipe calculators?

The current input shape is (1, 240, 320, 3), it will be changed to (1, 5, 240, 320, 3).
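The intended shape change can be sketched with numpy, using zero-filled arrays as hypothetical stand-ins for the real preprocessed frames:

```python
import numpy as np

# Hypothetical stand-ins for five consecutive preprocessed frames,
# each matching the current single-image input shape (240, 320, 3).
frames = [np.zeros((240, 320, 3), dtype=np.float32) for _ in range(5)]

# Current model input: one frame plus a batch dimension.
single_input = frames[0][np.newaxis, ...]
assert single_input.shape == (1, 240, 320, 3)

# TimeDistributed model input: a window of 5 frames stacked along a
# new time axis, again with a batch dimension of 1.
window_input = np.stack(frames)[np.newaxis, ...]
assert window_input.shape == (1, 5, 240, 320, 3)
```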

The relevant part of the current graph is as follows:

node {
  calculator: "ImageToTensorCalculator"
  input_stream: "IMAGE_GPU:cropped_image"
  output_stream: "TENSORS:input_tensor"
  output_stream: "LETTERBOX_PADDING:letterbox_padding"
  options: {
    [mediapipe.ImageToTensorCalculatorOptions.ext] {
      output_tensor_width: 320
      output_tensor_height: 240
      keep_aspect_ratio: true
      output_tensor_float_range {
        min: 0
        max: 255
      }
      border_mode: BORDER_ZERO
      gpu_origin: TOP_LEFT
    }
  }
}

node {
  calculator: "InferenceCalculator"
  input_stream: "TENSORS:input_tensor"
  output_stream: "TENSORS:classification_tensor"
  options: {
    [mediapipe.InferenceCalculatorOptions.ext] {
      model_path: "mediapipe/modules/hand_washing/washing_movements.tflite"
    }
  }
}
sgowroji commented 3 years ago

Hi @atiselsts, thanks for reaching out to us regarding your query. Could you provide your use case and the details of the solution you are referring to? Thanks!

atiselsts commented 3 years ago

We want to use a custom model for hand gesture recognition. Is this an appropriate question to ask here?

Here is my current code, if you're interested in more details: https://github.com/atiselsts/mediapipe/blob/feature/washing-movements-app/mediapipe/modules/hand_washing/washing_movements_gpu.pbtxt

It works fine, but in order to increase the recognition accuracy, I'm interested in replacing the current model with another model that accepts 5 input images at once (i.e. works on a temporal sequence of images).

Is there a ready-made calculator I could use for this, or do you have any tips for writing a custom calculator?

atiselsts commented 3 years ago

I can probably write a workaround: convert the IMAGE_GPU to an ImageFrame, stack 5 image frames together, then convert the result back to IMAGE_GPU. That can then be converted to a tensor in GPU memory and fed to the network, which uses a reshape layer to convert the input back to the format it needs.

Nevertheless, the functionality to merge multiple tensors directly in the GPU memory would be cool!
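The stack-then-reshape idea above can be sanity-checked with a quick numpy sketch (hypothetical data; the real pipeline operates on ImageFrames and GPU buffers):

```python
import numpy as np

# Five hypothetical 240x320 RGB frames, filled with distinct values
# so we can verify the round trip below.
frames = [np.full((240, 320, 3), i, dtype=np.float32) for i in range(5)]

# Stack the frames into one tall (5*240) x 320 image, mimicking the
# merged ImageFrame that would be fed to ImageToTensorCalculator.
tall = np.concatenate(frames, axis=0)
assert tall.shape == (1200, 320, 3)

# A reshape layer in the model can split the tall image back into the
# (5, 240, 320, 3) sequence: row-major order keeps each frame contiguous.
restored = tall.reshape(5, 240, 320, 3)
assert all(np.array_equal(restored[i], frames[i]) for i in range(5))
```

This only checks the shape bookkeeping; the point of concatenating along the height axis is that a plain row-major reshape recovers each frame intact.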

eknight7 commented 3 years ago

Adding @mcclanahoochie to suggest how to best solve this for stacking images/tensors as input to the model via MediaPipe.

mcclanahoochie commented 3 years ago

it seems like you want to create one large image that contains 5 images stacked/concatenated (a "height*5 x width" image, and not a 20-channel image of regular size), right?

that is a very natural way to do batching, but unfortunately mediapipe doesn't offer any calculators for this right now.

the method you described seems ok, but it may run into the issue of overflowing the max allowed GL texture dimensions (typically around 4k), so you should "stack"/concatenate the smaller, tensor-sized images (320x240) to be safer (also better for bandwidth). what you propose should work, and is maybe the simplest way to test something, but ideally the cpu should be avoided if you are comfortable writing GLSL shader code.

one alternative is to write code to stack tensors instead of images (i.e. hacking the ImageToTensorCalculator), but again, that would involve some compute shaders.

another alternative, if you are in control of the model, is to accept 5 separate gpu images as inputs, then do the concatenation in the model as a pre-processing step.

atiselsts commented 3 years ago

another alternative, if you are in control of the model, is to accept 5 separate gpu images as inputs, then do the concatenation in the model as a pre-processing step.

Can you show how to implement this as input? I tried passing multiple tensors to the model with indices (TENSORS:0:name1, TENSORS:1:name2), but that was not accepted by the inference calculator.

atiselsts commented 3 years ago

it seems like you want to create one large image that contains 5 images stacked/concatenated (a "height*5 x width" image, and not a 20-channel image of regular size), right?

Yes. I implemented this for CPU yesterday and it works ok indeed. Don't have any experience writing GPU code unfortunately. https://github.com/atiselsts/mediapipe/blob/feature/washing-movements-app-time-distributed/mediapipe/calculators/image/image_merge_calculator.cc

mcclanahoochie commented 3 years ago

Great!

The inference calculator(s) should support a vector of tensor inputs, and the order they are in gets directly mapped to the model inputs 0-N, so there is no way to specify named or out-of-order inputs.
The ImageToTensorCalculator indeed only works with one image/tensor at a time though, so I guess you would need to either modify that, or have a custom calculator that accumulates those tensors into a vector and sends out the batch. This all assumes your model is modified to accept 5/N separate input tensors instead of the concatenated one.
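The accumulate-and-batch behavior such a custom calculator would need can be sketched in Python (a hypothetical `TensorWindowAccumulator`, using numpy arrays in place of MediaPipe tensors):

```python
from collections import deque
import numpy as np

class TensorWindowAccumulator:
    """Hypothetical sketch of a calculator that buffers incoming frame
    tensors and emits an ordered batch once the window is full."""

    def __init__(self, window_size=5):
        # A sliding window: once full, each new frame evicts the oldest.
        self.window = deque(maxlen=window_size)

    def process(self, tensor):
        self.window.append(tensor)
        if len(self.window) < self.window.maxlen:
            return None  # not enough frames buffered yet; emit nothing
        # Oldest-to-newest order, mapping directly to model inputs 0..N.
        return list(self.window)

acc = TensorWindowAccumulator()
outputs = [acc.process(np.zeros((1, 240, 320, 3))) for _ in range(6)]
assert outputs[:4] == [None, None, None, None]  # warming up
assert len(outputs[4]) == 5 and len(outputs[5]) == 5
```

A real MediaPipe calculator would do the same bookkeeping in C++ `Process()`, suppressing output until the window fills and then emitting the vector of tensors each frame.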

atiselsts commented 3 years ago

@mcclanahoochie Thanks - using multiple inputs would achieve what I want without modifying the model.

I'm not sure how to write that, though. I tried this, but got a runtime error:

node {
  calculator: "InferenceCalculator"
  input_stream: "TENSORS:input_tensor1"
  input_stream: "TENSORS:input_tensor2"
  input_stream: "TENSORS:input_tensor3"
  input_stream: "TENSORS:input_tensor4"
  input_stream: "TENSORS:input_tensor5"
  output_stream: "TENSORS:classification_tensor"
  options: {
    [mediapipe.InferenceCalculatorOptions.ext] {
      model_path: "mediapipe/modules/hand_washing/washing_movements.tflite"
    }
  }
}

The error is:

2021-04-26 11:54:57.096 7430-7720/com.google.mediapipe.apps.washingmovements E/native: E20210426 11:54:57.096338 7720 graph.cc:471] while processing the input streams of subgraph node InferenceCalculator: ; tag "TENSORS" index 0 already had a name "washingmovementsgpu__inferencecalculator__washingmovementsgpu__input_tensor1" but is being reassigned a name "washingmovementsgpu__inferencecalculator__washingmovementsgpu__input_tensor2"

eknight7 commented 3 years ago

Hi, when there are multiple inputs to the same tag, we can format the inputs as follows:

node {
  calculator: "InferenceCalculator"
  input_stream: "TENSORS:0:input_tensor1"
  input_stream: "TENSORS:1:input_tensor2"
  input_stream: "TENSORS:2:input_tensor3"
  input_stream: "TENSORS:3:input_tensor4"
  input_stream: "TENSORS:4:input_tensor5"
  output_stream: "TENSORS:classification_tensor"
  options: {
    [mediapipe.InferenceCalculatorOptions.ext] {
      model_path: "mediapipe/modules/hand_washing/washing_movements.tflite"
    }
  }
}
mcclanahoochie commented 3 years ago

The InferenceCalculator accepts a vector of tensors, which should be ordered according to what your model accepts.

What @eknight7 says is true in general (specifying an index number) for multiple input streams, but for the InferenceCalculator/TfLiteInferenceCalculator a single input stream is used that contains multiple tensors.

atiselsts commented 3 years ago

Got it, thanks.