microsoft / azure-percept-advanced-development

Azure Percept DK advanced topics

This commit adds a new time-alignment feature #38

Closed MaxStrange closed 3 years ago

MaxStrange commented 3 years ago

This commit adds a new feature that can be enabled in the module twin via `"TimeAlignRTSP": "true"`.
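For reference, setting it in the module twin's desired properties would look something like this (only the `"TimeAlignRTSP"` key and value are from this PR; the surrounding structure is just the standard IoT Edge module twin layout):

```json
{
  "properties": {
    "desired": {
      "TimeAlignRTSP": "true"
    }
  }
}
```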

This feature aligns the frames in the result RTSP stream with the results from the neural network.

When this feature is turned OFF (the default), we send out frames on the raw and result channels as soon as we get them, so the two streams look identical when viewed, except for the mark-ups. The mark-ups (like bounding boxes) are derived from the neural network, which produces them much less frequently than the camera produces frames, especially for high-latency networks like OpenPose, OCR, or Faster-RCNN-ResNet50. With a high-latency network, therefore, you can see the mark-ups being applied to the wrong frames: we apply them to whatever frame we have on hand when a new inference arrives, and because the network is slow, that frame is not the frame the network actually ran on.

This has the advantage of being memory-efficient and fast: we store nothing but the last frame from the camera and the last inference from the network, and we send frames out as often as we get them.

But it has the disadvantage of being noticeably out of step with reality when the network has high latency; with OpenPose, for example, the skeleton seems to walk behind the person it is detecting.
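Roughly, the pass-through path looks like this (a minimal sketch; `Inference`, `PassThroughMarker`, and `send_to_rtsp` are hypothetical names for illustration, not the actual azureeyemodule code):

```cpp
#include <mutex>
#include <vector>

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Stand-in for whatever the network produces (bounding boxes here).
struct Inference
{
    std::vector<cv::Rect> boxes;
};

// Hypothetical stub: hands a finished frame to the RTSP server.
void send_to_rtsp(const cv::Mat &frame);

// Default (pass-through) path: keep only the latest inference, and stamp it
// onto whatever frame the camera just gave us, even though the network most
// likely ran on an older frame.
class PassThroughMarker
{
public:
    // Called on every camera frame.
    void on_frame(const cv::Mat &frame)
    {
        std::lock_guard<std::mutex> lock(this->mutex);
        cv::Mat marked = frame.clone();
        for (const auto &box : this->last_inference.boxes)
        {
            cv::rectangle(marked, box, cv::Scalar(0, 255, 0), 2);
        }
        send_to_rtsp(marked);
    }

    // Called whenever the network finishes an inference, which happens much
    // less often than on_frame() for slow networks.
    void on_inference(const Inference &inf)
    {
        std::lock_guard<std::mutex> lock(this->mutex);
        this->last_inference = inf;
    }

private:
    std::mutex mutex;
    Inference last_inference;
};
```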

When this feature is turned ON, we still send out the raw frames as often as we get them, but we also hold back a copy of each raw frame until we get a new neural network inference. As soon as we get one, we go through the buffer of held-back frames, find the frame that best corresponds in time to the inference, apply the inference's mark-ups to that frame and all older ones, and then send those marked-up frames in one big batch to the RTSP server.
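A minimal sketch of that alignment logic, under the same caveat that all names here (`TimeAligner`, `send_batch_to_rtsp`, and so on) are hypothetical rather than the actual implementation:

```cpp
#include <cstdint>
#include <cstdlib>
#include <deque>
#include <vector>

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

struct Inference
{
    std::vector<cv::Rect> boxes;
};

// Hypothetical stub: hands a batch of finished frames to the RTSP server.
void send_batch_to_rtsp(const std::vector<cv::Mat> &batch);

// Time-aligned path: hold back a copy of every raw frame until an inference
// arrives, match the inference to the held frame closest to it in time,
// mark up that frame and all older ones, and flush them as one batch.
class TimeAligner
{
public:
    void on_frame(int64_t timestamp_ms, const cv::Mat &frame)
    {
        this->held.push_back({timestamp_ms, frame.clone()});
    }

    void on_inference(int64_t inference_ms, const Inference &inf)
    {
        if (this->held.empty())
        {
            return;
        }

        // Find the held frame whose timestamp is nearest the inference's.
        size_t best = 0;
        int64_t best_delta = std::abs(this->held[0].timestamp_ms - inference_ms);
        for (size_t i = 1; i < this->held.size(); i++)
        {
            int64_t delta = std::abs(this->held[i].timestamp_ms - inference_ms);
            if (delta < best_delta)
            {
                best_delta = delta;
                best = i;
            }
        }

        // Apply the mark-ups to the best-matching frame and all older ones.
        std::vector<cv::Mat> batch;
        for (size_t i = 0; i <= best; i++)
        {
            cv::Mat marked = this->held[i].frame; // already our own copy
            for (const auto &box : inf.boxes)
            {
                cv::rectangle(marked, box, cv::Scalar(0, 255, 0), 2);
            }
            batch.push_back(marked);
        }
        this->held.erase(this->held.begin(), this->held.begin() + best + 1);

        send_batch_to_rtsp(batch);
    }

private:
    struct TimestampedFrame
    {
        int64_t timestamp_ms;
        cv::Mat frame;
    };

    std::deque<TimestampedFrame> held;
};
```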

This has the advantage of being as well time-aligned as possible, so that the mark-ups line up with the video as closely as we can get them. However, for networks with a LOT of latency, like OCR, we end up storing so many frames that we can't send all of them to the RTSP server without overwriting old frames (and therefore causing the video to jump). To combat this, I've made it so that when we have too many frames to send to the RTSP server, we only send every Nth frame (see the sketch below). However, this may lead to an interesting situation where the mark-ups actually jump ahead of the video and wait for the video to catch up.
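Continuing the sketch above, the every-Nth-frame fallback could be as simple as decimating the batch before sending it (`max_batch` is an assumed cap on what the server accepts at once, not a value from this PR):

```cpp
// Hypothetical fallback: if the batch would overflow the RTSP server's
// buffer, keep only every Nth frame, with N chosen so the batch fits.
const size_t max_batch = 30; // assumed cap, for illustration only
if (batch.size() > max_batch)
{
    const size_t stride = (batch.size() + max_batch - 1) / max_batch; // ceiling division
    std::vector<cv::Mat> decimated;
    for (size_t i = 0; i < batch.size(); i += stride)
    {
        decimated.push_back(batch[i]);
    }
    batch = std::move(decimated);
}
```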

The disadvantages of this feature are that 1) when it is turned on, we need a lot of memory to hold all the frames, and 2) the video is held back by roughly the average latency of the neural network. This is clearly visible when viewing the raw and result streams side by side.