This commit adds a new feature that can be enabled in the module twin
via "TimeAlignRTSP": "true".
This feature aligns the frames in the result RTSP stream with the
results from the neural network.
When this feature is turned OFF (the default), we send out frames on
the raw and result channels as soon as we get them, so the two streams
look identical when viewed, except for the mark-ups. The mark-ups (like
bounding boxes) are derived from the neural network, which produces
them much less frequently, especially for high-latency networks like
OpenPose, OCR, or Faster-RCNN-ResNet50. With a high-latency network,
you can therefore see the mark-ups applied to the wrong frame: we apply
the mark-ups to whatever frame we have on hand when a new inference
arrives, but because the network is slow, that frame is not the frame
the network actually ran on.
This has the advantage of being memory efficient and fast. We do not
store anything but the last frame from the camera and the last inference
from the network, and we send them out as often as we get them.
But it has the disadvantage of being noticeably out of step with reality
when the network has high latency, such as when viewing OpenPose results
and the skeleton seems to walk behind the person it is detecting.
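The OFF path above can be sketched roughly as follows. This is a
minimal illustration, not the real module code: the names (send_raw,
send_result, draw_markups) are hypothetical stand-ins, and the frame
handlers here just record what they would send.

```python
# Sketch of the OFF (default) path: keep only the latest inference,
# and apply its mark-ups to whatever frame is on hand.
raw_out, result_out = [], []

def send_raw(frame):
    raw_out.append(frame)

def send_result(frame):
    result_out.append(frame)

def draw_markups(frame, inference):
    # Stand-in for real drawing: pair the frame with the inference used.
    return (frame, inference)

latest_inference = None

def on_frame(frame):
    # Raw frames go out immediately; the result stream reuses the
    # newest inference, which may have run on a much older frame.
    send_raw(frame)
    if latest_inference is not None:
        send_result(draw_markups(frame, latest_inference))

def on_inference(inference):
    global latest_inference
    latest_inference = inference

# Frames 0..4 arrive; the network only finishes inferring on frame 0
# after frame 2 arrives, so frames 3 and 4 wear frame 0's mark-ups.
on_frame(0); on_frame(1); on_frame(2)
on_inference("boxes-for-frame-0")
on_frame(3); on_frame(4)
```

This makes the misalignment concrete: the mark-ups computed for frame 0
end up drawn on frames 3 and 4.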
When this feature is turned ON, we still send out the raw frames as
often as we get them, but we hold back a copy of each raw frame until we
get a new neural network inference. As soon as we get a new inference,
we go through the buffer of frames we are holding back and find the
frame that best corresponds in time with the inference, then we apply
the inference mark-ups to that frame and all older ones. Then we send
those marked-up frames all in one big batch to the RTSP server.
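The buffer-and-align logic can be sketched like this. Again the names
are illustrative, not the real implementation; the real code matches
frames and inferences by their capture timestamps, which is what this
sketch assumes.

```python
# Sketch of the ON path: buffer a timestamped copy of each raw frame;
# when an inference arrives, find the buffered frame closest in time,
# mark up that frame and all older ones, and flush them as one batch.
import collections

frame_buffer = collections.deque()  # (timestamp, frame) pairs
batch_out = []

def draw_markups(frame, inference):
    # Stand-in for real drawing: pair the frame with the inference used.
    return (frame, inference)

def on_frame(timestamp, frame):
    frame_buffer.append((timestamp, frame))

def on_inference(inference_timestamp, inference):
    if not frame_buffer:
        return
    # Find the buffered frame whose timestamp best matches the inference.
    best_ts, _ = min(frame_buffer,
                     key=lambda tf: abs(tf[0] - inference_timestamp))
    batch = []
    # Mark up the best-matching frame and every older one, then drop
    # them from the buffer; newer frames wait for the next inference.
    while frame_buffer and frame_buffer[0][0] <= best_ts:
        ts, frame = frame_buffer.popleft()
        batch.append(draw_markups(frame, inference))
    batch_out.append(batch)

# Frames at t=0..4 arrive; an inference computed on the frame near
# t=2 comes back, so frames 0..2 are marked up and flushed together.
for t in range(5):
    on_frame(t, f"frame{t}")
on_inference(2.1, "skeleton")
```

Frames newer than the best match (t=3 and t=4 here) stay buffered until
the next inference arrives, which is what makes the batch flush work.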
This has the advantage that the mark-ups are as well time-aligned with
the video as we can get them.
However, for networks with a LOT of latency, like OCR, we end up storing
so many frames that we can't send all of them to the RTSP server without
overwriting old frames (and therefore causing the video to jump). So to
combat this, I've made it so that when we have too many frames to send
to the RTSP server, we only send every Nth frame. However, this may lead
to an interesting situation where the mark-ups actually jump ahead of
the video and wait for the video to catch up.
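The every-Nth-frame fallback amounts to decimating an oversized batch.
A minimal sketch, with an illustrative max_frames budget standing in
for whatever the RTSP server can actually absorb:

```python
# Sketch of the fallback: when a marked-up batch exceeds what the RTSP
# server can take without overwriting old frames, keep every Nth frame.
def decimate(batch, max_frames):
    if len(batch) <= max_frames:
        return batch
    # Smallest integer stride that brings the batch under the limit.
    stride = -(-len(batch) // max_frames)  # ceiling division
    return batch[::stride]

# A 10-frame batch against a 4-frame budget keeps every 3rd frame.
print(decimate(list(range(10)), 4))  # [0, 3, 6, 9]
```

Dropping intermediate frames this way is what can let the mark-ups
appear to jump ahead of the video and wait for it to catch up.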
The disadvantages of this feature are that 1) when it is turned on, we
need a lot of memory to hold all the buffered frames, and 2) the result
video is held back by roughly the average latency of the neural
network. This is clearly visible when viewing the raw and result
streams side-by-side.