jdibenes / hl2ss

HoloLens 2 Sensor Streaming. Real-time streaming of HoloLens 2 sensor data over WiFi. Research Mode and External USB-C A/V supported.

Question on real-time PV streaming #131

Open iamwyh2019 opened 2 months ago

iamwyh2019 commented 2 months ago

I am using hl2ss to develop a real-time CV system. I am streaming personal video at 640x360@30FPS. At some point, due to network disturbance, the streaming seems to slow down to around 21 FPS. In this case, hl2ss seems to buffer previous frames and send them in time order, so the delay accumulates. In my case, I'm OK with losing a few frames, but I need real-time.

I wonder what I can change to achieve this. I am thinking of two fixes, but I am not sure which one would work (or whether either would):

  1. In stream_pv.cpp, in static void PV_Stream(SOCKET clientsocket), change videoFrameReader.AcquisitionMode(MediaFrameReaderAcquisitionMode::Buffered); to MediaFrameReaderAcquisitionMode::Realtime
  2. Somehow call g_pSinkWriter->Flush(g_dwVideoIndex); when a new frame arrives

Any help would be appreciated!

jdibenes commented 2 months ago

Hello, I think option 2 could mess up the encoder, but option 1 could work. If option 1 does not work by itself, then maybe a variable-framerate approach using a semaphore could work: ReleaseSemaphore is called after send in the SendSample callback; WaitForSingleObject with timeout 0 is called somewhere near the beginning of the FrameReceived callback; if the semaphore is acquired, the rest of the function runs and the frame is sent to the sink writer, otherwise the function returns immediately. I don't know how many frames (if any) the sink writer buffers, so selecting the initial/max count for the semaphore may require some experimentation, but it should be a small value like 1~16.
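
A minimal sketch of that semaphore idea, assuming hypothetical function names and a count of 4 (illustrative only, not the actual hl2ss code):

#include <windows.h>

static HANDLE g_sendSlots = NULL; // limits how many frames can be "in flight" at once

void InitFlowControl()
{
    // Initial/max count is a guess; as noted above, keep it small (1~16) and tune by experiment.
    g_sendSlots = CreateSemaphore(NULL, 4, 4, NULL);
}

// Near the start of the FrameReceived callback: true means encode and send this frame,
// false means drop it and return immediately.
bool TryAcquireSendSlot()
{
    return WaitForSingleObject(g_sendSlots, 0) == WAIT_OBJECT_0;
}

// In the SendSample callback, after the frame has been sent over the socket.
void ReleaseSendSlot()
{
    ReleaseSemaphore(g_sendSlots, 1, NULL);
}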

iamwyh2019 commented 2 months ago

Thanks for the suggestion!

Before trying the options I want to locate the specific error, i.e. whether it's because the camera framerate drops (unlikely) or the video stream bandwidth drops. To do that I am adding interfaces to accept callback functions from the C# side and call them when a frame is sent on the C++ side. Where exactly is the part where a frame gets sent? I'm assuming it's in stream_pv.cpp, bool ok = send_multiple(user->clientsocket, wsaBuf, sizeof(wsaBuf) / sizeof(WSABUF), g_frameSentCallBack)? Or is the send_multiple (and specifically WSASend) function asynchronous, so that I have to pass a callback to WSASend? I'm seeing you are not using overlapped mode, so I assume that's not the case. Or I could be wrong in locating the code.

iamwyh2019 commented 2 months ago

I actually experimented for a while and logged the framerate and bandwidth of the send_multiple call in PV_SendSample, as well as the framerate and bandwidth of the Python end receiving new frames like this:

while enable:
    stamp, data = self.video_sink.get_most_recent_frame()
    if data is not None and stamp != last_stamp:
        last_stamp = stamp
        self.get_frame_callback(data)

What I found is: the framerate and bandwidth of PV_SendSample are consistently around 30 FPS, while those of the backend drop to 20 FPS when I'm moving with the HoloLens. I also tried Realtime (option 1) but the problem persists. I think that suggests that WSASend is not where the images are physically sent. Instead, it buffers the images inside the socket buffer, waiting for them to be sent. Is that the case here?

jdibenes commented 2 months ago

All server sockets are blocking and sending is non-overlapped. All stream data is sent through send_multiple. WSASend should buffer according to this document https://learn.microsoft.com/en-us/previous-versions/troubleshoot/windows/win32/data-segment-tcp-winsock

To optimize performance at the application layer, Winsock copies data buffers from application send calls to a Winsock kernel buffer. Then, the stack uses its own heuristics (such as Nagle algorithm) to determine when to actually put the packet on the wire.
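
For reference, a blocking, non-overlapped scatter/gather send looks roughly like the sketch below (simplified, in the spirit of send_multiple, not the repo's actual code). The key point is that WSASend returns once the data has been copied into the Winsock kernel buffer, not when it has gone out on the wire.

#include <winsock2.h>

bool send_buffers(SOCKET s, WSABUF* buffers, DWORD count)
{
    DWORD bytes_sent = 0;
    // Blocking call: no OVERLAPPED structure and no completion routine.
    int result = WSASend(s, buffers, count, &bytes_sent, 0, NULL, NULL);
    return result != SOCKET_ERROR;
}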

I also tried logging the framerate but got ~30FPS on both ends even when moving (using the Unity server sample). In my case, the HoloLens is always close to the router though. Here is the script for reference:


import multiprocessing as mp
import cv2
import hl2ss
import hl2ss_lnm
import hl2ss_mp
import hl2ss_utilities

# HoloLens address
host = '192.168.1.7'

# Camera parameters
pv_width = 640
pv_height = 360
pv_framerate = 30

# Buffer length in seconds
buffer_length = 10

if __name__ == '__main__':
    hl2ss_lnm.start_subsystem_pv(host, hl2ss.StreamPort.PERSONAL_VIDEO)

    producer = hl2ss_mp.producer()
    producer.configure(hl2ss.StreamPort.PERSONAL_VIDEO, hl2ss_lnm.rx_pv(host, hl2ss.StreamPort.PERSONAL_VIDEO, width=pv_width, height=pv_height, framerate=pv_framerate, decoded_format='bgr24'))
    producer.initialize(hl2ss.StreamPort.PERSONAL_VIDEO, pv_framerate * buffer_length)
    producer.start(hl2ss.StreamPort.PERSONAL_VIDEO)

    consumer = hl2ss_mp.consumer()
    manager = mp.Manager()
    sink_pv = consumer.create_sink(producer, hl2ss.StreamPort.PERSONAL_VIDEO, manager, None)

    frame_stamp = sink_pv.get_attach_response()

    cv2.namedWindow('Control')

    fps = None
    prev_ts = None
    delta_ts = 0
    sample_time = 1

    # Main Loop ---------------------------------------------------------------
    while (True):
        if ((cv2.waitKey(1) & 0xFF) == 27):
            break

        '''
        state, _, data_pv = sink_pv.get_buffered_frame(frame_stamp)
        if (state < 0):
            frame_stamp += 1
            continue
        if (state > 0):
            continue
        frame_stamp += 1
        '''

        stamp, data_pv = sink_pv.get_most_recent_frame()
        if data_pv is not None and stamp != frame_stamp:
            frame_stamp = stamp
        else:
            continue

        if (prev_ts is not None):
            delta_ts += data_pv.timestamp - prev_ts
        prev_ts = data_pv.timestamp

        if (fps is None):
            fps = hl2ss_utilities.framerate_counter()
            fps.reset()
        else:
            fps.increment()
            if (fps.delta() > sample_time):
                print(f'FPS: {fps.get()} / {pv_framerate} DELTA: {delta_ts / hl2ss.TimeBase.HUNDREDS_OF_NANOSECONDS} / {sample_time}')
                delta_ts = 0
                fps.reset()

        cv2.imshow('Control', data_pv.payload.image)

    sink_pv.detach()
    producer.stop(hl2ss.StreamPort.PERSONAL_VIDEO)

    hl2ss_lnm.stop_subsystem_pv(host, hl2ss.StreamPort.PERSONAL_VIDEO)

iamwyh2019 commented 2 months ago

Thank you for the reference! In my case, I am experimenting in a university department building with multiple internet access points, and my observation is that transmission slows down on the backend when I'm moving in the building, but quickly stabilizes once I stand still. The FPS of calling WSASend is stable though, which can be explained by the fact that this system call does not physically send the data, it just buffers it in the socket buffer. I'll experiment more and get back to you.

iamwyh2019 commented 2 months ago

I experimented with a few scenarios and here are my findings:

  • There are multiple factors that could influence the latency when walking in a department building, such as switching between access points (routers), distance to the router, other processes in the Unity app, etc. They cause packet delay and loss.
  • Since we are streaming the video via TCP, it (1) resends lost packets, and (2) strictly ensures packets are in time order. This leads to the delay accumulating. I initially thought it was the buffering in Winsock, but it turns out TCP is "too reliable" in this case regardless of the buffering.

I changed the plugin to stream over UDP, and now the delay doesn't accumulate any more. I am working on getting some concrete numbers for that and will update here later. Anyway, great code! It doesn't require a lot of changes to switch to UDP. And if you want, I can clean up the code and make a PR. Thanks again for helping!

goodfella47 commented 2 months ago

I experimented with a few scenarios and here are my findings:

  • There are multiple factors that could influence the latency when walking in a department building, such as switching between access points (routers), distance to the router, other processes in the Unity app, etc. They cause packet delay and loss.
  • Since we are streaming the video via TCP, it (1) resends lost packets, and (2) strictly ensures packets are in time order. This leads to the delay accumulating. I initially thought it was the buffering in Winsock, but it turns out TCP is "too reliable" in this case regardless of the buffering.

I changed the plugin to stream over UDP, and now the delay doesn't accumulate any more. I am working on getting some concrete numbers for that and will update here later. Anyway, great code! It doesn't require a lot of changes to switch to UDP. And if you want, I can clean up the code and make a PR. Thanks again for helping!

Could you share your code? I'm dealing with the same problem and it would be really helpful. Thanks!

iamwyh2019 commented 2 months ago

Could you share your code? I'm dealing with the same problem and it would be really helpful. Thanks!

It's here: https://github.com/iamwyh2019/hl2ss. You can check stream_pv.cpp for the changes to the frontend logic, and hl2ss.py for the backend logic. Now, when calling hl2ss_lnm.rx_pv, instead of passing a single parameter port, you pass two parameters: control_port (which transmits commands over TCP) and stream_port (which streams video over UDP). For example, this is how I initialize the producer:

VIDEO_PORT = hl2ss.StreamPort.PERSONAL_VIDEO
VIDEO_UDP_PORT = hl2ss.StreamUDPPort.PERSONAL_VIDEO

producer = hl2ss_mp.producer()
producer.configure(VIDEO_PORT, hl2ss_lnm.rx_pv(host, control_port=VIDEO_PORT, stream_port=VIDEO_UDP_PORT, width=pv_width, height=pv_height, framerate=pv_framerate))

Besides this, I added a few callbacks when a frame is generated and sent via UDP. These callbacks can be exposed in the DLL and registered in C#. Check stream_pv.cpp for more details.

jdibenes commented 2 months ago

That's awesome. Thanks for sharing your solution.

iamwyh2019 commented 2 months ago

Hi Jdibenes, have you ever tried measuring the video streaming delay? I tried measuring the delay of 640x360@30FPS in this way:

  1. when a frame arrives, replace the timestamp with GetTickCount64.
  2. when the Python side receives it, immediately send this bytearray back via UDP.
  3. when the C++ side receives this sent-back bytearray, parse the first 8 bytes to get the timestamp.
  4. call GetTickCount64 to compute the round-trip delay and divide by 2.

What I got is a delay of around 30~50 ms when both ends are connected to my home WiFi and close to the router. It kind of seems legit but also sounds too good to be true, so I wonder if you have measured it before.
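
A minimal sketch of steps 3 and 4, assuming an already-set-up UDP echo socket (names are illustrative, not the code in the fork):

#include <winsock2.h>
#include <windows.h>
#include <cstdint>
#include <cstring>
#include <cstdio>

void MeasureRoundTrip(SOCKET echoSocket)
{
    char buffer[64];
    int received = recv(echoSocket, buffer, sizeof(buffer), 0);
    if (received < 8) { return; }

    // The first 8 bytes carry the GetTickCount64 value stamped in step 1.
    uint64_t sentTicks;
    memcpy(&sentTicks, buffer, sizeof(sentTicks));

    uint64_t roundTripMs = GetTickCount64() - sentTicks;
    printf("estimated one-way delay: %llu ms\n", roundTripMs / 2); // assumes a symmetric path
}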

jdibenes commented 2 months ago

I had the PV camera look at a stopwatch on the PC monitor, then put the PV video window next to the stopwatch, took a screenshot, and compared the time difference. I got a delay of about 270 ms for 1920x1080@30.

iamwyh2019 commented 2 months ago

Yeah! That's where I got confused. I tried the same thing (but at 640x360@30) and the delay seems to be 110ms. So I used time.time() to measure the time for unpacking, decoding, and cv2.imshow (basically everything after unpacker.unpack), but they only add up to around 12ms, so that's still off by a lot.

(screenshot attached)

jdibenes commented 2 months ago

Just to confirm, are you replacing the timestamp in PV_OnVideoFrameArrived or in PV_SendSample? Because PV_SendSample is after the encoder stuff.

iamwyh2019 commented 2 months ago

In PV_OnVideoFrameArrived. It's around line 121 in your current version. I simply changed it to pj.timestamp = GetTickCount64().

I didn't change pSample->SetSampleTime(timestamp) but that doesn't seem to matter.

jdibenes commented 2 months ago

Maybe the difference is the photons-to-PV_OnVideoFrameArrived delay. Might be able to estimate it by comparing the frame timestamp vs the QPC time when PV_OnVideoFrameArrived starts.
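
A rough sketch of that comparison, assuming the frame timestamp is in 100 ns ticks on the same QPC time base (variable names are illustrative):

#include <windows.h>
#include <cstdio>

void LogSensorToCallbackDelay(LONGLONG frameTimestampHns) // frame timestamp in 100 ns ticks
{
    LARGE_INTEGER counter, frequency;
    QueryPerformanceCounter(&counter);
    QueryPerformanceFrequency(&frequency);

    // Convert the current QPC reading to 100 ns ticks and compare with the frame timestamp.
    double nowHns = (10000000.0 * counter.QuadPart) / frequency.QuadPart;
    double delayMs = (nowHns - frameTimestampHns) / 10000.0;
    printf("photons-to-callback delay ~ %.1f ms\n", delayMs);
}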

iamwyh2019 commented 2 months ago

Genius idea! Turns out most of the delay comes from the HoloLens side:

So that does add up to around 110 ms. Now the optimization work veers towards optimizing the C++ part... any suggestions?

iamwyh2019 commented 2 months ago

Hi jdibenes, would you mind sharing some details about the CustomMediaSink, including:

  1. One instance of this media sink is associated with one instance of CustomStreamSink. When is the stream sink created? And when is the pHook called? I found it took 100 ms from the start of PV_OnVideoFrameArrived to PV_SendSample. Not sure how this callback is scheduled.
  2. How large is its internal buffer? I want to increase its buffer size or drop the earliest unprocessed frames.

jdibenes commented 2 months ago

Hi, CustomMediaSink and CustomStreamSink are just barebones implementations of the IMFMediaSink and IMFStreamSink interfaces and their purpose is to intercept encoded frames (IMFSample) and pass them to a callback function (pHook), all based on the model presented in https://learn.microsoft.com/en-us/windows/win32/medfound/sink-writer. The creation and configuration of the Sink Writer and its Media Sink (CustomMediaSink) are handled in custom_sink_writers.cpp L78. After that, the Media Sink is managed internally by the Sink Writer and the Media Foundation Library, including creating Stream Sinks (CustomStreamSink) and calling IMFStreamSink::ProcessSample (which calls pHook). The Sink Writer is configured to have one stream in L97 and I think this is where a single instance of CustomStreamSink is created but I'm not sure. I also have no idea how large the internal buffer size is as the library handles all these details. Finally, the video encoder generates an initial empty frame before the first video frame but I don't know if this translates to a delay of 33 ms (for 30 FPS). Here is more information about Media Sinks: https://learn.microsoft.com/en-us/windows/win32/medfound/media-sinks.
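
As a minimal illustration of that model (not the repo's actual code; only the creation call is shown and error handling is omitted), the custom media sink is handed to Media Foundation once and everything after that is driven by the Sink Writer:

#include <mfidl.h>
#include <mfreadwrite.h>

// pSink is an already-created custom IMFMediaSink (e.g. CustomMediaSink) whose
// stream sink forwards encoded IMFSamples to a hook callback.
HRESULT CreateWriterFromCustomSink(IMFMediaSink* pSink, IMFSinkWriter** ppWriter)
{
    // From here on the Sink Writer and Media Foundation manage the media sink,
    // including its stream sink(s) and the ProcessSample calls on encoded frames.
    return MFCreateSinkWriterFromMediaSink(pSink, nullptr, ppWriter);
}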

sergin3d2d commented 1 month ago

Hi @jdibenes, thanks a lot for creating such a great toolkit! I have a question that I believe is more relevant to this thread. What is the best possible latency that can be achieved when streaming from the HoloLens? We are streaming AHAT via Wi-Fi for optical tracking and are experiencing about 100-120 ms latency, even with the best possible Wi-Fi configuration. @iamwyh2019 mentioned measuring around 100 ms latency, most of which comes from the HoloLens. Is this purely a hardware limitation, or is there a way to reduce it?

iamwyh2019 commented 1 month ago

Hi @jdibenes, thanks a lot for creating such a great toolkit! I have a question that I believe is more relevant to this thread. What is the best possible latency that can be achieved when streaming from the HoloLens? We are streaming AHAT via Wi-Fi for optical tracking and are experiencing about 100-120 ms latency, even with the best possible Wi-Fi configuration. @iamwyh2019 mentioned measuring around 100 ms latency, most of which comes from the HoloLens. Is this purely a hardware limitation, or is there a way to reduce it?

I'm not sure about AHAT. For the RGB camera, there's an inherent system delay (a picture is taken ==> the picture is sent) of around 80 ms. To the best of my knowledge there's no way to fix that, since Windows only allows registering a callback and controls when to call the callback itself. It could be different for AHAT though.

Adding to the system delay is the network streaming delay, which you can estimate from your frame size and bandwidth. In my case each frame is around 2 KB and my WiFi is pretty fast, so ideally each frame has a ~2 ms transmission latency. It will vary, but most of the time it's really fast (<= 10 ms).
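
As a back-of-the-envelope check of that estimate (the 2 KB frame size is from above; the 20 Mbps of usable throughput is an assumed figure):

#include <cstdio>

int main()
{
    const double frame_bits = 2.0 * 1024 * 8; // ~2 KB encoded frame
    const double link_bps = 20e6;             // assumed usable Wi-Fi throughput
    // ~0.8 ms per frame at 20 Mbps; a few milliseconds on a slower or congested link.
    printf("per-frame transmission time ~ %.2f ms\n", 1000.0 * frame_bits / link_bps);
    return 0;
}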