Understand why reported niq latency is above target

roc-streaming / roc-toolkit

Real-time audio streaming over the network.

https://roc-streaming.org

Mozilla Public License 2.0

1.03k stars 205 forks source link

Understand why reported niq latency is above target #548

Open gavv opened 10 months ago

gavv commented 10 months ago

Commands:

roc-send -vv -s rtp+rs8m://lo:10001 -r rs8m://lo:10002 -c rtcp://lo:10003 -i file:stash/long.wav

roc-recv -vv -s rtp+rs8m://lo:10001 -r rs8m://lo:10002 -c rtcp://lo:10003

In roc-recv output, you will see that niq_latency is growing to 210ms and then is slowly floating around that value. Expected behavior is that it would float around 200ms (which is target latency; you can change it with --sess-latency).

However, if I dump all inputs and outputs of FreqEstimator right after its invocations, I see more expected plot:

(yellow is target latency, blue is actual latency).

Here is full dump: fe.txt

Columns in dump:

target niq latency (# of samples)
actual niq latency (# of samples)
freq_coeff returned from FreqEstimator
trimmed freq_coeff (according to min and max scaling) passed to Resampler

gavv commented 10 months ago

Plotting script:

#!env python3
import sys
import pylab
import numpy

target_latency_arr = []
latency_arr = []
coeff_arr = []
trim_coeff_arr = []

with open('fe.txt') as fp:
    for f in fp.readlines():
        fld = list(map(float, f.split(' ')))
        target_latency_arr.append(fld[0])
        latency_arr.append(fld[1])
        coeff_arr.append(fld[2])
        trim_coeff_arr.append(fld[3])

latency_arr = numpy.array(latency_arr)

print((latency_arr < target_latency_arr[0]).sum())
print((latency_arr > target_latency_arr[0]).sum())

pylab.plot(latency_arr, '.-')
pylab.plot(target_latency_arr, '.-')
pylab.show()

gavv commented 10 months ago

Note: output from this lines:

print((latency_arr < target_latency_arr[0]).sum())
print((latency_arr > target_latency_arr[0]).sum())

is:

37680
38779

gavv commented 10 months ago

Well, that's funny,

Currently we're reporting logs exactly every 5 seconds.

I tried to change it to one report per every 777 frames and now I see expected values both below and above target latency!

So it seems like those 5 seconds were somehow in phase with latency oscillations. What is insane that it is reproducing very reliably.

@baranovmv Thoughts?

gavv commented 10 months ago

OK, so I was able to fix the problem, but I still don't fully understand it.

Log reports were done each 5 seconds, and they were clocked by CPU clock.

Latency calculations were done each 5 ms, and they were clocked by pipeline (pulseaudio) clock.

The clock drift between sound card and CPU, in turn, causes slow drift of the moment when we're making report (i.e. sampling current latency).

Somehow this drift causes biased sampling results (higher than expected). I still don't understand how exactly.

I reworked LatencyMonitor to clock log reports by pipeline clock instead of CPU clock, and the problem was fixed. Now sampling results are oscillating around target value no matter of sampling period.

0ebf4c46fd5195d9f3d8ec8ed20ad48bb9759bd1

gavv commented 10 months ago

Probably it is important that log reports were clocked by timer on same CPU that clocked sender (it was sending a file via localhost).

So basically we were sampling latency with the sender's clock.

gavv commented 9 months ago

More thoughts about this case.

receiver creates session when it receives first packet
on session creation, receiver also starts log reporting timer, based on cpu clock
sender sends packets every 2.5ms based on cpu clock (same machine)
log reporting timer fires every 5 seconds, i.e. every 2000 packets
in result, log reports are pretty synchronized with the sender; report is always done very soon after adding packets to queue
log reports are not synchronized with pulseaudio clock, which drives removing packets from queue

So, sender adds packets to queue and temporary grows its size, based on its clock. PulseAudio removes packets from queue and temporary shrinks its size, based on another click. And reports are always triggered very soon after sender growed queue.

Then, remember that queue size that we saw in logs was above target by 10ms, which is 5 packets.

It could be explained if packets came in bursts. So that if report is always made soon after receiving a burst of packets (because of the coincidence described above), then the reported size will be biased proportionally to the burst size.

gavv commented 9 months ago

There are no visible bursts in packet receiving time.

(X is packet number, Y is recv time)

gavv commented 9 months ago

There are bursts in frame read time, triggered by PulseAudio.

X is frame number, Y is read time.

i.e., PulseAudio blocks us for a while, then unblocks and allows write 20 times in a row, then blocks for a while again. Each write is 1ms, i.e. each bust we feed it with 20ms of audio. Time between bursts is also 20ms.