bf40e76 disrupts remote control on Mac build

markjfine commented 1 month ago

Understand this was mostly a graphics update that doesn't really touch anything to do with remote control. Nevertheless, something is crushing remote control output. GQRX is acting on external commands but isn't responding reliably when queried. Correct operation is restored when bf40e76 is reverted via git.

argilo commented 1 month ago

Are you able to grab a stack trace in a debugger?

argilo commented 1 month ago

Oh wait, I thought it was a crash but I guess it's just misbehaving.

argilo commented 1 month ago

I don't immediately see what connection that commit could have to the remote control, so some investigation will be needed.

markjfine commented 1 month ago

Agree. I'll continue trying to track down what's going on by doing a bit more debugging. It's possible that there may be a Qt graphics processing delay (prioritising trace and waterfall over everything else) that requires me to increase the socket blocking request wait time following a command in Frequency Browser. It's already set to 50000 uSec, but that should be more than enough, otherwise it starts to get sluggish.

argilo commented 1 month ago

I believe graphics and networking are both processed on the UI thread (in an event loop) so changes to the plotter could affect timing of remote control commands.

Expecting a remote control response within a short time frame may not be a great idea. Things like FFT size changes can hold up the UI thread for a while.

argilo commented 1 month ago

That said, I would expect the performance improvements in #1383 to result in the UI thread spending less time on drawing, not more.

markjfine commented 1 month ago

I would expect the performance improvements in #1383 to result in the UI thread spending less time on drawing, not more.

Normally, I'd tend to agree. But there's a massive hit on responsiveness when bf40e76 is applied.

I've been typically running GQRX with an FFT Size of 16384 and Rate of 25 fps. This has never been an issue with respect to control timing.

However, now whenever I tune to something, or change the bandwidth or mode remotely, I've noticed there's a marker over the Rate label that says '>>> Rate' in white on red background for a certain length of time. I can drop down to 1 fps and this will still occur. Previously this was a short blink and I hardly even noticed it but now it is a lot more pronounced.

Moreso, what I've noticed is that no comms seems to occur until the red warning goes back to normal indicating some kind of processing delay. This could explain why an initial tuning, mode, or bandwidth command works, but any follow-up query of meter, freq, mode, and bandwidth seems to fail until '>>> Rate' reverts back to just 'Rate'.

I have a Sync button on Frequency Browser that will re-query everything, just in case the user changes something manually on GQRX and wants to re-synch the app (usually for station database lookup). This works fine for when the above query fails. It can be used after the Rate label returns to normal, but is kind of cumbersome after a while.

Just for reference, this is on a MacBook M1 Pro (2021), 16" Retina XDR display, 16-core GPU, 16 GB memory, running Sequoia 15.0.1.

Going to continue to examine @yuzawa-san's changes to see what could be causing this.

yuzawa-san commented 1 month ago

i have no idea what could have caused this. i personally do not use the remote, but if you let me know the exact steps you took i could try to reproduce. it is ok when the remote is not used?

argilo commented 1 month ago

If you run in a debugger and press Ctrl+C during a stall, the backtrace should reveal the culprit.

markjfine commented 1 month ago

i have no idea what could have caused this. i personally do not use the remote, but if you let me know the exact steps you took i could try to reproduce. it is ok when the remote is not used?

The steps I took to reproduce was first to 'git pull origin master -t', then build. Tested and noticed the remote communications problem. Attempted to increase socket windows, which I took from 50000 uSec to 400000 uSec, which was clunky and was still erratic. Taking the Rate down from 25 fps to 2 helped a little but still was erratic. Did a 'git revert bf40e76 -n' and rebuilt, set the parameters back to the way they were, problem went away.

Actually, the '>>> Rate' blip is still more pronounced than before even without using Frequency Browser. You can manually tune with Frequency Manager killed and it still appeared. I should note that to me, the '>>> Rate' thing is seems like a symptom of processing getting bogged down.

Other FFT Settings I'm using (if it helps) are: WF Span: Auto Window: Blackman-Harris Plot Mode: Avg, green, Fill Plot Scale: dBFS Averaging: Slightly higher than mid-way, just Band Plan checked. Split Plot: Half-way Plot and WF locked WF Mode: Sync, Gqrx Freq Zoom: off.

Also noticed you were doing most of your work in the FM band, whereas I'm using Shortwave. Guaranteed there's a lot more active drawing to do in a couple of MHz of SW than in the FM band. That said, I've looked at your code and the only thing that looks like it could radically affect things is the conversion in some areas from discrete pen graphics in Plotter.cpp, but that's just supposition on my part.

yuzawa-san commented 1 month ago

ok i'm at a bit of a loss right now. i looked over the pr and basically every optimization in there should have made things better, since in a lot of places its simply doing physically less stuff. i found that of the optimizations the fill was most improved. that by far used the most CPU prior, so i'm shocked that the latest changes would be worse. maybe disable the fill and see if that improves stuff, at least that may be a lead.

in my original pr, i generated the flame graphs. i used the https://formulae.brew.sh/formula/flamegraph and ran this once the application was up and running, feel free to tune the params if needed:

sudo dtrace -x ustackframes=256 -n 'profile-1ms /execname == "gqrx"/ { @[ustack()] = count(); } tick-15s { exit(0); }' \
| tee /tmp/$(date +"%Y-%m-%d_%H.%M.%S")-ProcessName_sample.txt \
| stackcollapse.pl \
| flamegraph.pl \
> /tmp/$(date +"%Y-%m-%d_%H.%M.%S")-WindowServer_sample.txt.svg

maybe something would pop out there which is unique to your use case.

markjfine commented 1 month ago

Hah... I cherry-picked the pr back in, then flipped Fill off and it's now working properly again. Interesting. So now we know where to look.

markjfine commented 1 month ago

Also ran flamegraph with both GQRX and Frequency Browser running.

First with Fill off: 2024-10-20_19 18 03-WindowServer_sample txt

Then with Fill engaged: 2024-10-20_19 21 39-WindowServer_sample txt

I need to read more about what I'm looking at, but if I had to guess it looks like Fill is eating a cpu on this box.

yuzawa-san commented 1 month ago

very interesting, we got real lucky on that hunch. i found that the fill (before) was actually done in a shared thread pool (see attached flame graph screenshot, middle section see QThreadPoolThread::run), so most of the fills in the ui event loop were quite short since the "work" was offloaded to those other threads. this may explain why the event loop would appears more clogged now. since the work was moved it may make it appear there is a slow down in that ui loop. unclear how we should proceed from here. the prior implementation was quite cpu intensive but did not jam up the event loop, but the new one does appear to jam up the loop while using less cpu. the remote actions should live in that event loop being they are essentially actions on the ui as i far as i see it. a compromise could be to to do the expensive drawing near the "top" where the variance is, but draw the "bottom" where there is no variance as a solid fillRect (which should be faster). my hunch is that when the avg/max line has a lot of variance, the polygon rasterizer has to break it into a lot of little nonsequential writes. hey at least the solid color composition from that PR is working. i'll dig in over the next week.

2024-10-02_19 37 32-WindowServer_sample txt

yuzawa-san commented 4 weeks ago

@markjfine please try out https://github.com/gqrx-sdr/gqrx/pull/1391

markjfine commented 4 weeks ago

@markjfine please try out #1391

Still causes a significant delay when the Fill is switched on.

markjfine commented 4 weeks ago

Also: Still have the red '>>> Rate' warning when manually tuning. Not as pronounced as it was before, but still there. That could indicate some progress.

yuzawa-san commented 4 weeks ago

can you please attach another flame graph

markjfine commented 4 weeks ago

2024-10-24_20 00 03-WindowServer_sample txt

yuzawa-san commented 4 weeks ago

ok i decided to use vertical lines in this branch. i found there were further improvements, please give that a try and i'll update the PR if it looks promising. https://github.com/gqrx-sdr/gqrx/compare/master...yuzawa-san:gqrx:fill-optimization2?expand=1

markjfine commented 4 weeks ago

I think we may have regressed. It's possible that you may have corrected the drawPolygon and fillPolygon the last time, and the rest of the problem was somewhere within draw and drawPoints. Here's the flame using the changed plotter.cpp and plotter.h:

2024-10-25_08 22 56-WindowServer_sample txt

yuzawa-san commented 4 weeks ago

hmm that flamegraph is a bit odd, there should be a drawLines, but it is not present at all. its not like drawPolygons is an inlined version of drawLines to the best of my knowledge. also the drawPolygon would call drawPath instead of fillPolygon. can you try a clean build from that pr. i cannot seem to reproduce this, even with your settings. my flame graphs and settings looks different and i have good performance.

fill off: 2024-10-25_17 16 43-WindowServer_sample txt

fill on:

2024-10-25_17 15 57-WindowServer_sample txt

regardless, these graphs are a significant improvement over what is currently in master (which is here https://github.com/gqrx-sdr/gqrx/issues/1386#issuecomment-2425286509 ), so i have updated my PR. i'm not sure how much more optimization can be done. my only hunch was that since the previous fills were happening in the thread pool, the parallelism gave an overall speedup in certain circumstances (possibly number of under-utilized cores and overall system utilization). however in general, i found that the fill setting was basically unusable past several fps in the old version on my 4 core macbook:

2.17.5 with those settings and fill on (177% cpu):

master@0418221 with those settings and fill on (110% cpu):

this branch with those settings and fill off (66% cpu):

are you still reporting issues on your end even after a clean build? is a solution at this point, if you need very fast remote responses, just to disable fill?

markjfine commented 3 weeks ago

Just so we're clear, I'm using the current git origin master but with the modified plotter.cpp and plotter.h from the pr. I always do a clean build by deleting and remaking the build folder, running 'cmake ..', then 'make -j15'.

After I mentioned the first set still had a problem (flame chart with drawPoints elongated), I think you tried to reintroduce something in drawPolygons that regressed the problem and made it worse (as depicted). As I intimated, this leads me to believe there's an additional problem in the draw and drawPoints routines, and that the first set of adjustments to drawPolygons that you did was fine.

Yes, I can turn Fill off and everything works as it should - as a workaround. However, there are other people that are regularly using Frequency Browser in conjunction with GQRX, and have been doing so with no issues. Not sure that will be the case if GQRX's remote capability can't be restored to function as it was.

And, it appears I'm not alone. It appears someone else has commented on the PR that this has slowed his machine as well.

markjfine commented 3 weeks ago

Latest rev (https://github.com/gqrx-sdr/gqrx/pull/1391#issuecomment-2439204206, fillRect mod) seems to be working. Rebuilt clean using the revised plotter.cpp. Ran GQRX with Fill engaged and Frequency Browser actively polling signal levels. No '>>> Rate' warnings and no missed remote control queries, even when manually tuning.

Attached is latest flamegraph with Fill engaged: 2024-10-26_07 50 07-WindowServer_sample txt

It occurred to me late yesterday when I thought about draw/drawPoints that it could be a latent Qt graphics issue. Glad we could come to an agreeable solution.

argilo commented 3 weeks ago

To help with measurement, I put together this Python script, which hammers Gqrx with remote commands and measures the worst latency over one-second intervals:

#!/usr/bin/env python3

import socket
import time

HOST = "127.0.0.1"
PORT = 7356
INTERVAL_LENGTH = 1.0

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect((HOST, PORT))

    while True:
        start_interval = time.time()
        worst = 0
        while time.time() < start_interval + INTERVAL_LENGTH:
            s.sendall(b"f\n")
            start = time.time()
            s.recv(1024)
            end = time.time()
            diff_ms = (end - start) * 1000
            if diff_ms > worst:
                worst = diff_ms

        print(f"Worst latency: {worst:.2f} ms")

Sample output, tested with Gqrx 2.17.5 running on Linux, full-screen with a low-DPI monitor:

Worst latency: 4.83 ms
Worst latency: 5.79 ms
Worst latency: 5.06 ms
Worst latency: 4.87 ms
Worst latency: 14.00 ms
Worst latency: 14.02 ms
Worst latency: 13.66 ms
Worst latency: 14.97 ms

In the middle of that run I enabled Fill, which increased the latency from ~5 to ~14 ms.

Even on Gqrx 2.17.5, massive latency numbers (several seconds!) can be seen if the plotter axes are dragged or zoomed while Fill is turned on.

Perhaps plotter painting should be done on a background thread so it doesn't tie up the UI thread.

argilo commented 3 weeks ago

On my system (Ubuntu 24.04, Qt 6.4.2), latency numbers are actually improved on the master branch (v2.17.5-13-g0218221):

[ Fill off ]
Worst latency: 5.49 ms
Worst latency: 5.26 ms
Worst latency: 4.17 ms
Worst latency: 4.65 ms
[ Fill on ]
Worst latency: 10.24 ms
Worst latency: 10.82 ms
Worst latency: 9.27 ms
Worst latency: 8.99 ms

markjfine commented 3 weeks ago

Good confirmation for my 50mS blocking window. I'm using 250mS for mode/bw calls, but for the life of me can't remember my rationale for doing so on set commands - possibly because they're combined. 🤷🏻‍♂️

This also provides a brilliant way to pelt the app with different commands/queries (other than freq) and measure the response outside of Freq Browser's xCode debug environment. Cheers, @argilo.

markjfine commented 3 weeks ago

...just remember to disable the firewall in macOS. Otherwise you won't be able to use the port again until you reboot. #ugh

vladisslav2011 commented 3 weeks ago

I've slightly modified the latency testing python script to track the best and the average latencies in addition to a worst one and tested my fork (with some patches from @yuzawa-san ): plotter-fill

The fill was enabled during test and led to maximum latency increase from 22...23 ms to ~24 ms, but the average latency is close to ~50 us both with and without the fill.

Another test at 100fps (when plotter takes >99% of Qt GUI thread time) shows the same insignificant increase in maximum latency from 22...24 ms to 26..28 ms, and small change in average latency from 13 ms to 14.5 ms when switching from "fill off" to "fill on" mode: plotter-fill2

I can't believe that M1 MacBook has hard times rendering the plotter at 16k FFT/25fps as my laptop (i7-8550u) is rendering it at 64k FFT/75fps with CPU constrained to max non-turbo 1.8GHz clock.

markjfine commented 3 weeks ago

With the current changes in place, I experienced no more than 10mS on average with Fill engaged on my M1 at 16k/25fps. It jumped to over 20mS when moving the window, but this is certainly a workable range. Incidentally, manually changing the mode drove the latency between 100-200mS.

While we're talking PCs: I should also test on my gaming i7 (Lenovo Legion 5 Pro with nVidia GeForce RTX 3050 Ti) while dual-booted into Fedora, just to satisfy my curiosity.

markjfine commented 3 weeks ago

Similar numbers under Linux using the same parameters: 3 or 4 mS and only jumping to 10 or 20mS when moving the window. Much better on mode changes: < 20mS there.

yuzawa-san commented 2 weeks ago

is #1391 a suitable solution to this issue?

markjfine commented 2 weeks ago

is #1391 a suitable solution to this issue?

I thought so.

gqrx-sdr / gqrx

bf40e76 disrupts remote control on Mac build #1386