napari / napari

napari: a fast, interactive, multi-dimensional image viewer for python
https://napari.org
BSD 3-Clause "New" or "Revised" License
2.1k stars 410 forks source link

Very slow rendering when using `autoswap=True` in the VisPy Canvas #5734

Closed ksofiyuk closed 1 year ago

ksofiyuk commented 1 year ago

šŸ› Bug

I was really puzzled why the brush was not so responsive as it should have even on low-resolution images. I measured FPS and discovered that it never exceeded 10-15 when you use the brush even on very small images (e.g. examples\add_labels.py).

Initially, I thought that it was because of some very inefficient code in the Label layer, but after long hours with a debugger and profiler I eliminated this hypothesis. There is certainly room for improvement, but there is no such code that can limit FPS to 15 even on 100x100 images.

Finally, I found the culprit of this issue, it was because of using autoswap=True in the VisPy Canvas. Setting autoswap to False in the following code immediately increases the FPS from 13 to 50 in examples\add_labels.py on my laptop: https://github.com/napari/napari/blob/cf55aa005230a89b906be2447ce900be2c1f200a/napari/_qt/qt_viewer.py#L431-L436

Calling Canvas.swap_buffers at each update below blocks the OpenGL engine at least for 50-60 ms on my laptop not depending on the resolution of an image causing a significant drop in the rendering performance: https://github.com/vispy/vispy/blob/5449e3402b35da4d6a8215d8d72e19dcf05917d7/vispy/app/canvas.py#L236-L237 which leads to the execution of the following code: https://github.com/vispy/vispy/blob/5449e3402b35da4d6a8215d8d72e19dcf05917d7/vispy/app/backends/_qt.py#L869

I didn't dig any deeper into VisPy as disabling autoswap solves the issue without any noticeable bugs so far for me.

Environment

I tested it on a laptop with Windows and integrated graphics:

napari: 0.5.0a2.dev55+g8a793b9f.d20230406
Platform: Windows-10-10.0.22621-SP0
Python: 3.9.1 (default, Dec 11 2020, 09:29:25) [MSC v.1916 64 bit (AMD64)]
Qt: 5.12.9
PyQt5: 5.12.3
NumPy: 1.22.4
SciPy: 1.9.1
Dask: 2023.3.2
VisPy: 0.12.2
magicgui: 0.7.2
superqt: unknown
in-n-out: 0.1.7
app-model: 0.1.2
npe2: 0.6.2

OpenGL:
- GL version: 4.6.13596 Compatibility Profile Context 20.10.22.14 27.20.11022.14001
- MAX_TEXTURE_SIZE: 16384
ksofiyuk commented 1 year ago

I found a similar issue that someone had on Stackoverflow 6 years ago: PyQt5 OpenGL swapBuffers very slow

jni commented 1 year ago

Cool find! I must say I have no idea what autoswap and swapbuffers does, and therefore what the consequences are of turning it off. The description in the vispy canvas doctring doesn't help much... @perlman @brisvag @alisterburt... Any insights here?

ksofiyuk commented 1 year ago

@jni I've tested it more and it looks like autoswap slows down everything in napari. When you drag the canvas using the pan the FPS rises from 15 to 60 as well after setting autoswap to False.

And it looks redundant because I haven't seen any rendering problems with autoswap=False.

jni commented 1 year ago

Sure, I'm definitely on board with it not being important, but I'd like to know why it defaults to True if that's the case.

What machine do you use? I rarely see 15fps other than in labels painting... So I wonder if there is some OS-specific stuff going on here...

ksofiyuk commented 1 year ago

What machine do you use?

I'm using the machine with the following specs:

Processor: AMD Ryzen 5 4600H with Radeon Graphics, 3000 Mhz, 6 Core(s), 12 Logical Processor(s)
Installed RAM:  16.0 GB (15.4 GB usable)

It is actually not a slow machine at all. That's why I was so confused by such low FPS in napari.

I rarely see 15fps other than in labels painting...

Did you measure it? I didn't also think that actually dragging is so slow until I just measured it and saw 15 FPS. For me, not very high FPS in dragging is simply less noticeable and disturbing.

ksofiyuk commented 1 year ago

@jni I have another much more high-end hardware with GeForce RTX 3080, and while testing with it I haven't noticed at least clearly visible low FPS (but I didn't measure the FPS on it and how disabling autoswap exactly affects the FPS). But I would be very surprised if it was laggy on such a machine.

jni commented 1 year ago

doing the viewer_fps_label.py example with a 5000x5000 np.random.random, I see 30-90fps on my Macbook Pro... It feels very smooth...

I wonder if it has something to do with GPU memory transfer rather than using shared memory like in the Macbooks...?

Ultimately, I'm not against removing it, but I'd love to understand what it does before we do it. Otherwise it feels like removing a nut from a car without understanding what it's holding... šŸ˜…

ksofiyuk commented 1 year ago

I wonder if it has something to do with GPU memory transfer rather than using shared memory like in the Macbooks...?

It doesn't look to be a valid hypothesis as my laptop with Ryzen 5 4600H has integrated graphics, which shares the same RAM with the CPU.

jni commented 1 year ago

ah, ok, I thought "radeon graphics" meant discrete GPU. Well... šŸ¤·!

alisterburt commented 1 year ago

Very cool find @ksofiyuk and thanks for the ping @jni

From some Googling it seems "double buffering" is used to prevent flickering and improve the smoothness

An OpenGL canvas maintains two buffers for drawing:

The front buffer is the buffer that is currently visible on the screen. It contains the latest completed frame. The back buffer is the buffer that is not visible on the screen. It is where the application draws the next frame while the front buffer is being displayed.

"swap buffers" swaps the roles of the front and back buffers. The back buffer, which contains the newly drawn frame, becomes the front buffer and is displayed on the screen, while the previous front buffer becomes the back buffer and is now used for drawing the next frame.

It's possible that turning this off will lead to some 'tearing' where partially rendered frames are visible sometimes - I don't have any experience with this so I vote we run the experiment, get a few people on different systems to test and see what happens!

alisterburt commented 1 year ago

There are some hardware calls which will be pretty dependent on the windowing system/openGL installation... "Swapping buffers is typically performed by calling a function such as SwapBuffers (in Windows) or glXSwapBuffers (in Linux), depending on the platform and the windowing system being used."

I suspect this will be some particular hardware/software interaction causing a slowdown - I also see >60fps most of the time but still think we should push ahead and experiment with turning autoswap off if the potential gains are so significant

ksofiyuk commented 1 year ago

doing the viewer_fps_label.py example with a 5000x5000 np.random.random, I see 30-90fps on my Macbook Pro... It feels very smooth...

@jni Btw, examples/viewer_fps_label.py doesn't seem to work for me using the current main. The callback update_fps is never called...

alisterburt commented 1 year ago

@ksofiyuk just tested on my machine and working fine, platform specific bug potentially?

napari: 0.5.0a2.dev129+ge8d92f15
Platform: macOS-13.1-arm64-arm-64bit
System: MacOS 13.1
Python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:41:22) [Clang 13.0.1 ]
Qt: 5.15.4
PyQt5: 5.15.7
NumPy: 1.23.2
SciPy: 1.9.0
Dask: 2022.8.1
VisPy: 0.12.1
magicgui: 0.5.1
superqt: unknown
in-n-out: 0.1.5
app-model: 0.1.1
npe2: 0.6.1

OpenGL:
  - GL version:  2.1 Metal - 83
  - MAX_TEXTURE_SIZE: 16384

Screens:
  - screen 1: resolution 1512x982, scale 2.0

Settings path:
  - /Users/alisterburt/Library/Application Support/napari/napari_fbb2073d056ea8768c19c829de069e5465be0c67/settings.yaml
Plugins:
  - napari: 0.5.0a2.dev129+ge8d92f15 (77 contributions)
  - napari-console: 0.0.6 (0 contributions)
  - napari-mrcfile-reader: 0.1.3 (2 contributions)
  - napari-svg: 0.1.6 (2 contributions)
ksofiyuk commented 1 year ago

just tested on my machine and working fine, platform specific bug potentially?

@alisterburt Is your replay related to examples/viewer_fps_label.py?

alisterburt commented 1 year ago

@ksofiyuk yep - just asserting that the callback is indeed called and the example runs as expected on my machine on main

ksofiyuk commented 1 year ago

@ksofiyuk yep - just asserting that the callback is indeed called and the example runs as expected on my machine on main

Ok, then it looks like it is indeed a platform specific bug.

ksofiyuk commented 1 year ago

@jni @alisterburt Sorry, for false reporting, it was my fault, I just accidentally broke my local vispy while debugging the subject of this topic. Now, it's working for me.

P.S. It is only regarding examples/viewer_fps_label.py

psobolewskiPhD commented 1 year ago

@ksofiyuk Probably not directly related to this, but I note you have PyQt5: 5.12.3 Is there a specific reason you're using this version, which is quite old (it's napari min requirement)? I ask because it's increasingly tricky to support pre 5.13 Qt while trying to improve Qt6 support.

ksofiyuk commented 1 year ago

@jni with autoswap=True even with the default small 5x5x5 volume it shows 16 FPS šŸ˜•

autoswap = True 2023-04-16_11-42-42

autoswap = False 2023-04-16_11-45-28

ksofiyuk commented 1 year ago

@ksofiyuk Probably not directly related to this, but I note you have PyQt5: 5.12.3 Is there a specific reason you're using this version, which is quite old (it's napari min requirement)? I ask because it's increasingly tricky to support pre 5.13 Qt while trying to improve Qt6 support.

@psobolewskiPhD I've just updated it all to Qt: 5.15.2 PyQt5: 5.15.9. The performance issue with autoswap=True remains exactly the same.

There is no any specific reason for PyQt5: 5.12.3. I just haven't updated it for a long time šŸ˜„ .

alisterburt commented 1 year ago

thanks for reporting back @ksofiyuk !

I ran the same test as you @ksofiyuk and enabling autoswap makes no discernible difference in FPS on my system.

It does not feel like the double buffering makes a big difference during 2D/3D rendering on my system, there is a slight difference when rotating the 3D camera extremely quickly but nothing perceptible when doing things 'reasonably'. In this context, I think it would be safe to default to autoswap=False on the canvas to solve this problem

Czaki commented 1 year ago

Maybe it is a bug in @ksofiyuk Windows manager. Like in crashing the Windows manager on a big cursor?

As @alisterburt described, the swap buffer is to avoid flickering. It is a common practice used in programming GUI applications (even many games using it).

So if there will be a plan to fix it, then it should be done by env variable or settings. Not globally.

ksofiyuk commented 1 year ago

Maybe it is a bug in @ksofiyuk Windows manager. Like in crashing the Windows manager on a big cursor?

@Czaki On the contrary, I don't have any problems with a big cursor on this Windows laptop. I have this issue with another high-end computer with RTX 3080 on which Xubuntu is installed (and where I don't have this issue with autoswap). šŸ˜„

ksofiyuk commented 1 year ago

As @alisterburt described, the swap buffer is to avoid flickering. It is a common practice used in programming GUI applications (even many games using it).

@Czaki @alisterburt I'm pretty sure that swapBuffers is just not used in a proper way in VisPy causing performance issues on some machines. It is just like here PyQt5 OpenGL swapBuffers very slow, where the issue was caused by inappropriate use of swapBuffers.

I could blame my system if all OpenGL applications were laggy, but I don't have problems with other 3D applications, even with other PyQt apps that use OpenGL.

Czaki commented 1 year ago

which Xubuntu

I remember multiple problems on Xubuntu, and I finally switched to Gnome Shell. As I read the team who manages XFCE is very limited and they do not have enough resources to fix all problems.

It is just like here PyQt5 OpenGL swapBuffers very slow, where the issue was caused by inappropriate use of swapBuffers.

Based on the accepted response to linked StackOverflow. Could you check if #5710 solves your problem?

ksofiyuk commented 1 year ago

Could you check if https://github.com/napari/napari/pull/5710 solves your problem?

@Czaki I've checked it a few days ago. Unfortunately, it didn't help at all. As I described in the first message, I did a lot of profiling of napari to figure out the reason of slow down. It turned out that swapBuffers blocks the OpenGL for 50-60 ms every call, no matter what the size of an image. It has nothing to do with napari.

ksofiyuk commented 1 year ago

I'm not an expert on OpenGL, so it is hard for me to say what exactly causes this slow down, but I can imagine that the way the tasks are sent to OpenGL or how often you call swapBuffers can affect the overall performance.

alisterburt commented 1 year ago

this is really interesting, I'm learning a lot reading around this - it would be interesting to see whether throttling only the swapBuffers calls improves performance for you

If you want to play with this @ksofiyuk I think you could throttle the callback in vispy directly with https://psygnal.readthedocs.io/en/stable/API/throttler/

ksofiyuk commented 1 year ago

@Czaki @alisterburt One possible explanation that I can think of - the way how swapBuffers acts (in synchronous or asynchronous mode) clearly depends on your system, even maybe on the driver of your video card. In my case, it is clearly executed in a blocking way waiting for all OpenGL commands to finish before swaping the buffers, which causes the slow down.

Updating something like 5x5 image is definitely cannot take anything near 50 ms even on 20 years old machines, so it means that VisPy on each draw update is likely to send some other much more heavy tasks irrelevant to the image update itself.

On the systems, where swapBuffers is executed in an asynchronous way, you just cannot notice this irrelevant background tasks, because OpenGL executes in asynchronous by design, it just makes your GPU a little bit more busy.

In order to proper solve this issue, we need to understand what tasks OpenGL executes exactly at each draw update. I think just disabling swapBuffers or even throttling it is like sweeping problems under the carpet, because there are definitely some other issues that are really the cause of the problem. Very long execution of swapBuffers is just a symptom.

Czaki commented 1 year ago

@ksofiyuk As the problem is on windows. Did you have propertiary driver, and if yes in which version?

ksofiyuk commented 1 year ago

As the problem is on windows. Did you have propertiary driver, and if yes in which version?

@Czaki I have the Ryzen 5 4600H CPU with Radeon RX Vega 6 integrated graphics. I was using the driver that Windows installs and updates automatically. It turned out that it was very old (version 22 from 2020), I updated it to the latest version - 31.0.14043.7000 (26-Mar-23).

However, unfortunately it didn't resolve the issue. But it gave me some boost. Now I have 18-20 FPS with autoswap=True instead of previous 15-16 (+5 FPS boost!), and when I use add_labels.py example with autoswap=False the FPS increased from 50 to 60. The idea of updating driver was good and gave me some performance boost, but it is not a solution.

Czaki commented 1 year ago

looking on this wiki https://www.khronos.org/opengl/wiki/Swap_Interval it looks like selection of synchronized and non synchronized swap buffer should be possible.

I see few references to glFinish in vispy. Maybe this could be a good way to research?

ksofiyuk commented 1 year ago

@Czaki @alisterburt The issue looks to be caused by the fact that VisPy does a lot of stuff in a single thread. The correct way of dealing with OpenGL is to keep it in a separate thread from other data processing since it can be blocked in an unintuitive way known only to the GPU driver - for example, it may block the current thread thinking that it does nothing and the resources should be allocated to another application.

I've just done a profiling of OpenGL calls, and and it takes less than 1 ms to process all the VisPy OpenGL tasks for a single frame, the driver thinks that this application is not heavy-loaded and just gives it very low priority allocating time to another processes.

image

I've found this answer on StackOverflow useful: Avoid waiting on SwapBuffers

ksofiyuk commented 1 year ago

When I put SwapBuffers in a separate thread in VisPy, the issue is completely gone - SwapBuffers are executed correctly (judging by a trace log of OpenGL calls) and VisPy and napari are not blocked anymore.

Czaki commented 1 year ago

And which mechanism do you use to synchronize threads?

ksofiyuk commented 1 year ago

And which mechanism do you use to synchronize threads?

@Czaki I synchronized them using queue.Queue, the main thread just puts tasks to another thread, a thread is blocked until it gets a new task - very simple stuff. I did it in an ugly and hard-coded way just to test the hypothesis. šŸ˜„

Czaki commented 1 year ago

Maybe threading.Lock could be better mechanism for this.

Should this fix happen in napari or in vispy? If in vispy then call @brisvag for review as he is napari core-dev and has permissions in vispy.

ksofiyuk commented 1 year ago

Did this fix should happen in napari or in vispy?

I think it has almost nothing to do with napari as all the OpenGL stuff is happening inside VisPy. VisPy should just have another thread for OpenGL stuff.

The simplest solution is to put only SwapBuffers in a separate thread, it works at least for me and completely resolves the issue, using something like this (this code is just for illustrative purposes):

    # def __init__
        self._swap_buffers_worker = threading.Thread(target=self.swap_buffers_thread, daemon=True)
        self._swap_buffers_queue = queue.Queue()
        self._swap_buffers_worker.start()
        # ...

    def swap_buffers(self, event=None):
        self._swap_buffers_queue.put(1)

    def swap_buffers_thread(self):
        while True:
            self._swap_buffers_queue.get()
            self._backend._vispy_swap_buffers()

instead of just

    def swap_buffers(self, event=None):
        self._backend._vispy_swap_buffers()

I think it should be implemented somewhere there: https://github.com/vispy/vispy/blob/5449e3402b35da4d6a8215d8d72e19dcf05917d7/vispy/app/canvas.py#L415

ksofiyuk commented 1 year ago

Maybe threading.Lock could be better mechanism for this.

@Czaki Agreed, Lock seems more suitable here:

    def swap_buffers(self, event=None):
        self._swap_buffers_lock.release()

    def swap_buffers_thread(self):
        while self._swap_buffers_lock.acquire():
            self._backend._vispy_swap_buffers()
Czaki commented 1 year ago

I wrote the previous comment from memory.

I just checked the documentation. And there is such information:

When invoked on an unlocked lock, a RuntimeError is raised.

So it should be something like

    def swap_buffers(self, event=None):
        if self._swap_buffers_lock.locked()
            self._swap_buffers_lock.release()

Or go back to queue.

But I start thinking if we should not use barrier https://docs.python.org/3/library/threading.html#threading.Barrier

The general thing is that swap buffer is used to swap rendered buffer and draw buffer.

So if we call swap buffer in async way it may happen that the "signal" to swap buffer was sent. Then new draw phase starts. And during this draw phase the swap will be started.

If we enforce synchronization using barriers it should reduce the probability of such an event as the main thread will wait until the sync thread is ready.

ksofiyuk commented 1 year ago

So if we call swap buffer in async way it may happen that the "signal" to swap buffer was sent. Then new draw phase starts. And during this draw phase the swap will be started.

As far as I understand, that cannot happen in OpenGL. OpenGL uses its own scheduler, when you call its methods, you just put tasks in its internal queue:

OpenGL Rendering Commands are assumed to execute be asynchronous. If you call any of the glDraw* functions to initiate rendering, it is not at all guaranteed that the rendering has finished by the time the call returns. Indeed, it is perfectly legitimate for rendering to not even have started when this function returns. The OpenGL specification allows implementations the freedom to get around to rendering commands whenever it is best for them.

So, OpenGL knows the correct order of its tasks, no matter when you put them unless of course you have separate threads that are fighting for the same OpenGL context, which is not our case.

ksofiyuk commented 1 year ago

Or go back to queue.

@Czaki It seems that only using a queue is correct. With lock you can miss one or more 'SwapBuffers' if the main thread tries to send the second command while the swap thread is still blocked by the first one.

Czaki commented 1 year ago

Also, semaphores allows to not miss any event. But maybe the queue will be more precise.

brisvag commented 1 year ago

Looks like you went already deep down the rabbithole :) I think this should be brought up over at Vispy for sure. @ksofiyuk since you have a good picture of the problem, do you mind opening an issue at vispy (linking here as well)? Otherwise let me know and I can take care of it.

In the meantime, even heavy stuff with fast camera movements looks exactly the same for me with autoswap=False, so I'm in favor of doing that in napari until we figure it out on the vispy side.