napari / napari

napari: a fast, interactive, multi-dimensional image viewer for python
https://napari.org
BSD 3-Clause "New" or "Revised" License

Slider is slow with time series of large 2d images #1300

Open pwinston opened 4 years ago

pwinston commented 4 years ago

🐛 Bug

For a time series of 8k x 8k x float64 images it takes around 200ms to switch slices. For 16k x 16k it takes around 1500ms. In both cases moving the slider through many slices feels slow and laggy.

To Reproduce

```python
import napari
import numpy as np

# 8k images
napari.view_image(
    np.random.random((2, 8192, 8192)), name='two 8k 2d images'
)

# 16k images
napari.view_image(
    np.random.random((2, 16384, 16384)), name='two 16k 2d images'
)
```

Note: 16384 is the max texture size on a MacBook with an AMD Radeon Pro 5300M 4 GB. Any image larger than that will be downsampled to 16384. So we basically just cannot view images larger than that unless they are multiscale.

Expected behavior

  1. Switching images is much faster.
  2. To the degree it's not faster, the loading should be incremental and interruptible.

You want to be able to interactively "dial in" what slice you care about, moving the slider freely without delays. It's okay if the image takes a bit to fully load once you stop moving.

pwinston commented 4 years ago

Performance Monitoring Results

Using the performance monitoring stuff from #1262, here's what it looks like to move between two 8k images using the "next" button:

[Diagram: napari-slow-8k performance trace]
Selected times in the above diagram:

| step | 8k time (ms) | 16k time (ms) |
| --- | --- | --- |
| overall | 208 | 1551 |
| Dims.set_point | 85 | 714 |
| Paint | 114 | 818 |
| 1 - data.astype | 64 | 625 |
| 2 - clim[0] | 43 | 578 |
| 3 - data /= | 22 | 105 |
| flush_commands | 39 | 125 |

Where:

  1. The data.astype conversion from float64 to float32 in napari._vispy.VispyImageLayer._on_data_change
  2. The line data = (data - clim[0]) which triggers a copy in vispy.visuals.ImageVisual._build_texture
  3. The line data /= clim[1] - clim[0] in vispy.visuals.ImageVisual._build_texture
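
As an aside, the three slow steps can be reproduced standalone with plain numpy (timings are machine-dependent, and this obviously omits everything napari and vispy do around them; it just shows each step touches the full array):

```python
import time
import numpy as np

data = np.random.random((8192, 8192))  # float64 by default (512MB)
clim = (0.0, 1.0)

t0 = time.perf_counter()
data = data.astype(np.float32)         # 1 - full-copy dtype conversion (to 256MB)
t1 = time.perf_counter()
data = data - clim[0]                  # 2 - another full copy
t2 = time.perf_counter()
data /= clim[1] - clim[0]              # 3 - in-place, but still touches every byte
t3 = time.perf_counter()

print(f"1 - data.astype: {(t1 - t0) * 1e3:.0f} ms")
print(f"2 - clim[0]:     {(t2 - t1) * 1e3:.0f} ms")
print(f"3 - data /=:     {(t3 - t2) * 1e3:.0f} ms")
```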

Notes:

  1. The last line flush_commands is waiting on the card to draw. This alone is maybe okay for 8k but way too slow for 16k. It's not clear how much of this is just the card drawing a texture that big, and how much is other overhead that vispy is adding. Would have to be investigated.
  2. Converting from float64 to float32 is really slow. Could we do this on load?
  3. Not sure about the two clim-related lines, but they are slow. We need to understand what's happening there and whether it can be avoided or sped up.

Performance Today

Ideally the total time to switch slices is under 16.7 ms (60Hz) but 50ms (20Hz) might be pretty reasonable. Where we are today:

| Image Size | Over 60Hz Goal | Over 20Hz Goal |
| --- | --- | --- |
| 8k | 12X | 4X |
| 16k | 93X | 31X |

So we are 93 times too slow for a 16k image if we want it to run at 60Hz (1551ms / 16.7ms ≈ 93).

Tiled (Chunked) Rendering

If we can speed these things up that's great and will help a lot. Beyond that though I suspect ultimately we need a tiled renderer here just as much as we do with multi-scale.

It's tempting to think we have two different types of data in napari: multi-scale (big) and in-memory (small). But really I think all data, whether in-memory or not, needs to be treated as if it were big.

For multi-scale data that's chunked on disk the path is Disk/Network -> RAM -> VRAM. There tiles benefit us in both hops. For in-memory data the path is just RAM -> VRAM, but tiles are just as critical for that one hop. And unlike disk/network where we can do stuff with threads, as far as I know paging to VRAM must be done in the main thread. We can only page a small amount of data each frame, so it has to be done in chunks.
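
To make that last point concrete, here's a minimal sketch of what a per-frame upload budget could look like - all names here are hypothetical, this is not napari or vispy code:

```python
TILES_PER_FRAME = 4  # budget: e.g. 4 x 1MB tiles is a few ms of upload per frame

class TileUploader:
    """Hypothetical: page a bounded amount of RAM -> VRAM each frame."""

    def __init__(self, texture_atlas):
        self.texture_atlas = texture_atlas  # wraps the GPU texture(s)
        self.pending = []                   # tiles in RAM, not yet in VRAM

    def on_frame(self):
        # Called once per frame on the main thread. Because the batch is
        # small, the draw that follows never stalls on one giant upload.
        batch = self.pending[:TILES_PER_FRAME]
        del self.pending[:TILES_PER_FRAME]
        for tile in batch:
            # e.g. glTexSubImage2D on a sub-rectangle, ~1MB at a time
            self.texture_atlas.upload(tile.index, tile.data)
```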

Image Sizes and Tile Sizes

| Image Size | Memory | Number of 512x512 tiles | Number of 256x256 tiles |
| --- | --- | --- | --- |
| 8k x 8k x 4 bytes | 256MB | 256 | 1024 |
| 16k x 16k x 4 bytes | 1024MB | 1024 | 4096 |

| Tile Size | Memory |
| --- | --- |
| 256 x 256 x 4 bytes | 0.25MB |
| 512 x 512 x 4 bytes | 1MB |

Because of squaring it's not intuitive just how big these big images are. A 16k x 16k image is 1024X bigger than a 512x512 tile, since (16384 / 512)² = 32² = 1024. That alone is kind of surprising.

256MB or 1024MB is just a lot of data to move around in RAM or send to the GPU as one solid block. It's much easier and better to move 0.25MB to 1MB chunks. In a tight loop, moving a lot of data in small chunks won't be much slower than a single big move, but it will be vastly more granular and interruptible.

Also, to cover a 4k screen (3840 x 2160, about 8M pixels) you only need 12% of the 8k image (67M pixels) or 3% of the 16k one (268M pixels), assuming you have downsampled imagery. So rendering the full 8k or 16k is overkill: you are sending all the data to the GPU just so it can downsample it.

Benefits of Tiles

  1. Incremental streams of small updates instead of one huge one.
  2. If downsampled versions are available, we can fill the screen with far less data.

Recommendations

  1. We fix the 32-bit conversion and the clim stuff. This will help a lot.
  2. We create a tiled renderer that works for these in-memory cases just like it does for #845 and multi-scale.
sofroniewn commented 4 years ago

This is great @pwinston - it will take me a bit to digest it all, but I want to point out that the clim improvements are in progress here https://github.com/napari/napari/pull/1056 and in linked vispy PRs. As you note, we should definitely fix that, and picking up that PR would be great, but even then we'll hit limitations. More coming later!!

pwinston commented 4 years ago

I think the bottom line is that chunks/tiles are needed as much for going from RAM to VRAM as they are for going from disk/network to RAM. You might think "well my large image fits in RAM so I'm good", but a 16k x 16k x 32bit image is 1GB. Any operation you do on 1GB is going to be slow relative to a frame, which is only 16.7ms: even at tens of GB/s of memory bandwidth, a single pass over 1GB takes tens of milliseconds.

Separately a full screen 4k display can only show 8M pixels not the full 268M pixels. That alone suggests waiting while that 1GB is copied to the card in order to draw 3% of it is not a good idea, assuming you were fully zoomed in.

I think the very general lesson here is both disk and memory have sectors/blocks/pages. Everything is 4kb or 16kb or some small size. At the level of numpy you can trivially create a giant uniform block of memory, but that's just an abstraction. Any actual machinery that processes it with real hardware needs to break it down into smaller pieces.

sofroniewn commented 4 years ago

We create a tiled renderer that works for these in-memory cases just like it does for #845 and multi-scale.

This makes a lot of sense, I think we should also be thinking about making the tiled renderer able to render meshes (and so in turn our other napari layer types like "shapes" and "surface") in a high performance way, including multiscale meshes. @perlman was recently mentioning some highly optimized code for doing this in neuroglancer too.

pwinston commented 4 years ago

It'd be helpful to understand Neuroglancer and BigDataViewer and maybe some others, although that's challenging since the code can be very dense. Sometimes it's good to just get started on something, then dive into the other packages for ideas when we start to hit issues and have more context. There's a chance you'd need to rewrite things, but that's not horrible if you learned a lot. Of course if we can just ask around, that's probably worth it.

I can somewhat picture a totally generic octree where "images" is kind of a plugin: a plugin that at a minimum specifies how to store images at any node (2d or 3d) and how to downsample them. For images downsampling is literally just downsampling, but for other types of data it could be something totally different. So then you can create new "plugins" for other types of data. Mesh decimation can be super involved and complicated, like 100x harder than downsampling images, but you can maybe start simple.

One not totally obvious thing you can do is figurative types of "downsampling", where you don't try to be visually accurate at all: at the higher levels you have bounding boxes or blobs or transparent rectangles. Like they do with maps sometimes, where zoomed out you see giant aggregated circles, but if you zoom in you see the actual points:

[Screenshot: map with aggregated circles that resolve into individual points when zoomed in]

But that's a side point. The main idea I think is an octree with "plugins" for each datatype, where the plugins can get more sophisticated over time.
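
To sketch the plugin idea in code - a purely hypothetical interface, not a concrete proposal - something like:

```python
from abc import ABC, abstractmethod

class OctreeNodeData(ABC):
    """Hypothetical: what a datatype plugin provides to a generic octree."""

    @abstractmethod
    def downsample(self) -> "OctreeNodeData":
        """Build the coarser version stored in the parent node.

        For images this is literal downsampling; for points it could be
        aggregation into circles; for meshes, decimation or bounding boxes.
        """

    @abstractmethod
    def draw(self, canvas) -> None:
        """Render this node's data at its level of detail."""
```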

The hardest thing about Raveller's quadtree was that there was one operation where the user could actually modify the pixels. The pain there was that it invalidated every level of the quadtree, and you had to recompute them all on the fly. If we had to do that for meshes, that could get seriously complicated.

sofroniewn commented 4 years ago

I can somewhat picture a totally generic octree where "images" is kind of a plugin: a plugin that at a minimum specifies how to store images at any node (2d or 3d) and how to downsample them. For images downsampling is literally just downsampling, but for other types of data it could be something totally different. So then you can create new "plugins" for other types of data. Mesh decimation can be super involved and complicated, like 100x harder than downsampling images, but you can maybe start simple.

One not totally obvious thing you can do is figurative types of "downsampling", where you don't try to be visually accurate at all: at the higher levels you have bounding boxes or blobs or transparent rectangles.

I love both these concepts. I think we'd want to do the latter with our points layer, for example. Something like that has already been requested by a user who has lots of points overlaid on an image (from an image-based transcriptomics experiment where you do lots of spot finding).

The hardest thing about Raveller's quadtree was there was one operation where the user could actually modify the pixels. The pain there was you had to recompute all the affected tiles, re-downsampling things on the fly. If we had to do that for meshes, that's seriously complicated.

We definitely want to support painting (i.e. editing image pixels), and we want to support adding points and drawing 2D polygons (which are simple meshes). I'm not sure if we'll need full interactive editing of multiscale 3D meshes(!!), but it's worth thinking some of this stuff through now for sure.

kevinyamauchi commented 4 years ago

I love both these concepts. I think we'd want to do the latter with our points layer, for example. Something like that has already been requested by a user who has lots of points overlaid on an image (from an image-based transcriptomics experiment where you do lots of spot finding).

Yes! This would be amazing for points and shapes.

jni commented 4 years ago

@pwinston not to take away from the problem of large tiles, but I just want to caution that I would prefer not to automatically downsample/tile largeish 2D planes if they fit in a texture. Once a 2D slice is on the OpenGL side, it is lightning fast and buttery smooth to interact with, and I doubt there's any amount of clever multiscale tile fetching that can match this. One of my earliest "wow" experiences with napari was loading up a user's 7k x 7k image (that's it, no sliders) and zooming in and out of various sections of it to understand the content. (This was on the plane from Melbourne to Brisbane with @royerloic, he might remember it.)

Additionally, in many cases right now, the CPU->GPU transfer is absolutely not the bottleneck. For example, I might have a remote zarr array that has 1K x 1K tiles, where each tile takes 3s to download and 0.125s to do the whole vispy dance. In these cases, there's just nothing at all you can do to make things faster objectively — you just have to make sure that the downloads are put in a queue and handled on a different thread from the slider and discarded if the slider moves again before the download is complete.
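
A minimal sketch of that discard-if-stale behavior, using a generation counter (download_tile and the deliver callback here are stand-ins, not napari's actual code):

```python
import threading
import time

def download_tile(index):
    time.sleep(3)                        # stand-in for the slow remote zarr read
    return f"tile {index}"

class SliceFetcher:
    """Hypothetical: newest slider position wins, stale downloads are dropped."""

    def __init__(self, deliver):
        self._deliver = deliver          # callback that hands data to the UI thread
        self._generation = 0
        self._lock = threading.Lock()

    def slider_moved(self, index):
        with self._lock:
            self._generation += 1        # anything already in flight is now stale
            generation = self._generation
        threading.Thread(
            target=self._fetch, args=(index, generation), daemon=True
        ).start()

    def _fetch(self, index, generation):
        data = download_tile(index)      # slow, but off the UI thread
        with self._lock:
            if generation != self._generation:
                return                   # slider moved again: drop this result
        self._deliver(data)
```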

Another example is the MNIST example on my fork of noise2self here:

https://github.com/jni/noise2self/blob/napari-completed/notebooks/Intro%20to%20Neural%20Nets.ipynb

Here (cells 15-18) we are transferring three 28x28 images to the GPU, which is probably completely within the 16ms budget. But because those images are loaded from disk by some external library, and one of those images then needs to go through a complex neural net computation before being rendered, we still end up waiting, and the slider is completely sluggish.

So I think the top priority is to make _set_view_slice operate on an asynchronous queue. In other words, if the slider itself has 16ms responsiveness, 125ms for flush_commands for 16k x 16k is basically fine. And I'm hoping almost all the other costs can be swallowed by improvements in VisPy, anyway! (Namely: support for all common dtypes, clim normalization in the shader.)

d-v-b commented 4 years ago

Additionally, in many cases right now, the CPU->GPU transfer is absolutely not the bottleneck. For example, I might have a remote zarr array that has 1K x 1K tiles, where each tile takes 3s to download and 0.125s to do the whole vispy dance. In these cases, there's just nothing at all you can do to make things faster objectively — you just have to make sure that the downloads are put in a queue and handled on a different thread from the slider and discarded if the slider moves again before the download is complete.

Maybe I'm misunderstanding, but in this scenario won't there be situations (e.g., random-access browsing) where you are navigating from one cached-in-RAM-but-not-in-VRAM tile to another, and then you will be hit by the 0.125s latency? At some point your RAM cache fills up, and for all those tiles the 0.125s latency becomes very real. Or am I misunderstanding things? In any case, this is a great discussion!

pwinston commented 4 years ago

@jni I'm not sure what "make _set_view_slice operate on an asynchronous queue" means. From what I saw, the slider does not queue up calls to _set_view_slice. If you drag the slider from 0 to 511 it does not draw all the slices from 0 to 511, that would be horrific. It will skip through drawing just a few slices.

So in my mind it's not queueing them, it's just slow at drawing slices. But I actually did not investigate moving the slider; I just looked at doing a single step. Since that was slow, it seems necessary to fix that - I don't see how the slider can be fast if each step it takes is slow?

I think I need to draw up diagrams/docs for what I mean by "tiled rendering" here. Basically a proposal for what I think we need to fix this issue and #845. It's a big project and certainly we want to make sure everyone understands it and is on board. On the other hand we don't want to do a big design up front, but we do need to agree it's worth doing.

What I'm calling "tiled rendering" decouples rendering from loading, so rendering never blocks. We only draw tiles that are in VRAM, so that's always fast. Today we block rendering while loading stuff into RAM from disk or network, it would no longer do that. And today we block rendering loading large textures into VRAM, it would no longer do that.

So in the proposed system rendering just never blocks, it always draws at 60Hz, the slider always moves freely. At the same time though stuff is loaded from disk into RAM and from RAM into VRAM as fast as possible.
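
In toy form, the draw side of that might look like this (all names hypothetical; real tile management is of course more involved):

```python
class TileCache:
    """Hypothetical bookkeeping for which tiles live where."""

    def __init__(self):
        self.vram = set()         # tiles resident on the GPU
        self.requested = set()    # tiles queued for async loading

    def request(self, tile):
        self.requested.add(tile)  # serviced off the draw path, a few per frame

    def best_resident_ancestor(self, tile):
        return None  # a real octree would walk up to a coarser resident level

def on_draw(visible_tiles, cache, draw_tile):
    # Rendering never blocks: draw what's in VRAM, request the rest.
    for tile in visible_tiles:
        if tile in cache.vram:
            draw_tile(tile)          # always fast: the data is already on the card
        else:
            cache.request(tile)      # disk/network -> RAM -> VRAM happens elsewhere
            coarser = cache.best_resident_ancestor(tile)
            if coarser is not None:
                draw_tile(coarser)   # lower-res stand-in until the tile arrives
```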

This is not all that ambitious. Games and simulations routinely run at 60Hz (or much faster) for hours and hours as they load gigabytes and gigabytes of content from disk/network into RAM and from RAM into VRAM. They are constantly streaming massive amounts of geometry and textures, but they only draw what's on the card, so it's always fast.

You mention 7k x 7k being fast today. It will be just as fast with tiles, because the tiles will all be in VRAM. Consider a 16k x 16k texture with 512x512 tiles. That's 1024 tiles so 2048 triangles. A game or simulation will draw millions of triangles per frame. Drawing 2048 triangles probably leaves 98% of the GPU silicon dark and unused!

I think a tiled renderer "scales to zero", it works for all cases small and large. But certainly benchmarking should be a big part of this project, and the project is simply not done if any case is still slow or laggy.

tlambert03 commented 4 years ago

If you drag the slider from 0 to 511 it does not draw all the slices from 0 to 511, that would be horrific. It will skip through drawing just a few slices.

I think this is a key point to nail down here. I've heard it feared in previous discussions that this sort of computation backup was happening... but I've also wondered whether that's actually the case, and agree with @pwinston here (though, this seems like a pretty easy thing to "just prove").

sofroniewn commented 4 years ago

If you drag the slider from 0 to 511 it does not draw all the slices from 0 to 511, that would be horrific. It will skip through drawing just a few slices.

I think this is a key point to nail down here. I've heard it feared in previous discussions that this sort of computation backup was happening... but I've also wondered whether that's actually the case, and agree with @pwinston here (though, this seems like a pretty easy thing to "just prove").

I don't think the slider draws all 511 slices!!! But it can easily start falling behind through normal usage on things that are slow to "render" (and, to @jni's point, will always be slow) - where right now "render" might include some crazy dask computation to "load" the data. For example, if I move the slider, then stop, then move in the other direction, then move again, etc., I can easily get to the point where it looks like many calls to _set_view_slice have been requested and they are all backed up, say 15 calls or something. I think what @jni is asking for at the beginning is really just a queue where we know we've requested 15 calls and can just drop the first 14 and only execute the last one. I'm not sure how hard that is to do as a standalone thing, or if it is something that will fall out of the proposed tiled rendering scheme. Making sure that everyone is on the same page about the above scenario seems important though.

What I'm calling "tiled rendering" decouples rendering from loading, so rendering never blocks. We only draw tiles that are in VRAM, so that's always fast. Today we block rendering while loading stuff into RAM from disk or network, it would no longer do that. And today we block rendering loading large textures into VRAM, it would no longer do that.

So in the proposed system rendering just never blocks, it always draws at 60Hz, the slider always moves freely. At the same time though stuff is loaded from disk into RAM and from RAM into VRAM as fast as possible.

I guess not blocking on loading into VRAM is the key that means the scenario above with the slider will always be fast. I think the point @jni was making was that sometimes we'll have a lazy computation set up to go RAM -> RAM that might be very slow, and we can't get blocked on that (and want to drop excess calls).

You mention 7k x 7k being fast today. It will be just as fast with tiles, because the tiles will all be in VRAM. Consider a 16k x 16k texture with 512x512 tiles. That's 1024 tiles so 2048 triangles. A game or simulation will draw millions of triangles per frame. Drawing 2048 triangles probably leaves 98% of the GPU silicon dark and unused!

Yes, I think we will still preserve this speed. I think the experience that @jni liked with the 7k x 7k image will be even better, as we will progressively load it (I think?) so that you see something low-res right away, then fairly quickly the high-res thing, and all the panning/zooming is still done on the GPU. I think what @jni wants to avoid is my rather poor "multiscale" code, which right now forces you to go back to the CPU every time you pan/zoom to fetch a new "tile".

I think I need to draw up diagrams/docs for what I mean by "tiled rendering" here. Basically a proposal for what I think we need to fix this issue and #845. It's a big project and certainly we want to make sure everyone understands it and is on board. On the other hand we don't want to do a big design up front, but we do need to agree it's worth doing.

I think a few diagrams would help @pwinston. I'm pretty excited about how all these conversations are going, but there's a lot going on so definitely good to make sure we're all keeping up and understanding.

pwinston commented 4 years ago

What's the repro case for these 14 frames? What data/steps? I don't see any queueing with this exact data. Do you see queueing with this data:

```python
napari.view_image(
    np.random.random((21, 8192, 8192)), name='big 2D timeseries'
)
```

That's the data for this issue. I do see the slider never catches up if you continuously move it. But if you stop the mouse dead I only see at most 1 intermediate frame. I can dribble around for minutes but when I stop, it quickly stops.

This makes sense: we're blocking the event loop, when it unblocks it sees the mouse has moved and generates a move event. Qt can't possibly queue lots of mouse moves, that would be pathological. It wouldn't just be a problem for us, every Qt app would break. If you blocked for a long time you'd get this long "movie" playing back cursor position over time. I've never seen that in any GUI really.

BTW "60Hz always" is only trivial for 2D images. My 5500M can draw 52 billion pixels/second but you only ever need to draw 8M pixels to fill the screen. 2D images are just dead easy. Now meshes, labels, shapes and points are a different story. There you can always choke the card if you have too much content. So there you need lots of other tricks. But you literally can't choke anything with 2D images because no matter how big your images are there's a fixed number of pixels on the screen.

pwinston commented 4 years ago

Here's a movie of what I see with the above data. It continuously lags because it's drawing at 5Hz, but when I stop, it quickly stops dead: https://imgur.com/a/61IQEdu

I think the problem is that 5Hz is so slow it's basically a slide show, you don't even see it as an animation. If it were really queueing then it would get further and further behind without bound. Instead it's just always a fixed amount behind such that when you stop it stops too.

My overall claim is that nothing will feel fast until the draw is fast, so that should be the goal. If there were some way to make it feel fast with a slow draw, I think that would be surprising.

sofroniewn commented 4 years ago

That's the data for this issue. I do see the slider never catches up if you continuously move it. But if you stop the mouse dead I only see at most 1 intermediate frame. I can dribble around for minutes but when I stop, it quickly stops.

I see the same for this data set too, and I'm unable to reproduce a catch up of multiple frames with any of my examples when stopping.

Qt can't possibly queue lots of mouse moves, that would be pathological. It wouldn't just be a problem for us, every Qt app would break. If you blocked for a long time you'd get this long "movie" playing back cursor position over time. I've never seen that in any GUI really.

Yeah, this makes complete sense. The alternative would be pretty nuts - I honestly felt like this had happened to me visually, but you're right I only see what looks like "1 intermediate frame" when I try it now. We used to have some problems with repeated calls to _set_view_slice when other properties were updated and I wonder if that could have contributed to my perception in some more complex example, but that's better now too and I can't make this happen - so my apologies for so boldly stating it above!

I think the problem is that 5Hz is so slow it's basically a slide show, you don't even see it as an animation. If it were really queueing then it would get further and further behind without bound. Instead it's just always a fixed amount behind such that when you stop it stops too.

Yup this also makes sense, and thanks for the clarification and checking this out.

Additionally, in many cases right now, the CPU->GPU transfer is absolutely not the bottleneck. For example, I might have a remote zarr array that has 1K x 1K tiles, where each tile takes 3s to download and 0.125s to do the whole vispy dance. In these cases, there's just nothing at all you can do to make things faster objectively — you just have to make sure that the downloads are put in a queue and handled on a different thread from the slider and discarded if the slider moves again before the download is complete.

I guess in response to this from @jni - nothing is queuing up (that was my mistake) but this one operation is just slow.

My overall claim is that nothing will feel fast until the draw is fast, so that should be the goal. If there were some way to make it feel fast with a slow draw, I think that would be surprising.

Ok yeah, I think I'm really getting this now. Thanks @tlambert03 and @pwinston for being patient with me. I was being slow, had too many things in my queue 🤣

perlman commented 4 years ago

This makes a lot of sense, I think we should also be thinking about making the tiled renderer able to render meshes (and so in turn our other napari layer types like "shapes" and "surface") in a high performance way, including multiscale meshes. @perlman was recently mentioning some highly optimized code for doing this in neuroglancer too.

I don't recall the context. I think I was referring to the dense oct-tree storage Neuroglancer uses to store multiscale meshes. Rendering is going to be universally limited by the total number of vertices...

Performance Monitoring Results

I love this!

  1. The data.astype conversion from float64 to float32 in napari._vispy.VispyImageLayer._on_data_change
  2. The line data = (data - clim[0]) which triggers a copy in vispy.visuals.ImageVisual._build_texture
  3. The line data /= clim[1] - clim[0] in vispy.visuals.ImageVisual._build_texture

Painful. In the context of label data, I am still strongly in support of using GPU (shader) hashing (#204, #713, both stale).

pwinston commented 4 years ago

No problem @sofroniewn, it's hard to parse exactly what's happening - you just know it feels slow and laggy. BTW, here's a proposed subjective scale for scientific imaging:

| 60Hz | 30Hz | 20Hz | 10Hz | 5Hz | 1Hz |
| --- | --- | --- | --- | --- | --- |
| 😁 | 🙂 | 😑 | 😞 | 😡 | 💀 |

For games 60Hz is more the minimum, but often in scientific viz the content is not moving. The "animation" is only for view manipulation, it's functional not aesthetic. I think for us 60Hz is great but really 20Hz is probably fine in most cases. But 5Hz is awful.

I will write up the basic ideas we've been talking about. I think solving this for 2d images is pretty straightforward. This is stuff people have been doing for decades; we don't have to invent anything, just implement a decent version of it.

Solving it for 3d images, labels, points, shapes and meshes is another matter, but I think we have to start somewhere. I think the machinery we create for 2d images will be necessary for the other problems, but it won't be sufficient, especially for geometry.

tlambert03 commented 4 years ago

[GIF: napari_FACES]

perlman commented 4 years ago

@tlambert03 : I love this.

I know it's not the topic of this thread, but it may come up with asynchronous loading of remote chunked data sources (zarr): image tearing is terrible, too.

[This entire conversation brings back memories of poor CATMAID rendering performance. It used to be ~10Hz @ ~1080p, with tiles "tearing" as they loaded. With WebGL, it can do ~30Hz @ 4K.]

pwinston commented 4 years ago

This implies that with OpenGL you can use multiple threads, but that there's really no point - it doesn't help performance because "there's usually just 1 GPU": https://www.khronos.org/opengl/wiki/OpenGL_and_multithreading

But this implies you can do it in some cases but it's complicated: https://developer.apple.com/library/archive/documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_threading/opengl_threading.html

I think you can use multiple threads in WebGL and Neuroglancer does.

At any rate I think these 2 design goals are so basic they'd benefit us while using any API:

  1. Don't do IO or random compute in the render thread.
  2. Break up large resources into smaller resources.

I'm going to focus on vispy+opengl only, but I just think this is heading in a good direction regardless. I'll include links like these in the references. It'd be nice to have a good rendering page that linked to various rendering-related things.

pwinston commented 4 years ago

As an aside about frame rate, here is NVIDIA arguing you need their most expensive cards to run popular games at 240Hz to improve your K/D (kill-to-death-ratio): https://www.nvidia.com/en-us/geforce/news/geforce-gives-you-the-edge-in-battle-royale/

Once I saw a demo of a specialized display that ran at 600Hz. The presenter had 120Hz and 300Hz versions, and with his demo you could see that 600Hz was in fact clearly better. But the demo was of rapidly turning fan blades! So the ideal frame rate depends highly on what you are looking at.

Related to that, movies are historically 24Hz, and what most people don't realize is that this highly constrains what shots you can do. If panning, you have to pan at a certain slow rate or it will look awful. Movie people just know this and plan accordingly. With digital they've shot some movies at 48Hz and 60Hz and it changes things a lot. But some people say they look awful and "fake"!

sofroniewn commented 4 years ago

Related to that, movies are historically 24Hz, and what most people don't realize is that this highly constrains what shots you can do. If panning, you have to pan at a certain slow rate or it will look awful. Movie people just know this and plan accordingly. With digital they've shot some movies at 48Hz and 60Hz and it changes things a lot. But some people say they look awful and "fake"!

So interesting! I never knew that, thanks for sharing 😄

jni commented 4 years ago

Re the slider performance, here's what I think is happening with the "one intermediate point":

In other words, it's always moving to the current mouse position, and is one step behind if you're moving.

Re

Don't do IO or random compute in the render thread.

I think this is low-hanging fruit. Here's the line where the image gets instantiated and blocks the UI when there's IO/compute associated with it:

https://github.com/napari/napari/blob/c04484fc08694d23b4790387c3618d9fb4fad00f/napari/layers/image/image.py#L527

and then here's where it gets set to data, which triggers a vispy draw:

https://github.com/napari/napari/blob/c04484fc08694d23b4790387c3618d9fb4fad00f/napari/layers/image/image.py#L538-L539

From what I can tell, we should have the first line trigger compute/IO on a new thread, and the Future get saved somewhere handy. When a new Future gets added to this 1-element queue, the older future is canceled and clobbered. Then there can be a 60Hz poll on whether the Future is done and the data_raw setting part can happen. Am I missing some steps @pwinston?
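
Something like this minimal sketch is what I have in mind (SliceLoader and its methods are hypothetical, not existing napari API):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

class SliceLoader:
    """Hypothetical 1-element queue: a new slice request clobbers the old one."""

    def __init__(self):
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._future = None

    def request(self, array, index):
        if self._future is not None:
            self._future.cancel()  # best-effort; running IO/compute can't be interrupted
        self._future = self._executor.submit(lambda: np.asarray(array[index]))

    def poll(self):
        # Driven by a ~60Hz timer (e.g. a QTimer): returns the data once ready,
        # at which point the data_raw setting and vispy draw can happen.
        if self._future is not None and self._future.done():
            future, self._future = self._future, None
            return future.result()
        return None
```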

With digital they've shot some movies at 48Hz and 60Hz and it changes things a lot. But some people say they look awful and "fake"!

Yep I am totally among those. :joy: And don't get me started on televisions with "fancy" interpolation to make movement smoother!