Closed by jni 2 weeks ago
Yeah, but only to 50 ms. If the processing is slower than that, you overwhelm the GUI, which is the case in the surface layer.
Maybe the surface code could be optimized? Or we could increase this time to 100 ms?
I agree with @brisvag that requiring the mouse to stop in order to update the status is a bad decision.
Some quick profiling on my laptop shows get_value for the model in the example takes around 170-200 ms. I was able to get it down to ~150 ms with some micro-optimizations, but I don't see an easy way to a 3x improvement right now. It's not a major surprise where most of the time is spent:
geometry.py:17(project_points_onto_plane)
geometry.py:358(inside_triangles)
geometry.py:608(line_in_triangles_3d)
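For anyone who wants to reproduce this kind of breakdown, here is a self-contained sketch of profiling with cProfile; `slow_get_value` is just a stand-in, since the real `layer.get_value` call needs a running napari session:

```python
import cProfile
import io
import pstats

def slow_get_value():
    # stand-in for layer.get_value; replace with the real call in a napari session
    return sum(i * i for i in range(200_000))

pr = cProfile.Profile()
pr.enable()
slow_get_value()
pr.disable()

out = io.StringIO()
# sort by cumulative time to see which callees dominate, as in the listing above
pstats.Stats(pr, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```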
1. Make sure that get_status runs in a separate thread so it doesn't interfere with GUI responsiveness.
2. Monitor how long a get_status call takes, and
3. use debounce and/or throttle (probably debounce) to make sure it doesn't get called more often than that.
I prefer debounce, at least for long-running get_status, because if get_status takes a while to run, then throttling it will only show an outdated status message anyway, defeating the purpose of it and wasting precious CPU cycles.
I think 1. is the best (and probably easiest) way to handle this, unless we can simply optimize the code so well it becomes unnecessary. Still useful to do IMO.
Not sure how I feel about debounce vs throttle; I guess with both the displayed value is outdated, so debounce at least doesn't waste resources. Either way (if we go the threaded approach), we should probably signal somehow that the value is not currently up to date (similar to the loading icon we're working on for layers and the new async).
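For illustration, a minimal trailing-edge debounce could look like the following; this is a hypothetical helper built on `threading.Timer`, not napari's actual utilities:

```python
import threading

def debounce(wait):
    """Trailing-edge debounce: only call fn after `wait` seconds
    have passed with no new calls (hypothetical helper)."""
    def decorator(fn):
        state = {"timer": None}
        lock = threading.Lock()
        def wrapper(*args, **kwargs):
            with lock:
                if state["timer"] is not None:
                    state["timer"].cancel()  # a newer call supersedes the pending one
                state["timer"] = threading.Timer(wait, fn, args, kwargs)
                state["timer"].start()
        return wrapper
    return decorator

calls = []

@debounce(0.05)
def update_status(pos):
    calls.append(pos)
```

With this, a burst of mouse-move events collapses into a single status update once the cursor pauses, which is exactly the behavior being argued for above.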
@aganders3 How did you test it? I got:
%timeit viewer.layers[0].get_status((1.6522035388587697, 0.9330241210928516, 3.502586273276682))
23.2 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Is my machine so strong, or did I test it incorrectly?
Try something like this, I think it's how it's called when mousing over the surface:
%timeit viewer.layers[0].get_value((2.4136927335421827, 3.774940485066061, 4.213935782687853), view_direction=[-0.96592583, 0.25881905, -0.], dims_displayed=[0, 1, 2], world=True)
203 ms ± 13.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
OK, I'm able to reproduce the problem. A quick review suggests the code could be sped up with more matrix multiplication, but it requires a few hours of investigating the algorithm.
But it looks like it may require a different algorithm to find the right triangles. Ray tracing is often used in games, so there should be faster algorithms.
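For reference, the standard per-triangle test used in games is the Möller–Trumbore ray/triangle intersection; a minimal NumPy sketch (illustrative, not napari's code):

```python
import numpy as np

def ray_triangle(orig, direction, v0, v1, v2, eps=1e-9):
    """Möller–Trumbore: return distance t along the ray to the
    triangle (v0, v1, v2), or None if there is no hit."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = e1 @ p
    if abs(det) < eps:
        return None  # ray is parallel to the triangle plane
    inv = 1.0 / det
    s = orig - v0
    u = (s @ p) * inv
    if u < 0 or u > 1:
        return None  # outside first barycentric bound
    q = np.cross(s, e1)
    v = (direction @ q) * inv
    if v < 0 or u + v > 1:
        return None  # outside second barycentric bound
    t = (e2 @ q) * inv
    return t if t > eps else None
```

Even with a fast per-triangle test, testing every triangle is still linear in mesh size, which is why the spatial structures below matter.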
I'm not an expert here, but I believe one of the largest gains for ray-tracing complex scenes is to put the objects (triangles) into a tree structure (bounding volume hierarchy (BVH) or kd-tree). I think this is the only way to get better-than-linear performance with respect to the number of triangles. I don't have a feel for how slow it might be to generate a BVH, but may experiment with this. I suppose it bodes well that our scene is generally not rapidly changing.
Another method maybe worth considering is to render the values into another buffer and use glReadPixels(). I think this would be a change in vispy.
In general, we should implement one of these structures for layers like Points or Surface; it would speed up multiple operations. But this is long term.
> Another method maybe worth considering is to render the values into another buffer, and use glReadPixels(). I think this would be a change in vispy.
This already exists in vispy ("picking mode"), and we could/should probably use it for all get_value tasks; however, I never wrapped my head around it and I'm not sure how easy it is to make it work in napari.
EDIT: actually, maybe picking mode is only capable of returning "which visual" is being rendered, rather than "which vertex".
I hacked together a little prototype of a face-picking mode in vispy. On my laptop (M1 Pro) I get ~5-12 ms to pick, highlight, and re-render the mesh. The way this is done now actually wouldn't require changes to vispy, but it seems promising and I think it can be improved (especially w.r.t. GPU memory usage) by a change upstream.
Awesome, this is what I was hoping was possible!! :heart_eyes: totally support upstreaming to vispy, happy to help with that if needed :)
Questions:
- are you doing this continuously on hover, or sporadically on click?
Either works - I was registering with either on_mouse_move or on_mouse_press events in vispy. I don't think vispy has the same throttling options as napari, so there are some issues with on_mouse_move as it triggers repeatedly even if you're moving within the same face.
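One way to work around that, assuming each move event can first be resolved to a face id, is a small filter that only fires the callback when the hovered face actually changes (names here are hypothetical):

```python
class FaceHoverFilter:
    """Hypothetical helper: suppress repeated callbacks while the
    cursor keeps moving within the same face."""
    def __init__(self, on_face_change):
        self._last = None
        self._on_face_change = on_face_change

    def handle_move(self, face_id):
        # only forward the event when the picked face id changes
        if face_id != self._last:
            self._last = face_id
            self._on_face_change(face_id)

seen = []
f = FaceHoverFilter(seen.append)
for fid in [3, 3, 3, 5, 5, 3]:
    f.handle_move(fid)
```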
- is it doable to get that info out from opengl, or is everything happening internally in the shaders?
I get the info out of opengl by rendering into a numpy array without displaying it. Then I use the mouse position to index into that array and get the color, which encodes the face index in the array of faces.
Right now the hack is that the picking mesh is a whole separate mesh with the face colors calculated on CPU, so it duplicates the buffers sent to the GPU. This is what I meant by improving with some upstream code, as I think the face colors could be calculated on the fly in the shader instead (saving memory and probably time). I think this can be done with gl_VertexID or gl_PrimitiveID or something?
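The index-to-color encoding is straightforward either way; for example, packing a face index into 24 bits of RGB and decoding it from the picked pixel might look like this (an illustrative sketch, not the prototype's exact code):

```python
import numpy as np

def face_id_to_rgb(ids):
    """Pack face indices (< 2**24) into three uint8 color channels."""
    ids = np.asarray(ids, dtype=np.uint32)
    return np.stack(
        [(ids >> 16) & 0xFF, (ids >> 8) & 0xFF, ids & 0xFF], axis=-1
    ).astype(np.uint8)

def rgb_to_face_id(rgb):
    """Recover the face index from a picked pixel's RGB value."""
    rgb = np.asarray(rgb, dtype=np.uint32)
    return (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]
```

With the picking render held in a `(H, W, 3)` array, looking up the face under the cursor is then just `rgb_to_face_id(image[y, x])`.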
- can we do the same with points?
I have only given this a little thought, but I think something similar would work.
- several small faces are highlighted in green in the first surface. Is that a color collision problem? Or am I misunderstanding the image?
Sorry I should have explained that image better. This is from a demo I made where the mesh is all purple to start, then if you hover (or click) a face it turns green (and stays green). The very small green triangles are from moving the mouse quickly over the mesh in the background where triangles are small.
Hi, I was working on the threading approach stated above and could see a considerable difference in how the surface is rendered on canvas during cursor movement, and in the user experience generally. Though the change attached below does not reduce the total computation time as of now, it specifically addresses decoupling the viewer and get_status computations, making them concurrent instead. Please let me know your thoughts on this approach, and whether we need to consider adaptive throttling or debouncing on top of it.
Would love to get some early feedback on approach, implementation or discussion here: https://github.com/akuten1298/napari/pull/4
Also attached in PR is the before vs after user experience recording.
Thank you :)
Hi @akuten1298, thanks for picking this up, awesome showcase! I'm definitely in favour of doing threading for this.
I'd like to hope that it's as simple as https://github.com/akuten1298/napari/pull/4, but I have a creeping fear that the threading police (@andy-sweet and @Czaki especially :P) will have some worries about sharing of resources between threads, race conditions, and all that fun stuff ^^'. Thing is, get_status has to access layer data and a lot of other sensitive/changeable state, so we probably need some locks and similar tricks. But I'm sure @andy-sweet and @Czaki can be of more help here.
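To make the concern concrete, a latest-wins worker along these lines is one way to keep get_status off the GUI thread while discarding stale cursor positions; this is a sketch with hypothetical names, and it deliberately ignores the locking of shared layer state raised above:

```python
import queue
import threading

class StatusWorker:
    """Sketch: one daemon thread computes status for the most recent
    cursor position, discarding requests that are already stale."""
    def __init__(self, compute, on_result):
        self._requests = queue.Queue()
        self._compute = compute
        self._on_result = on_result
        threading.Thread(target=self._run, daemon=True).start()

    def request(self, position):
        # called from the GUI thread on every mouse move; never blocks
        self._requests.put(position)

    def _run(self):
        while True:
            position = self._requests.get()
            try:  # drain the queue so only the newest request is computed
                while True:
                    position = self._requests.get_nowait()
            except queue.Empty:
                pass
            self._on_result(self._compute(position))
```

A real version would also need to marshal `on_result` back to the GUI thread (e.g. via a Qt signal) and guard the layer data that `compute` reads.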
Good points @brisvag - maybe we can benefit from some of @andy-sweet's async slicing work 🙂.
Thank you @brisvag for the early comments. Will make sure to check on this and get back on this thread with further updates regarding the threading approach.
@brisvag totally support upstreaming to vispy, happy to help with that if needed :)
I have opened a PR for "face picking" at https://github.com/vispy/vispy/pull/2500 (nice PR number!) if you have time to take a look.
Hi, I was working on porting Ashley's work on vispy face picking over to napari and have created a draft PR for an initial review from the community. Once again, we were able to significantly bring down the status bar computation time, but we still have to decide which approach to proceed with. We have compared the runtimes of both approaches independently, but might still need a common approach/script to compare them quickly and easily.
BVH PR: https://github.com/akuten1298/napari/pull/7
Face Picking PR: https://github.com/akuten1298/napari/pull/10
Please take a look and suggest feedback when you find time, thank you :)
> @aganders3 at https://github.com/vispy/vispy/pull/2500 (nice PR number!)
it's nice, but I suspect it will take some time still for someone to beat my scikit-image/scikit-image#3000. 😃 (For those that don't know, Python 3 used to be called Python 3000.)
Gentle bump here -- any of those draft PRs able to get over the line?
@aganders3 Any thoughts on what to do with those draft PRs? They don't look far off from making a big impact.
Hm, I can take another look soon (vacation this week). I think the performance of the GPU version is closer than the BVH, but implementation is a bit awkward.
Even the simple multi-threaded version was a big UX improvement so maybe we should investigate that a little more?
> Even the simple multi-threaded version was a big UX improvement so maybe we should investigate that a little more?
yes, just like we have a thread dedicated to slicing we could have a thread dedicated to picker updates. I imagine that a lot of the slicing machinery (e.g. discarding stale requests) can be reused here.
> yes, just like we have a thread dedicated to slicing we could have a thread dedicated to picker updates. I imagine that a lot of the slicing machinery (e.g. discarding stale requests) can be reused here.
Yeah I think this would be great!
🧰 Task
In this discussion, @brisvag and @aganders3 figured out that the texture example had low framerate not because of VisPy and OpenGL, but because of layer.get_status. The simple workaround for now is to not have the layer selected when interacting with it, but ideally we should:
> I prefer debounce, at least for long-running get_status, because if get_status takes a while to run, then throttling it will only show an outdated status message anyway, defeating the purpose of it and wasting precious CPU cycles.