josephernest opened this issue 7 months ago
Thanks @josephernest for the report.
@lagru can you take this one? 😃
> Is there an implementation of 2D-array peak detection that could become an "official" scipy.signal.find_peaks_2d function?
As far as I'm aware there isn't, but there has been some work in scikit-image to get there for 2D and more dimensions:

- skimage.feature.peak_local_max, which works for n dimensions and allows reducing the number of peaks by distance via min_distance, similar to find_peaks(..., distance=...).
- skimage.morphology.local_maxima, which also allows ND and especially defining the connectivity in higher dimensions. There's no peak filtering included in this function.
- skimage.morphology.remove_small_objects
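For anyone who wants to try these right away, here is a minimal sketch on a made-up 2D array; the functions and the min_distance parameter are as listed above, while the toy data and parameter values are arbitrary:

```python
import numpy as np
from skimage.feature import peak_local_max
from skimage.morphology import local_maxima

# Toy 2D "image" with two well-separated bumps.
y, x = np.mgrid[0:100, 0:100]
image = (
    np.exp(-((x - 25) ** 2 + (y - 25) ** 2) / 50.0)
    + 0.5 * np.exp(-((x - 70) ** 2 + (y - 75) ** 2) / 50.0)
)

# (row, col) coordinates of local maxima, thinned by a minimum pixel
# distance -- roughly the role `distance` plays in scipy.signal.find_peaks.
coords = peak_local_max(image, min_distance=10)

# Boolean mask of all local maxima, with no filtering applied.
mask = local_maxima(image)

print(coords)      # one coordinate pair per bump
print(mask.sum())  # number of local maxima found
```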
For prominence filtering there hasn't been anything that I'm aware of. I've thought about tackling that after https://github.com/scikit-image/scikit-image/pull/4165 though.
A quick search also brought up https://github.com/Xunius/python_peak_promience2d (via this SO answer) but I haven't investigated it closely yet.
Thanks a lot @lagru. Would you say this should be kept open as wanted, or is it something to do in a library other than SciPy?
Thanks @tupui and @lagru.
> Would you say this should be kept open as wanted, or is it something to do in a library other than SciPy?
The rationale behind this was the fact that scipy.signal.find_peaks is really popular (see e.g. https://stackoverflow.com/questions/1713335/peak-finding-algorithm-for-python-scipy), and I think people are looking for an equivalent of this for 2D arrays, possibly with the same SciPy API (prominence, etc.).
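For concreteness, this is the kind of 1D call whose 2D counterpart is being asked for; find_peaks and its prominence/distance parameters are the existing SciPy API, and the data below is just an example:

```python
import numpy as np
from scipy.signal import find_peaks

x = np.linspace(0, 6 * np.pi, 1000)
y = np.sin(x) + 0.3 * np.sin(5 * x)

# Keep only peaks with a prominence of at least 0.5 that are at least
# 50 samples apart -- the filters a 2D equivalent would ideally offer too.
peaks, props = find_peaks(y, prominence=0.5, distance=50)
print(peaks)
print(props["prominences"])
```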
> Would you say this should be kept open as wanted, or is it something to do in a library other than SciPy?
That depends on whether you see this functionality as within the scope of SciPy. I've never been really clear on the border between scipy.ndimage and skimage. I'd say keep this open. It might be a good indicator down the line for where people look for this functionality first.

Personally, it seems general enough to me to include it in SciPy with what's available already, and I wouldn't be opposed to migrating that code to SciPy. I will probably try to move this along within scikit-image first though, and then look into whether it is worth moving later. It would also be trivial to point users to scikit-image in find_peaks' docstring and within an exception message.
Sounds good 👍 thanks 🙏
Fyi, I polished off https://github.com/scikit-image/scikit-image/pull/4165 and hope that we get the reviews to merge it soon. @josephernest, would that be a step in the right direction?
@tupui, that PR implements another algorithm in skimage that repeatedly calls on https://github.com/scipy/scipy/blob/7d4600e705f0929a77097d82d0d605aeb9c9bc50/scipy/spatial/_ckdtree.pyx#L850-L853
from within Cython code. Unfortunately, it returns a list, which together with the Python-facing API makes it somewhat hard to release the GIL and do fast iteration. I didn't take a thorough look at all the discussions here, but do you know of any efforts in SciPy to expose a C/C++ API like NumPy does with get_include?
If we could wrap the functionality from cKDTree directly in Cython that would likely yield significant performance benefits. The only other solution I can think of would be to vendor it, which seems wasteful.
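To make the Python-facing behaviour concrete, a small sketch of what the current query returns (this is existing scipy.spatial API; only the toy data and radius are invented):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 2))
tree = cKDTree(points)

# Querying all points at once; workers=-1 parallelises the C++ query
# itself, but the result still comes back as Python objects: an object
# array holding one variable-length list of neighbor indices per point.
neighbors = tree.query_ball_point(points, r=0.05, workers=-1)
print(type(neighbors), type(neighbors[0]))  # ndarray (dtype=object), list

# Iterating this from Cython means touching Python lists, so the GIL
# has to be held -- which is the bottleneck described above.
counts = [len(n) for n in neighbors]
```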
Thanks @lagru!
I am not sure there are discussions for that. But worth asking @tylerjereddy.
Oh gosh, I'm not aware of any plans to make our spatial C/C++-level "API" consumable in a public sense. I guess there's some stuff happening like that in special, but a decent chunk of spatial is wrapped around Qhull, and the community also has the (more?) popular/competing, license-incompatible CGAL, which is quite different to use, I think.
I'm also not entirely clear on the route to performance improvement here, since the function can return ragged data structures, which is why the nested objects/lists are used. It already takes a workers argument; is there no way for you to use concurrency to get some speedup that way after accumulating enough points? I'll take a look at the skimage source in a minute to see if anything sticks out, and maybe ask for sample timings there.
We can also ask @sturlamolden of course.
And, not to derail even more, but sklearn also has KDTree. Maybe there are some opportunities to consolidate things somehow, but yikes on the amount of effort that would be to coordinate (I talked to Thomas Fan about that at some point at a SciPy conference).
Thanks Tyler. I agree that it would be good if we could coordinate something with sklearn 👍
> And, not to derail even more, but sklearn also has KDTree.
Oh, sklearn's KDTree returns arrays on query. I definitely need to check this out from a performance perspective. Maybe the biggest challenge with SciPy's current version is iterating the results quickly: the returned list (or array of lists) makes it hard to do fast Cython indexing or to use nogil.
Speaking of, I'm a bit surprised by this bit: the internal C++ implementation returns a vector which is then unpacked into a list whose size is even pre-allocated with m * [None]. Maybe I'm missing context, but why not transform the C++ vector into a proper array instead?
I'd be happy to make a PR that adds an option to return a more efficient data structure, though I'd like to know what kind of API update you would prefer. A new function query_radius or query_ball_point_arr (like sklearn)? I'd prefer SciPy's kdtree long-term, as sklearn is only an optional dependency.
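For comparison, this is roughly what sklearn's KDTree.query_radius hands back (an object array whose elements are integer index arrays rather than Python lists); the data here is made up, the API is scikit-learn's as documented:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 2))
tree = KDTree(points)

# Still jagged, but each per-point result is already a typed integer
# ndarray instead of a Python list.
ind = tree.query_radius(points, r=0.05)
print(type(ind), ind.dtype, type(ind[0]))  # ndarray, object, ndarray
```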
> The internal C++ implementation returns a vector which is then unpacked into a list whose size is even pre-allocated with m * [None]. Maybe I'm missing context, but why not transform the C++ vector into a proper array instead?
>
> I'd be happy to make a PR that adds an option to return a more efficient data structure, though I'd like to know what kind of API update you would prefer. A new function query_radius or query_ball_point_arr (like sklearn)? I'd prefer SciPy's kdtree long-term, as sklearn is only an optional dependency.
A list is returned because cKDTree was written to behave identically to KDTree, which was a pure Python implementation. Why @aarchiba chose to return this data structure from KDTree I do not know. A likely reason is that on the Python side the results would be a "jagged array", which ndarrays do not support, except by having an ndarray with dtype=object and stacking ndarrays into an ndarray. In any case it would not allow for iteration without holding the GIL, as iterating something with dtype=object requires you to hold the GIL, and it would also incur a lot of reference counting.
Another oddity that has the same historical explanation is that the results from query_ball_point and query_ball_tree are transposed instead of identical.
I do know why I used a std::vector internally in the C++ code, though (obviously, since I wrote the code). That was to release the GIL while performing the query in C++, to prevent cKDTree from locking up the interpreter. A common use of kd-trees is games and computer graphics, and having a GUI or CG that freezes is bad, even if it just means missing a few frames. That was the very reason I moved cKDTree to C++.

The C++ code returns a std::vector of std::vectors.
Basically, to make this efficient you need to create a cdef class that stores the std::vector<std::vector<npy_intp>> internally and (optionally) returns it to Python instead of a list. Then your Cython code that uses cKDTree.query_ball_tree could grab the jagged array from this class and iterate without holding the GIL. It is actually rather trivial to implement.
> The internal C++ implementation returns a vector which is then unpacked into a list whose size is even pre-allocated with m * [None]. Maybe I'm missing context, but why not transform the C++ vector into a proper array instead?
The short answer is that numpy.ndarray does not support jagged 2D arrays, so you cannot just call PyArray_SimpleNewFromData.
Thanks for the insights! :) You're totally right, I glossed over the fact that each point may return a different number of neighbors. If I understand this correctly, a list of arrays would work for the jagged case? That might already be a lot easier to iterate.
> Basically, to make this efficient you need to create a cdef class that stores the std::vector<std::vector<npy_intp>> internally and (optionally) returns it to Python instead of a list. Then your Cython code that uses cKDTree.query_ball_tree could grab the jagged array from this class and iterate without holding the GIL. It is actually rather trivial to implement.
To clarify, are you suggesting to do this on skimage's or SciPy's side? I thought about using the internal C++ implementation more directly from our side, but wasn't sure how to do that.
> Thanks for the insights! :) You're totally right, I glossed over the fact that each point may return a different number of neighbors. If I understand this correctly, a list of arrays would work for the jagged case? That might already be a lot easier to iterate.
A list of arrays would work. However, you would need to hold the GIL while iterating over the list, and you would incur increfs and decrefs for each array in the list. You would also need to use the Python buffer API (slow) or the NumPy C API (faster) to grab the data from each array. Basically this will be slow too.
> Basically, to make this efficient you need to create a cdef class that stores the std::vector<std::vector<npy_intp>> internally and (optionally) returns it to Python instead of a list. Then your Cython code that uses cKDTree.query_ball_tree could grab the jagged array from this class and iterate without holding the GIL. It is actually rather trivial to implement.
>
> To clarify, are you suggesting to do this on skimage's or SciPy's side? I thought about using the internal C++ implementation more directly from our side, but wasn't sure how to do that.
If you are going to do this in skimage, you might consider just making a small C or C++ kd-tree that does just what you need and nothing else.
The ABI of scipy.spatial.cKDTree is not frozen, so using it with custom C++ would be fragile.
Making a tiny kd-tree for the purpose of range query is very easy. So my advice is do that.
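To give a feel for how little logic a range-query-only kd-tree needs, here is a toy pure-Python sketch (purely illustrative; the advice above is of course to write it in C or C++ for speed, and none of these names come from SciPy or skimage):

```python
import numpy as np

def build_kdtree(points, indices=None, depth=0):
    """Recursively build a kd-tree as nested (axis, point, index, left, right) tuples."""
    if indices is None:
        indices = np.arange(len(points))
    if len(indices) == 0:
        return None
    axis = depth % points.shape[1]
    order = indices[np.argsort(points[indices, axis])]
    mid = len(order) // 2
    return (
        axis,
        points[order[mid]],
        order[mid],
        build_kdtree(points, order[:mid], depth + 1),
        build_kdtree(points, order[mid + 1:], depth + 1),
    )

def range_query(node, center, r, out):
    """Collect indices of all points within distance r of `center`."""
    if node is None:
        return
    axis, point, index, left, right = node
    if np.linalg.norm(point - center) <= r:
        out.append(index)
    diff = center[axis] - point[axis]
    # Only descend into a half-space if the splitting plane is within r.
    if diff <= r:
        range_query(left, center, r, out)
    if diff >= -r:
        range_query(right, center, r, out)

points = np.random.default_rng(0).random((200, 2))
tree = build_kdtree(points)
found = []
range_query(tree, np.array([0.5, 0.5]), 0.1, found)
```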
As stated in the documentation, scipy.signal.find_peaks works for 1D signals. There are many attempts at a similar function for 2D arrays, such as https://stackoverflow.com/questions/3684484/peak-detection-in-a-2d-array, https://stackoverflow.com/questions/16842823/peak-detection-in-a-noisy-2d-array (or other methods in scikit-image: https://scikit-image.org/docs/stable/auto_examples/segmentation/plot_peak_local_max.html), etc.
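For context, the recipes in threads like those linked above commonly compare each pixel against a maximum filter of its neighborhood via scipy.ndimage; a rough sketch of that idea (the helper name and parameters here are just illustrative):

```python
import numpy as np
from scipy import ndimage

def detect_peaks_2d(image, size=3):
    """Boolean mask of pixels that equal the maximum of their neighborhood."""
    local_max = ndimage.maximum_filter(image, size=size) == image
    # Drop the background so flat low regions are not all flagged as peaks;
    # a real implementation would want a proper threshold/prominence filter.
    background = image == image.min()
    return local_max & ~background

y, x = np.mgrid[0:50, 0:50]
image = np.exp(-((x - 10) ** 2 + (y - 12) ** 2) / 20.0)
print(np.argwhere(detect_peaks_2d(image)))  # the single peak, at (row=12, col=10)
```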
Is there an implementation of 2D-array peak detection that could become an "official" scipy.signal.find_peaks_2d function? This would be useful for many scientific purposes.
Have there been discussions about adding 2D peak detection in scipy, with prominence like in scipy.signal.find_peaks?