scipy / scipy

SciPy library main repository
https://scipy.org
BSD 3-Clause "New" or "Revised" License
13.08k stars 5.19k forks source link

ENH: scipy.signal.find_peaks for 2D signals #20489

Open josephernest opened 7 months ago

josephernest commented 7 months ago

As stated in the documentation, scipy.signal.find_peaks works for 1D signals.

There are many attempts of similar function for 2D arrays, such as https://stackoverflow.com/questions/3684484/peak-detection-in-a-2d-array, https://stackoverflow.com/questions/16842823/peak-detection-in-a-noisy-2d-array, (or other methods in sklearn: https://scikit-image.org/docs/stable/auto_examples/segmentation/plot_peak_local_max.html), etc.

Is there an implementation of 2D-array peak detection that could become an "official" scipy.signal.find_peaks_2d function?

This would be interesting in many scientific purposes.

Have there been discussions about adding 2D peak detection in scipy, with prominence like in scipy.signal.find_peaks?

tupui commented 7 months ago

Thanks @josephernest for the report.

@lagru can you take this one? 😃

lagru commented 7 months ago

Is there an implementation of 2D-array peak detection that could become an "official" scipy.signal.find_peaks_2d function?

As far as I'm aware there isn't but there has been some work in scikit-image to get there for 2D and more dimensions.

For prominence filtering there hasn't been anything that I'm aware of. I've thought about tackling that after https://github.com/scikit-image/scikit-image/pull/4165 though.

A quick search also brought up https://github.com/Xunius/python_peak_promience2d (via this ST answer) but I haven't investigated it closely yet.

tupui commented 6 months ago

Thanks a lot @lagru. Would you say this should be kept open as wanted or it's something to do in another library than SciPy.

josephernest commented 6 months ago

Thanks @tupui and @lagru.

Would you say this should be kept open as wanted or it's something to do in another library than SciPy.

The rationale behind this was the fact that scipy.signal.find_peaks is really popular (see e.g. https://stackoverflow.com/questions/1713335/peak-finding-algorithm-for-python-scipy), and I think people are looking for an equivalent of this for 2D arrays, possibly with the same Scipy API (prominence, etc.).

lagru commented 6 months ago

Would you say this should be kept open as wanted or it's something to do in another library than SciPy.

That depends on whether you see this functionality as within scope of SciPy. I've never been really clear on the border between scipy.ndimage and skimage. I'd say keep this open. Might be a good indicator down the line for where people look for this functionality first.

Personally, it seems general enough to me to include it in SciPy with what's available already. And I wouldn't be opposed to migrating that code to SciPy. I will probably try to move this along within scikit-image though first and then look into whether it is worth moving later. It's would also be trivial to point users to scikit-image in find_peaks' docstring and within an exception message.

tupui commented 6 months ago

Sounds good 👍 thanks 🙏

lagru commented 6 months ago

Fyi, I polished off https://github.com/scikit-image/scikit-image/pull/4165 and hope that we get the reviews to merge it soon. @josephernest, would that be a step in the right direction?

@tupui, that PR implements another algorithm in skimage that repeatedly calls on https://github.com/scipy/scipy/blob/7d4600e705f0929a77097d82d0d605aeb9c9bc50/scipy/spatial/_ckdtree.pyx#L850-L853

from within Cython code. Unfortunately, it returns a list which together with the Python-facing API makes it somewhat hard to release the GIL and do fast iteration. I didn't take a thorough look at all discussions here, but do you know of any efforts in SciPy to expose a C,C++-API like NumPy does with get_include?

If we could wrap the functionality from cKDTree directly in Cython that would likely yield significant performance benefits. The only other solution I can think of would be to vendor it, which seems wasteful.

tupui commented 6 months ago

Thanks @lagru!

I am not sure there are discussions for that. But worth asking @tylerjereddy.

tylerjereddy commented 6 months ago

Oh gosh, I'm not aware of any plans to make our spatial C/C++-level "API" consumable in a public sense. I guess there's some stuff happening like that in special, but a decent chunk of spatial is wrapped around Qhull, and the community also has the (more?) popular/competing license-incompatible CGAL, which is quite different to use I think.

I'm also not entirely clear on the route to performance improvement here, since the function can return ragged data structures, which is why the nested objects/lists are used. It already takes a workers argument, is there no way for you to use concurrency to get some speedup that way after accumulating enough points? I'll take a look at the skimage source in a minute to see if anything sticks out, and maybe ask for sample timings there.

tylerjereddy commented 6 months ago

We can also ask @sturlamolden of course.

tylerjereddy commented 6 months ago

And, not to derail even more, but sklearn also has KDTree. Maybe some opportunities to consolidate things somehow, but yikes on the amount of effort that would be to coordinate (I talked to Thomas Fan about that at some point at SciPy conference).

tupui commented 6 months ago

Thanks Tyler. I agree that it would be good if we could coordinate something with sklearn 👍

lagru commented 6 months ago

And, not to derail even more, but sklearn also has KDTree.

Oh, sklearn's KDTree returns arrays on query. I definitely need to check this out from a performance perspective. Maybe the biggest challenge with SciPy's current version is iterating the results quickly. The returned list or (array of lists) makes it hard to do fast Cython-indexing or using nogil.

Speaking of, I'm a bit surprised by this bit

https://github.com/scipy/scipy/blob/8431e12346ac9564e244bfce920dc8a04bcabbd9/scipy/spatial/_ckdtree.pyx#L962

https://github.com/scipy/scipy/blob/8431e12346ac9564e244bfce920dc8a04bcabbd9/scipy/spatial/_ckdtree.pyx#L983-L989

The internal C++ implementation returns a vector which is then unpacked into a list whose size is even pre-allocated with m * [None]. Maybe I'm missing context, but why not transform the C++ vector into a proper array instead?

I'd be happy to make a PR that adds an option to return a more efficient data structure. Though, I'd be happy to know what kind of API update you would prefer. A new function query_radius, query_ball_point_arr (like sklearn)? I'd prefer SciPy's kdtree long-term, as sklearn is only an optional dependency.

sturlamolden commented 6 months ago

The internal C++ implementation returns a vector which is then unpacked into a list whose size is even pre-allocated with m * [None]. Maybe I'm missing context, but why not transform the C++ vector into a proper array instead?

I'd be happy to make a PR that adds an option to return a more efficient data structure. Though, I'd be happy to know what kind of API update you would prefer. A new function query_radius, query_ball_point_arr (like sklearn)? I'd prefer SciPy's kdtree long-term, as sklearn is only an optional dependency.

A list is returned because cKDTree was written to behave identical to KDTree, which was a pure Python implementation. Why @aarchiba chose to return this data structure from KDTree I do not know. A likely reason is that at the Python side the results would be a "jagged array" which ndarray do not support, except by having an ndarray with dtype=object and stacking ndarrays into an ndarray. In any case it would not allow for iteration without holding the GIL, as iterating something with dtype=object requires you to hold the GIL, and it would also incure a lot of reference counting.

Another oddity that has the same historical explanation is that the results from query_ball_pointand query_ball_tree are transposed instead of identical.

I do now why I used a std::vector internally in the C++ code, though (obviously, since I wrote the code). That was to release the GIL while performing the query in C++, to precent cKDTree from locking up the interpreter. A common use of kd-trees are games and computer graphics, and having a GUI or CG that freezes is bad, even if it just means missing a few frames. That was the very reason I moved cKDTree to C++.

The C++ code returns a std::vector of std::vector, which is a simple way of making a jagged 2D array in C++.

Basically to make this efficient, you need to create a cdef class that stores the std::vector< std::vector<npy_intp> > internally and (optinally) return this to Python instead of a list. Then your Cython code that uses cKDTree.query_ball_tree could grab the jagged array from this class and iterate without holding the GIL. It is actually rather trivial to implement.

sturlamolden commented 6 months ago

The internal C++ implementation returns a vector which is then unpacked into a list whose size is even pre-allocated with m * [None]. Maybe I'm missing context, but why not transform the C++ vector into a proper array instead?

The short ansver is numpy.ndarray does not support jagged 2D arrays, so you cannot just call PyArray_SimpleNewFromData.

https://en.wikipedia.org/wiki/Jagged_array

lagru commented 6 months ago

Thanks for the insights! :) You totally right, I glossed over the fact, that each point may return a different number of neighbors. If I understand this correctly, a list of arrays would work for the jagged case? That might already be a lot easier to iterate.

Basically to make this efficient, you need to create a cdef class that stores the std::vector< std::vector<npy_intp> > internally and (optinally) return this to Python instead of a list. Then your Cython code that uses cKDTree.query_ball_tree could grab the jagged array from this class and iterate without holding the GIL. It is actually rather trivial to implement.

To clarify, are you suggesting to do this on skimage's or SciPy's side? I thought about using the internal C++ implementation more directly from our side, but wasn't sure how to do that.

sturlamolden commented 6 months ago

Thanks for the insights! :) You totally right, I glossed over the fact, that each point may return a different number of neighbors. If I understand this correctly, a list of arrays would work for the jagged case? That might already be a lot easier to iterate.

A list of arrays would work. However, you would need to hold the GIL while iterating over the list, and you would incure increfs and defcrefs for each array in the list. You would also need to use the Python buffer API (slow) or NumPy C API (faster) to grab the data from each array. Basically this will be slow too.

Basically to make this efficient, you need to create a cdef class that stores the std::vector< std::vector<npy_intp> > internally and (optinally) return this to Python instead of a list. Then your Cython code that uses cKDTree.query_ball_tree could grab the jagged array from this class and iterate without holding the GIL. It is actually rather trivial to implement.

To clarify, are you suggesting to do this on skimage's or SciPy's side? I thought about using the internal C++ implementation more directly from our side, but wasn't sure how to do that.

If you are going to do this in skimage you might consider just making a small C or C++ kd-tree that will just to what you need and nothing else.

The ABI of scipy.spatial.cKDTree is not frozen, so using it with custom C++ would be fragile.

Making a tiny kd-tree for the purpose of range query is very easy. So my advice is do that.