spatstat / spatstat.geom

Sub-package of spatstat defining spatial data and spatial/geometrical operations
7 stars 4 forks source link

Supply dmax argument to nncross.ppp function #18

Closed sthawinke closed 9 months ago

sthawinke commented 9 months ago

Hello,

I appreciate the fast implementation in C of the nncross.ppp function, but unfortunately the call dmax <- diameter(boundingbox(as.rectangle(X), as.rectangle(Y))) is consuming most of the time when calling this function repeatedly on the same point pattern for different features. If I could precalculate dmax that would really speed up things. Any chance you could add dmax = diameter(boundingbox(as.rectangle(X), as.rectangle(Y))) as default argument to the header so I can override it?

Many thanks,

Stijn

baddstats commented 9 months ago

I'm surprised that you're finding this line takes a long time to compute. Thanks for drawing attention to this.

I have rewritten this line more efficiently in nncross.ppp and also accelerated the code for diameter.owin for the case of rectangles. Please try it in the development version spatstat.geom 3.2-8.001.

I am reluctant to add an argument dmax unless it is really necessary. Let's first see if the code is now fast enough for your needs.

sthawinke commented 9 months ago

Thanks for the update! It has indeed practically eliminated the computation time for boundingbox and diameter. Yet for repeated calls to nncross.ppp on small point patterns, the as.rectangle function still takes most of the time. I used the following benchmark:

library(profvis)
library(spatstat)
bdry <- list(x=c(0.1,0.3,0.7,0.4,0.2), y=c(0.1,0.1,0.5,0.7,0.3))
w <- owin(poly=bdry)
X = runifpoint(1e2, w)
Y = runifpoint(1e2, w)
profvis(replicate(nncross(X, Y, what = "dist"), n = 1e4))

This scenario is quite common in genomics data where many feature combinations are tested in the same point pattern. My code is still much faster if I can bypass the dmax calculation. I have tried to call the underlying C-function directly, but I am struggling to incorporate this cleanly in a package (see https://stackoverflow.com/questions/77881302/c-function-from-spatstat-geom-package-not-available-in-r-although-it-is-expo?noredirect=1#comment137299936_77881302). Do you think this is a viable alternative?

rubak commented 9 months ago

Do you have your own C-code in a package and want to call the spatstat C-code from your own C-code? Or are you only talking about calling the spatstat C-code as a function call from R? I think the former is difficult, but the latter should be doable. As I understand your post on StackOverflow the problem is that you need to use ::: to access the function which is not allowed in a (CRAN) package. Is that correct? I don't know right now if it is a viable solution for us to export the C-function.

baddstats commented 9 months ago

Could you please try the code again in the new update spatstat.geom 3.2-8.002 ?

If as.rectangle is taking a substantial amount of time, then that is a big concern for the whole of spatstat, which relies heavily on this low level function. So I would really like to get to the bottom of this. In the latest update to spatstat.geom I have accelerated owin and as.rectangle and I hope that this will resolve the problem.

In any case I will investigate whether we can add the dmax argument as you requested.

baddstats commented 9 months ago

I've also added a dmax argument to nncross.ppp in spatstat.geom 3.2-8.004 as an experiment.

rubak commented 9 months ago

I can confirm that the new version of spatstat.geom cuts computation time of the original example in half from ~4.2s to ~2.1s on my computer. Adding a dmax-value cuts another ~.5s for a total time of 1.6s. Finally, sorting X and Y and using the sorting arguments saves almost another second for a final computation time of ~0.6s (not counting the time to do the sorting once).

baddstats commented 9 months ago

I have accelerated the code further, by effectively setting dmax=sqrt(.Machine$double.xmax) which is the largest possible distance computable in the C code. The function nncross.ppp does not need an argument dmax any more, so it has been removed. This should be faster than the versions described above. This is in spatstat.geom 3.2-8.006

rubak commented 9 months ago

I can confirm that the new code without an dmax argument, is equally fast as a user-provided dmax in the previous version, so I think the problem is solved as well as we can, and the issue should be closed. Are there any loose ends @sthawinke ?

sthawinke commented 9 months ago

I have run a quick benchmark on my machine with the code I gave above:

Cran version: 11.4 s Latest GitHub version: 1.7s Latest GitHub version with presorting 0.32s

Issue solved indeed, thanks for tackling it thoroughly