mmaelicke / scikit-gstat

Geostatistical variogram estimation expansion in the scipy style
https://mmaelicke.github.io/scikit-gstat/
MIT License
225 stars 53 forks

Possible performance increase for `Variogram._calc_groups()` #145

Open MartinSchobben opened 1 year ago

MartinSchobben commented 1 year ago

I ran into this problem while dealing with a large amount of data. I wonder whether replacing the for loop in `Variogram._calc_groups()` with `numpy.digitize()` might be an option. I am not sure whether the output is in the desired format, but it would certainly be faster. See below for more details and some benchmark output:

import numpy as np
import timeit

# distance
d = np.arange(0, 1e6)
# get the bin edges
bin_edges = np.linspace(0, d.max(), int(1e5))

n = 5

# for loop
def for_loop_solution():
    # -1 is the group for distances outside maxlag
    groups = np.ones(len(d), dtype=int) * -1

    for i, bounds in enumerate(zip([0] + list(bin_edges), bin_edges)):
        groups[np.where((d >= bounds[0]) & (d < bounds[1]))] = i

    print("Object `groups`: ", groups, " of size: ", len(groups))

result = timeit.timeit(stmt='for_loop_solution()', globals=globals(), number=n)

# average execution time over n runs
print(f"Execution time is {result / n} seconds")

# numpy
def numpy_solution():
    groups = np.digitize(d, bin_edges)
    print("Object `groups`: ", groups, " of size: ", len(groups))

result = timeit.timeit(stmt='numpy_solution()', globals=globals(), number=n)

# average execution time over n runs
print(f"Execution time is {result / n} seconds")

Output:

Object `groups`:  [    1     1     1 ... 99999 99999    -1]  of size:  1000000
Object `groups`:  [    1     1     1 ... 99999 99999    -1]  of size:  1000000
Object `groups`:  [    1     1     1 ... 99999 99999    -1]  of size:  1000000
Object `groups`:  [    1     1     1 ... 99999 99999    -1]  of size:  1000000
Object `groups`:  [    1     1     1 ... 99999 99999    -1]  of size:  1000000
Execution time is 103.20237362137996 seconds
Object `groups`:  [     1      1      1 ...  99999  99999 100000]  of size:  1000000
Object `groups`:  [     1      1      1 ...  99999  99999 100000]  of size:  1000000
Object `groups`:  [     1      1      1 ...  99999  99999 100000]  of size:  1000000
Object `groups`:  [     1      1      1 ...  99999  99999 100000]  of size:  1000000
Object `groups`:  [     1      1      1 ...  99999  99999 100000]  of size:  1000000
Execution time is 0.01940285782329738 seconds
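One caveat visible in the output above: `np.digitize` labels distances at or beyond the last bin edge with `len(bin_edges)`, while the for loop marks them with `-1` (hence the differing last elements). A minimal sketch of how the two conventions could be reconciled, using a hypothetical helper `digitize_groups` (not part of scikit-gstat):

```python
import numpy as np

def digitize_groups(d, bin_edges):
    # 0-based bin index per distance; d < bin_edges[0] maps to group 0,
    # matching the loop's implicit first interval [0, bin_edges[0])
    groups = np.digitize(d, bin_edges)
    # distances at or beyond the last edge lie outside maxlag -> -1
    groups[d >= bin_edges[-1]] = -1
    return groups

d = np.array([0.5, 1.5, 3.0, 5.0])
edges = np.array([1.0, 2.0, 3.0])
groups = digitize_groups(d, edges)
# first two distances fall into groups 0 and 1; the last two lie outside maxlag
```

The out-of-range fix is a single boolean assignment, so the whole operation stays vectorized and the speedup over the loop is preserved.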
mmaelicke commented 1 year ago

Hi Martin,

That looks more than promising. I was always aware of that bottleneck but never found the time to solve it, so I am more than thrilled to implement your solution. I hope to do that soon. If the implementation is urgent for you, you can also file a PR, or let me know and I will try to hurry up a bit.

Best, Mirko

MartinSchobben commented 1 year ago

Hi Mirko,

There is no real hurry. I was taking apart your code as an example for an application meant for internal use at TU Wien. I would like to have variograms for ASCAT soil moisture data, but there I need some more performance, so I was profiling the code and looking to build in some memoization, e.g. for methods that contain expensive calculations. That is how I came across this, and I thought I would let you know.
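The kind of memoization mentioned above can be sketched with `functools.cached_property` from the standard library; the class and names here are hypothetical, not from scikit-gstat:

```python
from functools import cached_property
import numpy as np

class ExpensiveModel:
    """Hypothetical example: cache an expensive derived result so that
    repeated attribute accesses do not recompute it."""

    def __init__(self, distances):
        self._distances = np.asarray(distances)

    @cached_property
    def sorted_distances(self):
        # computed once on first access, then stored on the instance
        return np.sort(self._distances)
```

Care is needed when cached values depend on mutable state: the cache must be invalidated (e.g. `del instance.sorted_distances`) whenever the underlying data changes.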

Perhaps I can show you this code some time, also to check whether credits are due somewhere in case it goes public.

I would also be happy to try and turn the above into a PR.

Kind regards, Martin

mmaelicke commented 1 year ago

Yeah, nice. Then I guess the best way to proceed is for you to open a PR with your code, which also makes you a collaborator (and co-author of the Zenodo publication). That way credit is attributed clearly and transparently. I can also assist with the PR if you feel I can be helpful.

MartinSchobben commented 1 year ago

Yes, and I should probably list you as a contributor (or cite your work) on my ASCAT application if it goes public, as I rely heavily on your knowledge of spatial autocorrelation.

mmaelicke commented 1 year ago


Just to let you know: there is also a model description paper in GMD which you can cite if you use the package. DOI: 10.5194/gmd-15-2505-2022, URL: https://gmd.copernicus.org/articles/15/2505/2022/gmd-15-2505-2022.html