neurodata / hyppo

Python package for multivariate hypothesis testing
https://hyppo.neurodata.io/

Improve speed of FriedmanRafsky code #403

Closed quant12345 closed 1 year ago

quant12345 commented 1 year ago

Closes issue #402

Following the discussion here:

I would like to improve the code of the `prim` function in the `FriedmanRafsky` class.

Since the function carries the numba decorator `@jit(nopython=True, cache=True)`, the first call includes JIT compilation and takes a long time. For a 100 x 100 array, the original turned out to be 30-120 times slower than my version on the first run. Subsequent calls, however, are about 5 times faster than my version.

Without numba, the speedup of my version is about 29 times for a 500 x 500 array, and the speedup appears at array sizes of roughly 20 x 20 and above.
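The vectorized idea can be sketched as follows: a minimal, loop-free Prim step in plain NumPy. The function name `prim_vectorized` and the use of `np.ix_` are my illustration, not the exact PR code, and it assumes all off-diagonal weights are positive (as with Euclidean distances between distinct points):

```python
import numpy as np

def prim_vectorized(weight_mat):
    """Sketch of Prim's algorithm with the inner scan vectorized.

    Grows the tree from vertex 0; assumes a symmetric weight matrix
    whose off-diagonal entries are positive.
    """
    V = weight_mat.shape[0]
    selected = np.zeros(V, dtype=bool)
    selected[0] = True
    connections = []
    for _ in range(V - 1):
        rows = np.where(selected)[0]    # vertices already in the tree
        cols = np.where(~selected)[0]   # vertices not yet in the tree
        # all tree -> non-tree candidate edge weights as one 2-D block
        block = weight_mat[np.ix_(rows, cols)]
        # position of the smallest candidate edge within the block
        i, j = np.unravel_index(np.argmin(block), block.shape)
        x, y = rows[i], cols[j]
        connections.append([x, y])
        selected[y] = True
    return connections
```

Instead of two scalar loops per step, each iteration scans the whole tree/non-tree block at once, which is what lets NumPy beat the un-jitted original.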

netlify[bot] commented 1 year ago

Deploy Preview for hyppo ready!

| Name | Link |
|---|---|
| Latest commit | e96a4564e31bf61cf2b62bddbb21bb88fbcfacac |
| Latest deploy log | https://app.netlify.com/sites/hyppo/deploys/64fb1b019754840008b7fb2e |
| Deploy Preview | https://deploy-preview-403--hyppo.netlify.app |

sampan501 commented 1 year ago

> Tests are failing. I assume it's because the `MST` function that calls `prim` also has a numba decorator. I tried locally without the decorator and the function worked without errors. I'll try to write the `MST` function without loops so the decorator isn't needed. If possible, could you share sample data (`x`, `labels`) for the `MST` function?

Sounds good. Once everything works, I'd like to see the wall-times figure again, as shown in #402.

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.01% :tada:

Comparison is base (9e7fc39) 96.71% compared to head (e96a456) 96.72%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #403      +/-   ##
==========================================
+ Coverage   96.71%   96.72%   +0.01%
==========================================
  Files          86       86
  Lines        3891     3907      +16
==========================================
+ Hits         3763     3779      +16
  Misses        128      128
```

| [Files Changed](https://app.codecov.io/gh/neurodata/hyppo/pull/403?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=neurodata) | Coverage Δ | |
|---|---|---|
| [hyppo/independence/friedman\_rafsky.py](https://app.codecov.io/gh/neurodata/hyppo/pull/403?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=neurodata#diff-aHlwcG8vaW5kZXBlbmRlbmNlL2ZyaWVkbWFuX3JhZnNreS5weQ==) | `100.00% <100.00%> (ø)` | |

:umbrella: View full report in Codecov by Sentry.

:loudspeaker: Have feedback on the report? Share it here.

quant12345 commented 1 year ago

I removed the numba decorators and their imports. I also made a version of the `MST` function without loops, but did not add it to the commit. The attached chart shows a run of the 'Rafsky_performance.py' script, where the original runs without numba. The function MST_new (new_alg) is the loop-free variant; MST_old_prim_new (new_alg_MSTold) computes MST as before. Both variants compute the `prim` step in the new way. You can see that the new `MST` version starts to degrade sharply at the end.

The 'Rafsky_loop.py' script runs the computation in a loop and prints the elapsed time of each algorithm, along with the ratio of the original algorithm's time to one of the new algorithms (in this case, MST_new). As before, the first run with numba is 70-130 times slower.

One question: if the list `i_min` is empty, that is, there is no minimum, an error could occur, though only in theory. I have run the scripts dozens of times and this has never happened. I can add a check for an empty list, but that would slightly reduce performance. Should I?

I am attaching both scripts (Rafsky.tar.gz).
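For what it's worth, the empty-minimum guard could look like this. This is a sketch; `safe_min_index` is a hypothetical helper, not part of the PR:

```python
import numpy as np

def safe_min_index(sample):
    """Return the index of the smallest nonzero entry of `sample`,
    or None when every entry is zero (the 'no minimum' case)."""
    nonzero = sample[np.nonzero(sample)]
    if nonzero.size == 0:
        # np.min over an empty selection would raise ValueError here
        return None
    return np.where(sample == np.min(nonzero))
```

The `size` check is a single integer comparison per iteration, so the performance cost should be negligible.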

quant12345 commented 1 year ago
```
n 100
time_original         0:00:02.309068
time_MST_old_prim_new 0:00:00.041586
time_new              0:00:00.026029
difference in time original / new     88.71136040570133
difference in time original / MST_old 55.52512864906459
*******************************************************
n 205
time_original         0:00:02.260544
time_MST_old_prim_new 0:00:00.197153
time_new              0:00:00.101197
difference in time original / new     22.33805349960967
difference in time original / MST_old 11.46593762204988
*******************************************************
n 450
time_original         0:00:02.429262
time_MST_old_prim_new 0:00:01.229136
time_new              0:00:01.046668
difference in time original / new     2.320947998792358
difference in time original / MST_old 1.976398055219276
```

I ran the loop several times with different options. With an array size of 100, the first numba run is 50-130 times slower than my version; in that case the numba computation takes about two seconds. With an array of 205, the difference is already only 10-30 times, and with 450 it is 2-3 times.

Apparently numba starts to accelerate even during the first run after a certain amount of time, because with an array size of 10 the first numba run takes 2.30 seconds, the same elapsed time as for an array of 100.

Attached is the 'newRafsky_loop.py' file. In it, the first iteration prints the ratio of the original algorithm's time to the new one, and all subsequent iterations print the inverse ratio. The data is printed out.

If you decide to merge, I have a question. `INF = 9999999` is used to initialize the first minimum (it is meant to act as the largest possible number; strictly speaking, that role belongs to `np.inf`).

In that case `sample < minimum` is not needed, and instead of:

`i_min = np.where((sample < minimum) & (sample == np.min(sample[np.nonzero(sample)])))`

we just need:

`i_min = np.where(sample == np.min(sample[np.nonzero(sample)]))`

newRafsky_loop.py.tar.gz
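The claimed equivalence is easy to check on a toy array (the values below are mine, chosen only for illustration): with `minimum` initialized to `np.inf`, the `sample < minimum` mask is all-True for finite weights, so both expressions select the same index.

```python
import numpy as np

sample = np.array([0.0, 2.5, 0.7, 3.1])
minimum = np.inf  # every finite weight satisfies sample < minimum

with_mask = np.where(
    (sample < minimum) & (sample == np.min(sample[np.nonzero(sample)]))
)
without_mask = np.where(sample == np.min(sample[np.nonzero(sample)]))
# both pick index 2, the smallest nonzero entry (0.7)
```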

quant12345 commented 1 year ago

> Tests are failing. I assume it's because the `MST` function that calls `prim` also has a numba decorator. I tried locally without the decorator and the function worked without errors. I'll try to write the `MST` function without loops so the decorator isn't needed. If possible, could you share sample data (`x`, `labels`) for the `MST` function?
>
> Sounds good. Once everything works, I'd like to see the wall-times figure again, as shown in #402.

Will there be any response to these changes?

quant12345 commented 1 year ago

@sampan501 P.S. I kept only the minimum-finding step in `i_min`, because I think that is what effectively happens in the original, since `INF` is simply a very large number. The tests and comparisons give identical arrays. As a result, with numba the first launch is slower up to an array size of about n = 450.

<details>
<summary>code tests</summary>

```python
import numpy as np
import perfplot
from numba import jit


@jit(nopython=True, cache=True)  # pragma: no cover
def prim(weight_mat, labels):
    INF = 9999999
    V = len(labels)
    selected = np.zeros(len(labels))
    no_edge = 0
    selected[0] = True
    MST_connections = []

    while no_edge < V - 1:
        minimum = INF
        x = 0
        y = 0
        for i in range(V):
            if selected[i]:
                for j in range(V):
                    # not yet selected and there is an edge
                    if (not selected[j]) and weight_mat[i][j]:
                        if minimum > weight_mat[i][j]:
                            minimum = weight_mat[i][j]
                            x = i
                            y = j
        MST_connections.append([x, y])
        selected[y] = True
        no_edge += 1

    return MST_connections


@jit(nopython=True, cache=True)
def MST(x, labels):  # pragma: no cover
    G = np.zeros((len(x[0]), len(x[0])))
    for i in range(len(x[0])):
        for j in range(i + 1, len(x[0])):
            weight = np.linalg.norm(x[:, i] - x[:, j])
            G[i][j] = weight
            G[j][i] = weight
    MST_connections = prim(G, labels)
    return MST_connections


def prim_new(weight_mat, labels):
    V = len(labels)
    selected = np.zeros(len(labels))
    no_edge = 0
    selected[0] = True
    MST_connections = []

    while no_edge < V - 1:
        rows_true = np.where(selected)[0]
        columns_false = np.where(np.logical_not(selected))[0]
        # Based on the rows marked True and the columns marked False,
        # build indexes 'ind' to fetch values from the 'weight_mat' array.
        ind = np.array(
            np.meshgrid(rows_true, columns_false, indexing='ij')
        ).reshape(2, -1)
        sample = weight_mat[ind[0], ind[1]]
        # 'i_min' is the index of the minimum in 'sample', from which we
        # extract the corresponding index into the 'weight_mat' array.
        i_min = np.where(sample == np.min(sample[np.nonzero(sample)]))
        x = ind[0][i_min][0]
        y = ind[1][i_min][0]
        MST_connections.append([x, y])
        selected[y] = True
        no_edge += 1

    return MST_connections


def MST_old_prim_new(x, labels):  # pragma: no cover
    G = np.zeros((len(x[0]), len(x[0])))
    for i in range(len(x[0])):
        for j in range(i + 1, len(x[0])):
            weight = np.linalg.norm(x[:, i] - x[:, j])
            G[i][j] = weight
            G[j][i] = weight
    MST_connections = prim_new(G, labels)
    return MST_connections


def gen(n: int) -> list:
    print('Array size at a given iteration', n)
    weight_mat = np.random.uniform(low=-5, high=5, size=(n, n + 2))
    labels = np.random.choice([0, 1], n, p=[0.5, 0.5])
    return [weight_mat, labels]


def original(s: list):
    return MST(s[0], s[1])


def new_alg_MSTold(s: list):
    return MST_old_prim_new(s[0].copy(), s[1].copy())


def equality_check(x, y) -> bool:
    """
    Check that the results of 'original' and 'new_alg' are equal.
    Returns False if at least one element differs.
    """
    tf = np.all(np.round(x, 5) == np.round(y, 5))
    print('Comparison result', tf)
    return tf


perfplot.show(
    setup=gen,
    kernels=[original, new_alg_MSTold],
    n_range=[2 ** k for k in range(1, 9)],
    equality_check=equality_check,
    xlabel="len(s)",
)
```

</details>

I also tried an array size of 4096 with numba; below is a graph with the results, where the final array size is 4096 (at that size, the computation of both algorithms took more than 20 minutes). with_numba4096

quant12345 commented 1 year ago

> @quant12345 Thank you for your hard work. Once you add a unit test specific to your changes that verifies that, given an MWE, you get the same answer, we should be good to merge.

@sampan501

This is where I found the tests. I have now noticed that in the `FriedmanRafsky` class, the `test` method uses `self.statistic`, and it probably runs `MST` (which runs `prim`) multiple times. If that is really the case, then my algorithm will lose to numba and my changes make no sense. Nevertheless, I am attaching a file with the old and new classes, 'friedman_rafsky.py', which is used for the tests.

It probably doesn't make sense, though.
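For context on why per-call speed matters here: a permutation test re-evaluates the statistic many times under shuffled labels. A schematic of that structure (not hyppo's actual implementation; the name `permutation_pvalue` is mine) looks like:

```python
import numpy as np

def permutation_pvalue(statistic, x, y, reps=1000, rng=None):
    """Schematic permutation test: the statistic (which would call MST
    and prim internally) runs once on the observed labels and `reps`
    more times on shuffled labels."""
    rng = np.random.default_rng(rng)
    observed = statistic(x, y)
    null = np.array([statistic(x, rng.permutation(y)) for _ in range(reps)])
    # fraction of permuted statistics at least as extreme as observed
    return (1 + np.sum(null >= observed)) / (1 + reps)
```

Since the statistic runs `reps + 1` times per call to `test`, numba's one-off compilation cost gets amortized, which is why the jitted version can still win there even though its first call is slow.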

<details>
<summary>code test_friedman_rafsky</summary>

```python
import numpy as np
import pytest
from numpy.testing import assert_almost_equal, assert_raises, assert_warns
from hyppo.tools import linear, power
from friedman_rafsky import FriedmanRafsky_new, FriedmanRafsky
import datetime


class TestFriedmanRafskyStat_new:
    @pytest.mark.parametrize("n", [100])
    @pytest.mark.parametrize("num_runs", [48])
    @pytest.mark.parametrize("obs_stat", [-0.527])
    @pytest.mark.parametrize("obs_pvalue", [0.748])
    def test_linear_oned(self, n, num_runs, obs_stat, obs_pvalue):
        np.random.seed(123456789)
        x, y = linear(n, 1)
        num_rows, num_cols = x.shape
        y = np.random.choice([0, 1], num_rows, p=[0.5, 0.5])
        y = np.transpose(y)
        now = datetime.datetime.now()
        stat1, pvalue1, _ = FriedmanRafsky_new().test(x, y)
        stat2 = FriedmanRafsky_new().statistic(x, y)
        print('new_time', datetime.datetime.now() - now)
        print('stat1', stat1, 'obs_stat', obs_stat)
        print('stat2', stat2, 'num_runs', num_runs)
        print('pvalue1', pvalue1, 'obs_pvalue', obs_pvalue)
        assert_almost_equal(stat1, obs_stat, decimal=2)
        assert_almost_equal(stat2, num_runs, decimal=2)
        assert_almost_equal(pvalue1, obs_pvalue, decimal=2)

    @pytest.mark.parametrize("n", [100, 200])
    def test_rep(self, n):
        x, y = linear(n, 1)
        num_rows, num_cols = x.shape
        y = np.random.choice([0, 1], num_rows, p=[0.5, 0.5])
        y = np.transpose(y)
        now = datetime.datetime.now()
        stat1, pvalue1, _ = FriedmanRafsky_new().test(x, y, random_state=2)
        stat2, pvalue2, _ = FriedmanRafsky_new().test(x, y, random_state=2)
        print('new_time', datetime.datetime.now() - now)
        print('new test_rep')
        print('stat1', stat1, 'pvalue1', pvalue1)
        print('stat2', stat2, 'pvalue2', pvalue2)
        assert stat1 == stat2
        assert pvalue1 == pvalue2


class TestFriedmanRafskyStat:
    @pytest.mark.parametrize("n", [100])
    @pytest.mark.parametrize("num_runs", [48])
    @pytest.mark.parametrize("obs_stat", [-0.527])
    @pytest.mark.parametrize("obs_pvalue", [0.748])
    def test_linear_oned(self, n, num_runs, obs_stat, obs_pvalue):
        np.random.seed(123456789)
        x, y = linear(n, 1)
        num_rows, num_cols = x.shape
        y = np.random.choice([0, 1], num_rows, p=[0.5, 0.5])
        y = np.transpose(y)
        print('old TestFriedmanRafskyStat')
        now = datetime.datetime.now()
        stat1, pvalue1, _ = FriedmanRafsky().test(x, y)
        stat2 = FriedmanRafsky().statistic(x, y)
        print('old_time', datetime.datetime.now() - now)
        print('stat1', stat1, 'obs_stat', obs_stat)
        print('stat2', stat2, 'num_runs', num_runs)
        print('pvalue1', pvalue1, 'obs_pvalue', obs_pvalue)
        assert_almost_equal(stat1, obs_stat, decimal=2)
        assert_almost_equal(stat2, num_runs, decimal=2)
        assert_almost_equal(pvalue1, obs_pvalue, decimal=2)

    @pytest.mark.parametrize("n", [100, 200])
    def test_rep(self, n):
        x, y = linear(n, 1)
        num_rows, num_cols = x.shape
        y = np.random.choice([0, 1], num_rows, p=[0.5, 0.5])
        y = np.transpose(y)
        now = datetime.datetime.now()
        stat1, pvalue1, _ = FriedmanRafsky().test(x, y, random_state=2)
        stat2, pvalue2, _ = FriedmanRafsky().test(x, y, random_state=2)
        print('old_time', datetime.datetime.now() - now)
        print('old test_rep')
        print('stat1', stat1, 'pvalue1', pvalue1)
        print('stat2', stat2, 'pvalue2', pvalue2)
        assert stat1 == stat2
        assert pvalue1 == pvalue2


cl_new = TestFriedmanRafskyStat_new()
cl_new.test_linear_oned(100, 48, -0.527, 0.748)
cl_new.test_rep(100)
cl = TestFriedmanRafskyStat()
cl.test_linear_oned(100, 48, -0.527, 0.748)
cl.test_rep(100)
```

</details>

friedman_rafsky.py.tar.gz

sampan501 commented 1 year ago

Yes, that test method uses it to estimate a p-value.