scipy / scipy

SciPy library main repository
https://scipy.org
BSD 3-Clause "New" or "Revised" License

Different data prerequisites for stats.bartlett and stats.levene #9252

Closed ichitaka closed 5 years ago

ichitaka commented 6 years ago

So while I am able to run the Bartlett test with two numpy arrays of shape (83, 10), I can't run the Levene test with the exact same arrays. I receive the following error when calling stats.levene:

File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scipy/stats/morestats.py", line 1944, in levene
    Yci[j] = func(args[j])
ValueError: setting an array element with a sequence.

Code to reproduce:

from scipy import stats
import numpy as np

test_a = np.random.beta(0.87, 0.77, (83, 10))
test_b = np.random.beta(0.87, 0.77, (83, 10))

print(stats.bartlett(test_a, test_b))
print(stats.levene(test_a, test_b))

akahard2dj commented 6 years ago

What version of scipy did you use?

ichitaka commented 6 years ago

Sorry, I forgot that. Current version, 1.1.0.

chrisb83 commented 6 years ago

Yes, it looks like levene can only handle 1-d inputs. Simplifying the code a bit, the error is due to axis=0 when defining func:

k = len(args)
Ni = zeros(k)
Yci = zeros(k, 'd')
func = lambda x: np.median(x, axis=0)

for j in range(k):
    Ni[j] = len(args[j]) 
    Yci[j] = func(args[j]) 

func(args[j]) has shape (10,) for test_a and test_b, hence it cannot be assigned to a scalar array element. axis=0 is used in other parts of the function as well.
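The mismatch described above can be reproduced in isolation (a minimal sketch using the shapes from the report; variable names are illustrative):

```python
import numpy as np

a = np.random.beta(0.87, 0.77, (83, 10))

# The column-wise median of an (83, 10) array has shape (10,), not a scalar.
med = np.median(a, axis=0)
print(med.shape)  # (10,)

# Assigning that vector to a single cell of a 1-d float array fails with
# the same ValueError raised inside stats.levene.
Yci = np.zeros(2)
try:
    Yci[0] = med
except ValueError as e:
    print(type(e).__name__)  # ValueError
```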

chrisb83 commented 6 years ago

The test is well explained in one of the references; that should help to fix it:

https://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm

chrisb83 commented 6 years ago

By the way, I doubt that the result of bartlett is correct. Shouldn't these give the same result?

stats.bartlett(test_a, test_b)
stats.bartlett(test_a.flatten(), test_b.flatten())

len(args[j]) gives 83 in the first case, but there are 830 elements. Or am I missing something?

josef-pkt commented 6 years ago

Shouldn't this give the same result?

No, it shouldn't. scipy.stats doesn't flatten anything unless axis=None.

I think in this case, the standard numpy broadcasting should apply. Each column of the first array is compared with the same column of the second array. The result should be vectorized across columns. If this is not the case, then the broadcasting in bartlett doesn't allow correctly for arrays with dimension greater than 1.

(you could check t-test to see whether and how it works there.)
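Following that suggestion, ttest_ind does vectorize along axis=0 by default, returning one statistic per column (a quick check; the exact values depend on the random data):

```python
import numpy as np
from scipy import stats

a = np.random.beta(0.87, 0.77, (83, 10))
b = np.random.beta(0.87, 0.77, (83, 10))

# ttest_ind compares each column of a with the same column of b
# (axis=0 default): one t statistic and one p-value per column.
t, p = stats.ttest_ind(a, b)
print(t.shape, p.shape)  # (10,) (10,)
```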

ichitaka commented 6 years ago

If I understand the Bartlett test correctly, it should either not allow my input or flatten it (in that case this should maybe be mentioned in the documentation?).

josef-pkt commented 6 years ago

AFAICS, this is more likely a bug, i.e. bartlett doesn't have consistent and correct behavior for 2-d inputs.

var is missing an axis keyword https://github.com/scipy/scipy/blob/8f81b305ab3562f295422a7a5fdc27eec0e5f776/scipy/stats/morestats.py#L1866
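The effect of the missing axis keyword is easy to see in isolation: without it, np.var flattens the array, while len only counts rows, so the two quantities in the formula refer to different things (an illustrative sketch):

```python
import numpy as np

a = np.random.beta(0.87, 0.77, (83, 10))

# Default axis=None: one variance over all 830 elements (0-d result).
print(np.var(a, ddof=1))

# With axis=0: one variance per column, shape (10,).
print(np.var(a, ddof=1, axis=0).shape)

# len counts rows only, while size counts all elements: 83 vs 830.
print(len(a), a.size)
```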

incomplete review of functions for "unexpected" inputs

If I understand the bartlett test correctly, it should either not allow my input or flatten it

In many cases it is useful to run a vectorized test on all columns at the same time, e.g. many mean comparisons in ttest_ind, so the default for vectorized operations in scipy.stats is axis=0, in contrast to numpy's default of axis=None with ravelling. (In stats we don't want to ravel heterogeneous columns of a "DataFrame": e.g., what is the variance if column 0 is price and column 1 is quantity purchased?)
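The two conventions can be seen side by side with stats.sem, one of the functions that already follows the scipy.stats axis=0 default (a small illustration, not taken from the thread):

```python
import numpy as np
from scipy import stats

x = np.arange(12.0).reshape(4, 3)

# numpy reductions default to axis=None: one value for all 12 elements.
print(np.mean(x))  # 5.5

# scipy.stats.sem defaults to axis=0: one standard error per column.
print(stats.sem(x).shape)  # (3,)
```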

chrisb83 commented 6 years ago

Given the current implementation, both functions only work for 1d input (bartlett returns a result for 2d input as well but it is not correct).

There are multiple options to fix this:

  1. Flatten (ravel) multi-dimensional inputs before running the test.
  2. Raise an error for inputs that are not 1-d.
  3. Vectorize the tests along axis=0 (e.g. via an axis keyword), as the t-test does.

While there is an axis keyword for the t-test, most other tests don't have it (KS, wilcoxon, shapiro, anderson_ksample), so there is no consistent treatment. In particular, anderson_ksample requires the k samples to be combined in a list instead of using *args.

josef-pkt commented 6 years ago

I think it's better to stick with axis=0 default in stats and raise otherwise.

I think many tests have most likely not been reviewed for multi-dimensional behavior.

To the items:

  1. I would never expect scipy.stats functions to ravel by default, because it goes against the pre-existing convention.
  2. I think this is the best short term solution. If 2-d with axis=0 is incorrect, then raise.
  3. This is the best long term solution, if the function naturally vectorizes. If the algorithm can only handle 1-D, then I don't see much point to include a python loop to artificially vectorize. (using cython or numba would be different)
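Until a vectorized or raising version lands, a possible user-side workaround in the spirit of option 3 is a plain Python loop applying the 1-d test per column (a sketch, with the per-call overhead noted above; not an official API):

```python
import numpy as np
from scipy import stats

test_a = np.random.beta(0.87, 0.77, (83, 10))
test_b = np.random.beta(0.87, 0.77, (83, 10))

# Run the 1-d Levene test on each column pair separately and collect
# the statistics and p-values into arrays of shape (10,).
results = [stats.levene(test_a[:, j], test_b[:, j])
           for j in range(test_a.shape[1])]
statistics = np.array([r.statistic for r in results])
pvalues = np.array([r.pvalue for r in results])
print(statistics.shape, pvalues.shape)  # (10,) (10,)
```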