rust-ndarray / ndarray-stats

Statistical routines for ndarray
https://docs.rs/ndarray-stats
Apache License 2.0

The assumed matrix layout for correlation is unintuitive #81

Open multimeric opened 3 years ago

multimeric commented 3 years ago

The docs for the correlation methods say:

Let (r, o) be the shape of M:

  • r is the number of random variables;
  • o is the number of observations we have collected for each random variable.

What this implicitly says is that "M should be a matrix with r rows, corresponding to random variables, and o columns, corresponding to observations". We know this because ndarray has an explicit definition of rows and columns: the first axis is the row axis and the second axis is the column axis. For example, see the nrows and ncols functions, as in the sketch below.
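
A minimal sketch, just to illustrate the axis naming:

use ndarray::array;

fn main() {
    // First axis = rows, second axis = columns.
    let m = array![[1., 2., 3.], [4., 5., 6.]];
    assert_eq!(m.nrows(), 2);
    assert_eq!(m.ncols(), 3);
}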

However, I find this assumption counter-intuitive. The convention in my experience is the "tidy" layout, in which each row corresponds to an observation and each column corresponds to a variable. I refer here to Hadley Wickham's work on tidy data and the figure he uses to illustrate it (e.g. here).

Also this is how R works:

> mat
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8
> nrow(mat)
[1] 4
> ncol(mat)
[1] 2
> cov(mat)
         [,1]     [,2]
[1,] 1.666667 1.666667
[2,] 1.666667 1.666667

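To make the mismatch concrete on the ndarray side, here is a small sketch with the same data, assuming ndarray-stats' current CorrelationExt::cov (checking shapes only, to keep it short):

use ndarray::array;
use ndarray_stats::CorrelationExt;

fn main() {
    // Same data as the R example above, in the "tidy" layout:
    // 4 observations (rows) of 2 variables (columns).
    let mat = array![[1., 5.], [2., 6.], [3., 7.], [4., 8.]];

    // Under the current rows-as-variables assumption this is a 4x4 matrix,
    // not the 2x2 covariance matrix that cov(mat) returns in R.
    let not_what_i_wanted = mat.cov(1.).unwrap();
    assert_eq!(not_what_i_wanted.dim(), (4, 4));

    // Transposing first recovers the intended 2x2 result.
    let what_i_wanted = mat.t().cov(1.).unwrap();
    assert_eq!(what_i_wanted.dim(), (2, 2));
}
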
Thirdly, in terms of the Rust data science ecosystem, note that polars (as far as I know the best-supported data frame library in Rust) produces matrices with the same assumption. If you create a DataFrame with 2 series (which correspond to variables) and 3 rows and call .to_ndarray(), you get a (3, 2) ndarray. Calling .cov() on it then gives you something that is not the covariance matrix you are after.

One argument in defence of the current method is numpy.cov, which makes the same assumption; its documentation describes the input as:

A 1-D or 2-D array containing multiple variables and observations. Each row of m represents a variable, and each column a single observation of all those variables.

My suggestion is therefore to consider reversing the assumed dimensions for these methods in the next major (breaking) release. I realise that calling .t() is not difficult, but forgetting to do so produces a matrix that is still valid and may flow into downstream code without the user realising it is not the correct covariance matrix. This happened to me, and I'd like to spare other users from the same issue.

jturner314 commented 3 years ago

I've been somewhat unsatisfied with this for a while, too. It's designed after numpy.cov, but, like you, I'm not convinced that NumPy's default is the best option. I don't think we should just swap the axes, though, because that could break existing code in a hard-to-detect way and would be confusing for users coming from NumPy. IMO, we should add an axis parameter, which would be similar to the axis parameter of mean_axis:

/// Returns the covariance matrix for a 2-dimensional array of observations.
///
/// `axis` specifies the axis of the array to sum over. In other words, if `axis` is `Axis(0)`,
/// then each row is a single observation; if `axis` is `Axis(1)`, then each column
/// is a single observation. This is analogous to the `axis` parameter of [`ArrayBase::mean_axis`].
fn cov(&self, axis: Axis, ddof: A) -> Result<Array2<A>, EmptyInput>
where
    A: Float + FromPrimitive,
{}
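
Call sites might then look something like this (nothing is implemented; this is just a sketch, where `data` is any 2-D array of floats):

// Tidy layout: rows are observations, so pass Axis(0).
let cov_tidy = data.cov(Axis(0), 1.)?;

// NumPy-style layout: columns are observations, so pass Axis(1).
let cov_rows_as_variables = data.cov(Axis(1), 1.)?;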

This would also generalize well to inputs with more than 2 dimensions in the future.

What do you think?

multimeric commented 3 years ago

On further thought, it seems that languages choose their cov() API based on row- vs column-major ordering. R and Julia are column-major, so it's more efficient to apply variable-wise operations down the second axis (columns as variables), while NumPy (and Rust's ndarray) is row-major by default, which encourages you to store your variables in rows and apply operations along the first axis. I haven't thought about this a whole lot, but I believe the matrix multiplication needed to calculate cov() benefits from contiguous access, so we are either incentivised to encourage rows-as-variables, or ideally (though with much higher friction) to encourage users to use column-major order (F order in ndarray) and then treat columns as variables. The latter option is ideal both for performance (contiguous column access) and for the API from a statistician's perspective, because it's more intuitive (see my original post), but unfortunately it would be a breaking change.
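
For reference, ndarray can already allocate column-major arrays, so the F-order route is at least expressible today (a minimal sketch):

use ndarray::{Array2, ShapeBuilder};

fn main() {
    // Allocate in column-major (Fortran) order so each variable's column is contiguous.
    let data: Array2<f64> = Array2::zeros((1_000, 3).f());

    // The transposed view is then in standard (row-major) layout,
    // i.e. each variable is a contiguous row of the view.
    assert!(data.t().is_standard_layout());
}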

However, if you aren't willing to change the API in this way (which, like I said, could wait until the next major release since it would be a breaking change), then we have two options. The first is to add an axis parameter as you suggested (though ideally its type should be Into<Option<Axis>>, defaulting to Axis(1) I think, to make it user friendly). There is an argument against this, though: it introduces a new option when there are already ways to do the same thing, which wouldn't be considered Pythonic in the Python world. The second option is to handle it in the docs: either a) state that we assume rows-are-variables everywhere and encourage users to follow that convention (maybe even PR polars to produce ndarrays in this form), or b) encourage users to use mat.t().cov() as the canonical way to do this, which I assume is not ideal from a performance perspective.
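
For concreteness, the Into<Option<Axis>> variant I have in mind would be shaped roughly like this (a hypothetical sketch, not a worked-out signature):

fn cov<I>(&self, axis: I, ddof: A) -> Result<Array2<A>, EmptyInput>
where
    I: Into<Option<Axis>>,
{
    // `None` falls back to a default, e.g. the current columns-as-observations behaviour.
    let axis = axis.into().unwrap_or(Axis(1));
    // ...
}

That way both data.cov(None, 1.) and data.cov(Axis(0), 1.) would work at the call site.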

Thoughts?