kyonifer / koma

A scientific computing library for Kotlin. https://kyonifer.github.io/koma

Improvements to slicing #83

Closed peastman closed 5 years ago

peastman commented 6 years ago

I'd like to suggest several improvements to slicing.

Matrix lets you use any combination of Ints and IntRanges for slicing. It does this by providing all combinations of them in get() and set(): (Int, Int), (Int, IntRange), (IntRange, Int), and (IntRange, IntRange). That wouldn't be practical for NDArray, so instead it only has two versions. One takes vararg Int and returns an element, and the other takes vararg IntRange and returns an NDArray. That means you can't mix them. For example, if I want to extract a 1D slice from a 3D array I can't write

array[0, 0..10, 5]

Instead I have to write

array[0..0, 0..10, 5..5]

That's much less readable, and it creates more opportunities for errors. It also returns an array of the wrong shape. I wanted a 1D array, but instead it returns a 3D array of shape (1, 11, 1).

I suggest changing the signature from vararg IntRange to vararg Any so the user can mix and match Ints and IntRanges. That will mean relying on runtime type checking to throw an exception if the user passes a value of any other type, which isn't ideal. But I think it's a minor tradeoff compared to the much more important benefits.
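Sketched as a standalone helper (hypothetical names, not koma's actual API), the runtime check and the shape bookkeeping might look like this:

```kotlin
// Standalone sketch: normalize a mixed vararg of Int and IntRange indices
// into per-dimension ranges, recording which dimensions came from bare Ints
// so they can be dropped from the result's shape (making array[0, 0..10, 5]
// yield a 1D result instead of a (1, 11, 1) one).
fun resolveIndices(vararg indices: Any): Pair<List<IntRange>, List<Boolean>> {
    val ranges = ArrayList<IntRange>()
    val collapse = ArrayList<Boolean>() // true = drop this dimension from the output
    for (idx in indices) {
        when (idx) {
            is Int -> { ranges.add(idx..idx); collapse.add(true) }
            is IntRange -> { ranges.add(idx); collapse.add(false) }
            else -> throw IllegalArgumentException(
                "Index must be an Int or IntRange, got ${idx::class.simpleName}")
        }
    }
    return Pair(ranges, collapse)
}
```

Calling resolveIndices(0, 0..10, 5) would produce ranges [0..0, 0..10, 5..5] with collapse flags [true, false, true], letting the getter drop the collapsed dimensions from the output shape.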

Unlike Matrix, NDArray only supports slicing in get(), not set(), so you can't use it on the left side of an assignment. It should be supported in set() too.

Matrix allows you to use a negative number for the upper bound of a range, which it interprets relative to the width of the matrix. But it doesn't support it for the lower bound, so you can't write -5..-1 to get the last five elements. It also doesn't support negative values for specific indices passed as Ints, only for ranges. And NDArray doesn't support negative indices at all. I suggest allowing them everywhere in any method that takes an index.
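A sketch of the negative-index rule being proposed, as a standalone helper (hypothetical names, NumPy-style semantics where -1 means the last element):

```kotlin
// Resolve a possibly-negative index against a dimension of the given size.
fun resolveIndex(idx: Int, size: Int): Int {
    val resolved = if (idx < 0) idx + size else idx
    require(resolved in 0 until size) { "Index $idx out of bounds for size $size" }
    return resolved
}

// Both endpoints of a range resolve the same way, so -5..-1 against a
// dimension of size 10 becomes 5..9: the last five elements.
fun resolveRange(range: IntRange, size: Int): IntRange =
    resolveIndex(range.first, size)..resolveIndex(range.last, size)
```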

drmoose commented 6 years ago

I would imagine in typical usage, you're generally not going to want to write a literal slice in a 15-dimensional array. With that in mind, what about supplying overloads for a certain number of dimensions? Since there are two choices for each parameter, it'd have to be 2^(dimensions) overloads, which limits the extent to which we could do it (and we'd need a vararg version to catch the remaining cases), but it would have the nice side effect of letting us automatically upconvert to Matrix where appropriate, e.g.

operator fun NDArray<Double>.get(p1: Int, p2: IntRange, p3: IntRange, p4: Int): Matrix<Double>

would be a possible overload we could provide. This would also eliminate the possibility of getting imports wrong and accidentally getting an NDArray out when calling get on a Matrix.

The downside of course is all the extra bytecode. If we're doing 5 dimensions that's 63 fixed-parameter and 32 vararg Any overloads of the same function. There are probably performance implications, and successfully calling the right one from a foreign language (Java, JavaScript, C) would be an act of heroism, but it's still an idea to consider.

peastman commented 6 years ago

@kyonifer any thoughts on these suggestions? If you agree with them I'd like to try implementing them.

In addition to accepting Int and IntRange for each index, it would be good to also accept Iterable<Int>, just like kotlin.collections.slice(). That way you could specify a list of specific indices to take, not just a contiguous range.
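For reference, this is the kotlin.collections.slice behavior being pointed to; the proposal is to accept the same selector types per dimension:

```kotlin
val xs = listOf(10, 20, 30, 40, 50)
// A contiguous range selects a sub-list...
println(xs.slice(1..3))            // [20, 30, 40]
// ...while any Iterable<Int> selects arbitrary (even reordered) indices.
println(xs.slice(listOf(4, 0, 2))) // [50, 10, 30]
```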

kyonifer commented 6 years ago

array[0..0, 0..10, 5..5]

I definitely agree this is ugly and we need something better.

I suggest changing the signature from vararg IntRange to vararg Any so the user can mix and match Ints and IntRanges. That will mean relying on runtime type checking to throw an exception if the user passes a value of any other type, which isn't ideal. But I think it's a minor tradeoff compared to the much more important benefits.

As much as possible I'd like to retain type safety, as I think it's one of the primary strengths Kotlin has over scripting languages for numerical work. One annoyance with having vararg Any overloads is that it also affects Matrix, i.e. matrixOf(1,2,3)["a","b"] now compiles and errors at run-time. That's a shame, because as you noted, in the Matrix case there really isn't any issue with just having 4 overloads for IntRange and Int in 2 dimensions. It makes me rethink whether making NDArray the supertype of Matrix was worth the loss of clarity (it's already the case that matrixOf(1,2,3)[0,0,0,0,0,0] compiles, unfortunately).

With that in mind, what about supplying overloads for a certain number of dimensions? Since there are two choices for each parameter, it'd have to be 2^(dimensions) overloads, which limits the extent to which we could do it

It wouldn't be unprecedented (consider Kotlin's function overloads out to 23 parameters), so perhaps performance would be reasonable. We have more permutations though.

and we'd need a vararg version to catch the remaining cases

By this do you mean having overloads with varargs after 5 parameters? So for get/set with < 6 parameters it would be a compile-time error to use something other than an Int or IntRange, but if you have 6 or more params it would be a run-time error instead? If so I think that is a reasonable compromise between type safety and usability.

but it would have the nice side effect of letting us automatically upconvert to Matrix where appropriate

I really like this idea. There's little reason for Matrix to have its own get/set overloads now that NDArray is a supertype. If we can make the getter for 2 dimensions return a Matrix we could completely remove the Matrix-specific overloads.

Unlike Matrix, NDArray only supports slicing in get(), not set(), so you can't use it on the left side of an assignment. It should be supported in set() too.

But it doesn't support it for the lower bound, so you can't write -5..-1 to get the last five elements. It also doesn't support negative values for specific indices passed as Ints, only for ranges. And NDArray doesn't support negative indices at all. I suggest allowing them everywhere in any method that takes an index.

Completely agree with these.

kyonifer commented 6 years ago

To be clear, I'm envisioning something like this for the static dimensions = 2 case:

operator fun <T> NDArray<T>.get(i1: Int, i2: Int): T                                     
operator fun <T> NDArray<T>.get(i1: Int, i2: IntRange): Matrix<T>                        
operator fun <T> NDArray<T>.get(i1: IntRange, i2: Int): Matrix<T>                        
operator fun <T> NDArray<T>.get(i1: IntRange, i2: IntRange): Matrix<T>   

operator fun <T> NDArray<T>.get(i1: Int, i2: Int, vararg iN: Any): NDArray<T>            
operator fun <T> NDArray<T>.get(i1: Int, i2: IntRange, vararg iN: Any): NDArray<T>       
operator fun <T> NDArray<T>.get(i1: IntRange, i2: Int, vararg iN: Any): NDArray<T>       
operator fun <T> NDArray<T>.get(i1: IntRange, i2: IntRange, vararg iN: Any): NDArray<T>  

The obvious issue there is that all the vararg overloads have to return NDArray instead of Matrix because we don't know how many arguments were passed in. Increasing the number of statically checked dimensions to 5 as suggested would include overloads like:

operator fun <T> NDArray<T>.get(i1: IntRange, i2: Int, i3: Int, i4: IntRange, i5: Int): Matrix<T>

wherein we can return Matrix because we saw the user only sliced on 2 dimensions. Then we'd have vararg overloads for all the permutations of 5 static dimensions.

peastman commented 6 years ago

It wouldn't be unprecedented (consider kotlin's function overloads out to 23 parameters), so perhaps performance would be reasonable.

I think the only performance impact would be on compile time. It shouldn't make any difference to runtime speed.

What do you think about changing IntRange to Iterable<Int> in all of the above? That would give much more flexibility in how you can slice arrays. Internally it could still have special handling for IntRange (which implements Iterable<Int>).
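The special handling could be a simple fast path, sketched here on a flat IntArray (hypothetical helper, not koma code):

```kotlin
// Gather elements along one dimension: an IntRange maps to a contiguous
// copy, while a general Iterable<Int> falls back to element-wise gathering.
fun gather(src: IntArray, indices: Iterable<Int>): IntArray = when (indices) {
    is IntRange -> src.copyOfRange(indices.first, indices.last + 1)
    else -> indices.map { src[it] }.toIntArray()
}
```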

kyonifer commented 6 years ago

Yep, extension functions are resolved statically so the performance hit will be at compile time. At runtime the effect will be bytecode bloat. Both are acceptable tradeoffs to me.

What do you think about changing IntRange to Iterable<Int> in all of the above? That would give much more flexibility in how you can slice arrays. Internally it could still have special handling for IntRange (which implements Iterable<Int>).

Sounds good here.

peastman commented 6 years ago

By my count, we'll need 89 versions of each function. That includes all combinations of up to 5 Ints or Iterables (omitting the ones that are entirely Ints), plus the 32 versions that add vararg Any as a sixth argument, so 1+3+7+15+31+32. We'll need all of those for each of three functions (get, set to a constant, and set to an array). Multiply by seven data types, and you get a total of 1869 new functions this will add. That's likely to have a big impact on compilation time. Also consider the effect on the API documentation and on IDEs. (For example, you hit the "Parameter Info" command and the entire screen fills up with possible argument lists.)
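The count can be sanity-checked directly: for each arity k from 1 to 5 there are 2^k argument combinations minus the all-Int one, plus the 2^5 vararg-tailed variants:

```kotlin
// 2^k combinations of Int/Iterable<Int> per arity k, minus the all-Int
// combination (that overload returns a single element, not a slice).
val fixedArity = (1..5).sumOf { k -> (1 shl k) - 1 } // 1 + 3 + 7 + 15 + 31 = 57
// Every combination of the first five positions can also carry a
// vararg Any tail for the 6-or-more-dimension case.
val withVararg = 1 shl 5                             // 32
println(fixedArity + withVararg)                     // 89
println((fixedArity + withVararg) * 3 * 7)           // 3 functions x 7 types = 1869
```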

Are you sure that's ok?

kyonifer commented 6 years ago

Multiply by seven data types

Since slicing is an aggregate operation (a loop over all the elements you are copying/setting), we can probably avoid this: if we make these functions inline and reify T, we can dispatch via our reified-at-runtime when-block trick to specialized code for the actual looping.

Are you sure that's ok?

That does seem excessive yes. 5 may not be the magic number.

Another hybrid proposal that would further reduce the number of overloads we need:

We set the number of static dimensions to 2 (similar to my previous comment) and by doing so we secure type safety for matrix operations but not for larger N-D arrays. We then add overloads for just the permutations which have > 2 dimensions but only 1-2 slices so that they return Matrix. For example we could add the following to my previous list:

// 3 dim indexes with 1 or 2 slices in them
operator fun <T> NDArray<T>.get(i1: Int, i2: Int, i3: IntRange): Matrix<T>       
operator fun <T> NDArray<T>.get(i1: Int, i2: IntRange, i3: Int): Matrix<T>       
operator fun <T> NDArray<T>.get(i1: Int, i2: IntRange, i3: IntRange): Matrix<T>  
operator fun <T> NDArray<T>.get(i1: IntRange, i2: Int, i3: Int): Matrix<T>       
operator fun <T> NDArray<T>.get(i1: IntRange, i2: Int, i3: IntRange): Matrix<T>  
operator fun <T> NDArray<T>.get(i1: IntRange, i2: IntRange, i3: Int): Matrix<T>  

That should greatly reduce the number of methods that are generated: none of these overloads would have vararg variants. The resulting behavior would be:

val a = matrixOf(1,2,3)
val out1 = a[1..2, 1..5]    // Matrix<Int>
val out2 = a[1,1,1]         // NDArray<Int>
val out3 = a[1,1,1..5]      // Matrix<Int>
val out4 = a[1..4,1..4,5]   // Matrix<Int>
val out5 = a["a","b"]       // Compilation error
val out6 = a[1,1,"a"]       // NDArray<Int>, Runtime error 

Combined with reified-at-runtime when blocks to eliminate per-type overloads, the number should be a lot more manageable. The question then is how many dimensions deep we go with the "return Matrix<T> for 2 or fewer slices" overloads.

We could make the codegen to generate these overloads take in the number of dimensions as a parameter, so we aren't stuck with our decision. However, I don't think it makes sense to write any more sophisticated codegen yet since I'm planning to close out #76 in the near future. Right now I'm waiting to start #76 until #77 is unblocked, so I can upgrade the build system all at once.

Perhaps let's just start with the static dimensions=2 overloads. That will get us into "clearly better than what we have now" territory. We can open another issue to track adding some extensions that force returning a Matrix<T> when slicing <=2 dimensions inside the varargs, which could be completed after #76 is solved.

Does that sound reasonable to everyone?

peastman commented 6 years ago

That sounds reasonable. Is this a correct summary of what you're describing?

I do see some potential problems with returning either NDArray or Matrix depending on the arguments. First, it will have to be data type dependent, because Matrix supports fewer data types than NDArray. Suppose you write array[0..5, 0..5, 10]. If array is an NDArray<Double>, that should return a Matrix<Double>. But if it's an NDArray<Short>, it would need to return NDArray<Short>.

Another issue is that array[0..5, 0..5, 10] could be declared to return Matrix, but array[10, 0..5, 0..5] would be declared to return NDArray. The object it returned could still really be a Matrix, but you'd have to do a runtime cast to treat it as one.

Also, if the output only has one dimension, shouldn't it return a NDArray instead of a Matrix? Otherwise you force it to have a second dimension of size 1, which would be confusing and a potential source of bugs.

Given all these, I wonder if it wouldn't be more consistent and less confusing to always return NDArray. If the user really wants a Matrix they can call toMatrix() on it.

kyonifer commented 6 years ago

Is this a correct summary of what you're describing?

Yeah that would work for all the slicing (aggregate) overloads. We'd still need to retain the non-boxing versions for single element access, as those will often be called in tight loops.

Another issue is that array[0..5, 0..5, 10] could be declared to return Matrix, but array[10, 0..5, 0..5] would be declared to return NDArray. The object it returned could still really be a Matrix, but you'd have to do a runtime cast to treat it as one.

The idea of the extra overloads proposed here is that both of these would return Matrix<T> without a cast. In particular, array[10, 0..5, 0..5] would call

operator fun <T> NDArray<T>.get(i1: Int, i2: IntRange, i3: IntRange): Matrix<T> 

Also, if the output only has one dimension, shouldn't it return a NDArray instead of a Matrix? Otherwise you force it to have a second dimension of size 1, which would be confusing and a potential source of bugs.

Good point: there's a fundamental mismatch in the base case between NDArray and Matrix. Not sure what makes sense here. Matrix ops always want to return a (2D) matrix, but NDArrays want to have their dimension property set to how much data they actually have. I've always found things like numpy.squeeze to be annoying.

Given all these, I wonder if it wouldn't be more consistent and less confusing to always return NDArray. If the user really wants a Matrix they can call toMatrix() on it.

I'm wondering if we actually need Matrix and NDArray to share a common base class they both inherit from, rather than having Matrix be a subtype of NDArray. There are other issues beyond this one, such as the aforementioned fact that N-D indexing like someMatrix[1,2,3,4] should really be a compile-time error for Matrix. The downside of having Matrix and NDArray implemented with a common base is that there would be .toNDArray and .toMatrix everywhere, as there used to be in koma before we switched to the current "is-a" relationship.

What I wouldn't give for some union types and value generics in kotlin...

kyonifer commented 6 years ago

Not sure what makes sense here. Matrix ops always want to return a (2D) matrix, but NDArrays want to have their dimension property set to how much data they actually have.

Another way to go is to have Matrix keep track of its dimensionality and differentiate between 1D matrices and 2D matrices. Then there would be no issue with always returning a Matrix for 1D slicing.

peastman commented 6 years ago

This might be a bit too radical of a suggestion, but I'll offer it in the spirit of brainstorming. Is there really a need for separate Matrix and NDArray classes? I mean, the only significant difference in their APIs is that Matrix includes functions that are only meaningful for a 2D matrix. It wouldn't make sense to call SVD() on a 1D array. But if you're considering letting Matrix have only one dimension, you'll have to deal with that anyway (probably by throwing an exception).

The other difference is in how they're implemented internally, but that doesn't require a different front end class. The factory could just create a different class for 2D arrays than for arrays of other shapes.

What I wouldn't give for some union types and value generics in kotlin...

Agreed!

drmoose commented 6 years ago

Indexing is already a mess of weird edge cases:

// Runtime vs compile time error seems pretty arbitrary.
eye(3)[10]  // runtime error
eye(3)[5, 5]  // runtime error
eye(3)[0 until 0, 0 until 0]  // compiles just fine and returns eye(3), a result that seems
                              // completely insane until you read the source and figure out
                              // what on earth is going on.
eye(3)[1, 2, 3] // but this one is compile time

// Declared type already matters more than it should
eye(3)[1, 2, 3]                       // compile time
(eye(3) as NDArray<Double>)[1, 2, 3]  // runtime

eye(3)[4]                       // magically row-major
(eye(3) as NDArray<Double>)[4]  // runtime error

And, the fact that a Matrix currently inherits from NDArray is immensely useful for writing generic data-marshaling functions. If Matrix<T> and NDArray<T> were split up to share a common base class, call it NonIndexable<T>, then any library code that needs to index will end up with something horribly ugly, such as:

when (it) {
    is Matrix<*> -> it[2, 2]
    is NDArray<*> -> it[2, 2]
    else -> error("Woe is me; there is no such thing as a sealed interface.")
}

Admittedly, all I'm doing lately is writing drive-by comments while my code compiles because I haven't had time to contribute properly, but I'd really like to advocate having a single set of indexing rules that behave consistently.

How about this for a set of rules:

Is there really a need for separate Matrix and NDArray classes?

Or this. Making Matrix and NDArray into a single type would totally eliminate the risk of accidentally defining two incompatible APIs.

The fact that they're currently split was originally my fault. The argument went vaguely like this: having a separate matrix type lets third-party libraries declare Matrix<T> arguments rather than NDArray<T> to indicate to their caller that they're expecting math to work. The case I was worried about was some function in an opaque jar that does a bunch of data marshaling and then tries to math -- the user is presented with an exception message that doesn't seem to relate to their original input, and has a stack trace buried in incomprehensible disassembly. There was also a bit of static-language purism in there, refusing to admit that Kotlin can't do better than Python. I'm no longer totally convinced this isn't more trouble than it's worth.

As long as we're brainstorming, we could go even more ridiculously far down the "refuse to admit Kotlin can't do better than Python" rabbit hole and have Vector, Matrix, FourDArray, FiveDArray, etc up to however many dimensions we're willing to codegen, so more of these things could be checked at compile time. I'd still advocate having a single set of indexing rules and a base class that is itself as useful as possible.

kyonifer commented 6 years ago

Is there really a need for separate Matrix and NDArray classes?

It's certainly not an outrageous idea. We've had a few branches in the past where the types were unified (since deleted). Certainly many dynamically typed languages have a single N-D type. By my tally most of them do (NumPy, Julia, R, ...), with one notable language doing an NDArray/Matrix split (MATLAB). Static languages with more powerful templating have different types for every dimensionality (Eigen). Implementations in Java sometimes don't bother to be generic over the primitive type, and often don't bother to support more than two dimensions.

Kotlin is trying to fit in a "statically typed but simple and expressive" category where the normal patterns on either side don't directly apply. We can't do expression templates in Kotlin, and I'd argue making everything fully dynamic would be a waste of the language (might as well just use Python then!). Unfortunately it leads us to the tradeoff space that we've all illuminated here. I think (with some ugliness certainly) we've been able to stay typesafe with respect to the element type, so that's at least something over a fully dynamic solution. But the part that's still uncertain is dimensional typesafety.

The goal of the Matrix/NDArray split was to say "I have a 2D numerical container, therefore linear algebra operations make sense" in a type-safe way at compile time. 2D was chosen as a special case because there's a huge number of functions that only apply in 2D (linear algebra). We still aren't doing fine-grained dimension or size checking: as mentioned, SVD will still fail on a row vector, and the user is still offered the option to call svd on a row vector that is stuffed into a Matrix. We're just trying to lock down two types currently; we could easily go down the rabbit hole of e.g. only exposing decompositions on SquareMatrix, which would guarantee numRows == numCols. As both of you noted, the entire span of choices ranges from having separate types for every dimensionality (1D, 2D, 3D, ...) all the way to having one giant NDArray type that holds all dimensions and always fails at runtime.

Probably the latter approach would be the path of least resistance, as it's what most other libraries do. The former would give useful feedback to a user building a system, up until we hit something we can't make work inside Kotlin's type system and become inconsistent. My instinct is to make statically resolvable everything we can figure out how to get Kotlin's compiler to enforce, so I'm hesitant to give up on the Matrix type. But if everyone is resigned that we can't find a clean path through indexing, I can resign myself too. I'll probably want to take a stab or three at a solution before admitting defeat.

I'd really like to advocate having a single set of indexing rules that behave consistently.

Fair enough; one way or another we'll get these types unified for the sake of the user's sanity.

The other difference is in how they're implemented internally, but that doesn't require a different front end class. The factory could just create a different class for 2D arrays than for arrays of other shapes.

If I understand correctly, that would mean back-end implementation classes like EJMLMatrix that only understand 2D would still need to exist, just not be exposed to the user. That'd be a shame, as I think the user can benefit from knowing they've got a 2D container when they do. We just need to figure out how to make the UX consistent. Letting matrices be aware of their dimensionality (they already are, in that they know whether they're a vector or a matrix; they just don't behave as 1D in calls like shape), together with @drmoose's list of rules, is perhaps a starting point. If we can get Matrix to behave identically to NDArray when used as an NDArray, I think we may not have an unresolvable issue with unifying indexing.

peastman commented 6 years ago

By my tally most of them do (NumPy, Julia, R, ...), with one notable language doing an NDArray/Matrix split (MATLAB).

Originally NumPy also had separate matrix and ndarray classes. Then they decided that had been a mistake, deprecated matrix, and advised people to use ndarray for everything.

2D is definitely an important special case, but it's less of a special case than the current API implies. For example, matrix multiplication is currently only available for Matrix. NDArray is limited to elementwise products. But actually it's very common to perform the equivalent operation on tensors of other ranks. NumPy's einsum() function is an example of going to the extreme in letting you perform sums over arbitrary pairs of axes. Matrix.transpose() is just a special case of permuting axes, something that is also often done on arrays of other shapes.

I've worked with C++ codes that turned everything into template arguments. That lets you really do everything "correctly", including producing the right output types. For example, if you multiply a row vector by a column vector, the result should be a scalar. If you multiply a column vector by a row vector, the result should be a 2D matrix. And if you multiply a row vector by another row vector, it shouldn't compile. But to do that in Kotlin you would need separate classes for RowVector, ColumnVector, and Matrix. And even then you would still be stuck depending on runtime checking, because the compiler couldn't guarantee the vectors had the same length. So I think we just have to accept that the dimensionality of arrays won't always be known at compile time (including knowing whether or not they're 2D).

kyonifer commented 6 years ago

Originally NumPy also had separate matrix and ndarray classes. Then they decided that had been a mistake, deprecated matrix, and advised people to use ndarray for everything.

I'm aware of the matrix class but I thought that was just used to offer matrix-wise operations by default and the internals were shared with ndarray. I wasn't aware of duplicated internal structures, although they only target running on top of C and we're trying to be multiplatform. I didn't use Python back in the days of Numarray and Numeric, so I may be missing some history.

I've worked with C++ codes that turned everything into template arguments.

Yep, that's what I meant by Eigen's expression templates. They let you specify not just the dimensions but the lengths, and check everything at compile time. This is virtually impossible in Kotlin's more limited type system.

kyonifer commented 6 years ago

Another data point in favor of unifying the types: the distinction between a row vector and a column vector isn't clear if we treat a matrix as having one dimension. To me that distinction suggests a matrix needs to be 2D (i.e. 1x5 vs 5x1), and thus reducing matrices to 1D in the case of an NDArray slice is ambiguous: are we returning a row or a column vector? Just returning a 1D NDArray would make this case more consistent, as the user would be forced to reshape to a 2D container to treat it as a matrix (or we'd just allow 1D NDArrays to be used as a column or row vector depending on the operation and operands).

kyonifer commented 6 years ago

Upon consideration, I can't come up with a comprehensive resolution to this issue, as there are too many factors at play. I'm therefore going to suggest my usual approach to these situations: find the minimal delta that will sufficiently address the original issue, and then open a flurry of issues for the other good ideas that came up. In particular I propose closing this issue by:

We also probably need to either fix or open issues for the other good points brought up here, including but not limited to:

eye(3)[4] // magically row-major

While incredibly useful, 1D row-major indexing of 2D containers might be too magical. I think we need to open a new issue to discuss this. In particular, we could always force users to use something more explicit like someMat.linear[5].
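As a hypothetical sketch of what an explicit linear view could look like (names invented here, not koma API):

```kotlin
// A minimal 2D container with opt-in row-major flat indexing via a
// .linear view, instead of overloading 1D get on the matrix itself.
class LinearView(private val data: DoubleArray) {
    operator fun get(i: Int): Double = data[i]
}

class Mat2D(val rows: Int, val cols: Int) {
    private val data = DoubleArray(rows * cols)
    operator fun get(r: Int, c: Int): Double = data[r * cols + c]
    operator fun set(r: Int, c: Int, v: Double) { data[r * cols + c] = v }
    // m.linear[5] is explicit about row-major order; a bare m[5] would not be.
    val linear: LinearView get() = LinearView(data)
}
```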

Runtime vs compile time error seems pretty arbitrary.

This probably deserves an issue of its own to see how consistent we can get while maintaining the different semantics of Matrix and NDArray.

Matrix.transpose() is just a special case of permuting axes, something that also is often done on arrays of other shapes.

This definitely deserves an issue. I think we can fix up NDArray tensors while still maintaining the special cases like transpose for Matrix only as convenient syntax for linalg users.

But actually it's very common to perform the equivalent operation on tensors of other ranks. NumPy's einsum() function is an example of going to the extreme in letting you perform sums over arbitrary pairs of axes.

There are a lot of methods like this that need to be added and deserve issues (also related to #29).

peastman commented 6 years ago

That sounds like a good plan. I'll start on the changes to #90.

peastman commented 6 years ago

It turns out the following doesn't work:

  • For each data type, there will be a single internal function declared as get(vararg indices: Any): NDArray that provides the implementation.
  • For each combination of arguments, there will be a public inline function that checks the data type and calls the correct implementation.

A public inline function isn't allowed to call non-public functions. That kind of makes sense, since inline code is effectively part of the code that calls it. But that leaves me a bit stuck as to how to implement this. If it's not inline, it can only call the generic version, not the type specific one. I could make get() and set() into methods of NDArray, but then they would take precedence over the extension functions defined for Matrix.

peastman commented 5 years ago

Any thoughts about this? I haven't come up with any other solution. So I see two options. One is to have separate definitions for every set of arguments, for each function, for every data type. To statically type check the first two indices we need eight versions of each function, so a total of 168 functions. That's a bunch, though not nearly as bad as it would have been to check the first five indices. The other option is just to leave them as vararg Any and rely on runtime checking of the indices.

kyonifer commented 5 years ago

How about going with your two bullet points, but instead of an internal function called get, we make it public and rename it to something like getArbitrary? Since it's an implementation detail, it isn't intended to be used by the user directly anyway, and we can hide it along with getDouble, getFloat, et al. when we get around to doing that. (I actually have a plan for how to hide those in the very near future, when I can carve out some time.) For now we can just note it's not intended to be used directly, like we do with all the get$PRIMITIVE methods.
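Sketched standalone (all names here are hypothetical, not koma's), the pattern would look something like this; note that Kotlin's @PublishedApi annotation on an internal declaration is another route for letting a public inline function call it:

```kotlin
// Toy stand-in for per-primitive storage; koma's real types differ.
class Arr(val doubles: DoubleArray? = null, val floats: FloatArray? = null)

// Public but documented as an implementation detail, like the
// get$PRIMITIVE methods: one entry point per primitive storage kind.
fun getArbitraryDouble(a: Arr, i: Int): Double = a.doubles!![i]
fun getArbitraryFloat(a: Arr, i: Int): Float = a.floats!![i]

// The public inline overloads dispatch on the reified element type.
inline fun <reified T> Arr.at(i: Int): T = when (T::class) {
    Double::class -> getArbitraryDouble(this, i) as T
    Float::class -> getArbitraryFloat(this, i) as T
    else -> error("Unsupported element type ${T::class}")
}
```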