Closed drmoose closed 5 years ago
Are you interested in code contributions from other people? If so, I'd like to try implementing this.
I'm definitely interested in contributions! Feel free to open an MR.
The current map is defined here. If it's easier for you, you can just fix the map extension function directly in that file and I'll deal with updating the codegen (that file is actually generated from templates in order to optimize boxing for primitives).
Great! I'll see whether I can figure out how the codegen works. Are there any examples of existing functions that depend on two type parameters?
The changes to the individual functions are quite simple. For example,
inline fun NDArray<Double>.map(f: (Double) -> Double)
= NDArray.doubleFactory.zeros(*shape().toIntArray()).fillLinear { f(this.getDouble(it)) }
gets changed to
inline fun <T> NDArray<Double>.map(crossinline f: (Double) -> T)
= NDArray(*shape().toIntArray(), filler={ f(this.getDouble(*it)) } )
Now I just need to figure out how that gets generated.
Ok, I think I've mostly figured it out. How do you want me to handle the type parameters? Currently you use the macro ${genDec}, which gets defined to either "" or "<T>". In this case it needs to be either "<R>" or "<T,R>". I could modify the build script to define another macro ${genDecR}, if you think that seems right.
The map functions defined on each specific primitive type are designed to avoid boxing elements. For example, in your quoted function above, NDArray<Double>.map(f: (Double) -> Double), we implement the mapping without converting the primitive doubles into boxed Double classes at any point. Avoiding boxing is also why these are implemented as inline extension functions. In order to avoid the performance penalty of boxing each element in the array, we'll have to leave those as Double -> Double mappings.
The generic map function is a catchall that maps an arbitrary generic T to another arbitrary T, and is designed to work with any NDArray, including those holding non-primitive types. The generic map will out of necessity box every element, because we don't know what the element type is. However, it currently maps T to T instead of T to R. I think the goal of this issue is to overhaul the above-linked generic map, and hopefully leave the primitive->primitive optimized maps as they are.
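[Editor's note: to make the T-to-R goal concrete, here is a minimal plain-Java sketch of a map that changes element type. The names are illustrative only and none of this is Koma's API; the per-element boxed call is exactly the cost being weighed in this thread.]

```java
import java.util.Arrays;
import java.util.function.Function;

public class GenericMapSketch {
    // Map an array of T to an array of R instead of T to T.
    // Every element passes through a generic (boxed) Function call.
    static <T, R> Object[] map(T[] src, Function<T, R> fn) {
        Object[] out = new Object[src.length];
        for (int i = 0; i < src.length; i++) {
            out[i] = fn.apply(src[i]); // boxed call per element
        }
        return out;
    }

    public static void main(String[] args) {
        Double[] xs = {1.25, 2.5, 3.75};
        // A Double -> Integer mapping, which a (T) -> T signature cannot express:
        Object[] ys = map(xs, d -> (int) Math.floor(d));
        System.out.println(Arrays.toString(ys)); // prints [1, 2, 3]
    }
}
```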
Where would the values get boxed? Suppose, for example, that f maps Double to Float. For each element it calls this.getDouble(), which returns a Double. That gets passed straight to f with no need for boxing. f returns a Float, which again shouldn't need to be boxed.
I do see one way it could be optimized, though. NDArray.invoke() expects filler to have type (IntArray) -> T. It passes that to fill(), which has to create an IntArray for every element. That could be avoided by having a second version of invoke() that expects filler to be (Int) -> T, and passes it to fillLinear().
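[Editor's note: a rough plain-Java analog of the difference being described, with hypothetical helper names rather than Koma's API. The nd-index path allocates a fresh int[] for every element, while the linear path hands the filler a flat index directly.]

```java
import java.util.Arrays;
import java.util.function.Function;
import java.util.function.IntToDoubleFunction;

public class FillDemo {
    // Analog of fill(): convert each linear position to an nd-index first,
    // allocating one int[] per element.
    static double[] fillViaNdIndex(int rows, int cols, Function<int[], Double> filler) {
        double[] out = new double[rows * cols];
        for (int i = 0; i < out.length; i++) {
            int[] nd = {i / cols, i % cols}; // fresh allocation on every iteration
            out[i] = filler.apply(nd);
        }
        return out;
    }

    // Analog of fillLinear(): pass the flat index straight through, no allocation.
    static double[] fillLinear(int size, IntToDoubleFunction filler) {
        double[] out = new double[size];
        for (int i = 0; i < size; i++) {
            out[i] = filler.applyAsDouble(i);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] a = fillViaNdIndex(2, 3, nd -> (double) (nd[0] * 3 + nd[1]));
        double[] b = fillLinear(6, i -> (double) i);
        System.out.println(Arrays.equals(a, b)); // prints true
    }
}
```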
The JVM implements generics using type erasure, which means e.g. List<T> is really just List<*> but with some compile-time hinting about the type of the objects in the list. So a generic fun <T, R> NDArray.map(fn: (T) -> R): NDArray<R>, when going from Double to Float as in your example, has to pass the Doubles into fn as instances of java.lang.Double, the boxed object type, and the function has to return java.lang.Float (boxed) rather than the primitives double and float, because the platform signature is actually (Any) -> Any (or, in Java-speak, Object to Object).
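[Editor's note: the erasure point can be observed directly with reflection in plain Java: the only method the JVM actually sees on a Function<Double, Float> is apply(Object) -> Object, so each call must box its argument and result.]

```java
import java.lang.reflect.Method;
import java.util.function.Function;

public class ErasureDemo {
    public static void main(String[] args) throws Exception {
        // The erased signature of Function.apply: Object in, Object out.
        Method apply = Function.class.getMethod("apply", Object.class);
        System.out.println(apply.getReturnType().getName());        // prints java.lang.Object
        System.out.println(apply.getParameterTypes()[0].getName()); // prints java.lang.Object

        Function<Double, Float> toFloat = d -> d.floatValue();
        Object result = toFloat.apply(0.5); // 0.5 is autoboxed to java.lang.Double here
        System.out.println(result.getClass().getName()); // prints java.lang.Float
    }
}
```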
Similarly, I'm pretty sure NDArray.invoke(fn: (IntArray) -> T) and NDArray.invoke(fn: (Int) -> T) can't simultaneously exist, because the JVM signatures of the two would both be (Any) -> Any.
Because of type erasure, and because Kotlin hides the difference between double and java.lang.Double with layers of syntax sugar and magic, figuring out how to avoid boxing where possible while still achieving this API is annoyingly hard. In particular, I'm not even completely sure I know a way to make a <T, R> version of map without causing platform signature clashes with the codegen'd primitive versions (double -> double and so on). That's why I filed the issue rather than just fixing it -- I didn't have (and still don't have) time to do the experimentation and sort it out enough to make a coherent PR.
For some of my previous PRs, I've had to disassemble compiled .class files to Java to make sure I wasn't boxing by mistake.
Let me look into this. Possibly I'm thinking about this wrong based on too many years of doing C++. Note that the signature fun <T, R> NDArray.map(fn: (T) -> R): NDArray<R> is what gets used for the generic version, not for primitive types. Those use signatures like fun <T> NDArray<Double>.map(crossinline f: (Double) -> T).
If you're concerned that any type parameters at all will lead to boxing, we could create explicit versions for all combinations, e.g. fun NDArray<Double>.map(crossinline f: (Double) -> Float). I'm pretty sure that should be completely safe.
Except if we do that, the compiler considers calls to map() to be ambiguous. :( So maybe you're right -- we'll have to do this only for the generic version, and accept that mappings between two different types have to pay the cost of boxing. But I'm going to ponder this a bit more. There really ought to be a way to do it efficiently for mappings between two primitive types.
Possibly I'm thinking about this wrong based on too many years of doing C++
Indeed, C++ templates are based on specialization, and behave very differently than Java's type erasure.
If you're concerned that any type parameters at all will lead to boxing
As a general rule of thumb, if you use T and therefore don't know the type at compile time, that argument will get erased to Object, as @drmoose explained. This is regrettably true even for Kotlin inline functions, as they don't get specialized to each call site but are instead generated once at compile time and inserted unmodified into each call site (there's a Kotlin YouTrack ticket discussing the pros and cons of this approach). Hopefully when Project Valhalla is released in a future JVM we'll get proper primitive generic specialization and all of this nonsense can be put behind us. Some day...
There really ought to be a way to do it efficiently for mappings between two primitive types.
There certainly are viable approaches to add cross-primitive mappings, the most obvious and ugly one being to use different platform names than map for the cross-primitive versions of the function. However we do it, these would be in addition to the current generic map and the specialized Double->Double, Int->Int, etc. maps. I'd therefore propose our path forward to be:
1) Fix the generic map as suggested in the original issue description, yielding a T -> R map with boxing, and leaving the current same-primitive maps intact. This should be enough to close out this issue.
2) If there's enough interest and use cases to warrant it, let's open a new issue to discuss the best solution for the cross-primitive maps you're proposing. I'm guessing there'll be some intrigue there on how to optimize for each platform too, as Kotlin/JS behaves differently than Kotlin/JVM (there's only one true numerical type in JS), and Kotlin/Native specialization is still in flux.
3) Since this project is trying to implement an efficient yet generic multiplatform numerical library, it does some things that seem curious at first glance, such as these codegen'd inline extension functions. I probably need to open an issue to track writing a FAQ that explains the tradeoffs of the current approach, as Koma has no documentation on that currently.
This is even true for Kotlin inline functions regrettably
That's too bad. I was hoping I could do something with inlines, but hadn't researched them yet to find out how they get implemented.
I spent a lot of years programming in Java, but for the last ten years I've mostly been doing C++ and Python (plus CUDA and OpenCL). I just recently started getting into Kotlin in the hope it might one day become a viable alternative to Python for scientific computing. Hence my interest in Koma, which looks like a really excellent piece of software.
I haven't come up with anything really satisfactory. One option is to only change the generic version, while leaving all the others unchanged. But that would mean you could only call it on generic arrays, not on the primitive-specialized ones. To use the original example of mapping Ints to Doubles, you couldn't call it on an array of primitive Ints, only on a generic array containing boxed Ints. That doesn't seem like something people should be encouraged to create.
An alternative would be to create a new method with a new name. That could be implemented as an ordinary method on the NDArray class. It would do boxing as part of the mapping process, but at least it could be called on arrays of primitives.
Also, it's possible to do this in a reasonably clean way with the methods that already exist.
val doubles = NDArray(3, 4, 5, filler={ ints.getDouble(*it) })
That's not quite as nice as just calling map(), but it's not so bad either. If you make it a little less concise, you can even (I think) do it without any boxing.
val doubles = NDArray.doubleFactory.create(3, 4, 5, filler={ ints.getDouble(*it) })
How about making the generic version into <T, reified R> and pulling the same ugly trick I did to make NDArray(3, 4, 5) { 10.0 } use the optimized backends based on the return type of the filler function? Would that work here too?
That's basically what my original version did: it called that function. But according to what @kyonifer said, that would still involve boxing, since filler would get implemented internally as a (IntArray) -> Any. So you could call it on specialized array types, and it would produce the correct specialized array type as output, but the execution of the map() function itself would be less efficient.
I think the assumption that boxing is slow may be incorrect. The JVM performs scalar replacement. If it can prove that the boxed number doesn't escape from the local function, it will optimize it away and implement it as a primitive. (See https://shipilev.net/jvm-anatomy-park/18-scalar-replacement/.) So I wrote a benchmark to test how fast the map function really is.
fun doMapping(x: NDArray<Double>): Double {
    val y = x.map { 0.5*it }
    return y[0, 0]
}

val x = NDArray.doubleFactory.randn(1000, 1000)
var sum = 0.0
// Warm up the JIT
for (i in 1..10)
    sum += doMapping(x)
// Time map function.
val time = kotlin.system.measureTimeMillis {
    for (i in 1..10)
        sum += doMapping(x)
}
println(time)
println(sum) // Avoid dead code elimination
I first ran it with the existing implementation of map(). Over five runs, the mean time was 1957 ms (range 1932 to 2036). I then replaced the implementation for Double arrays with this:
inline fun <reified T> NDArray<Double>.map(crossinline f: (Double) -> T) =
    when (T::class) {
        Double::class -> NDArray.doubleFactory.zeros(*shape().toIntArray()).fillLinear { f(this.getDouble(it)) as Double }
        Float::class -> NDArray.floatFactory.zeros(*shape().toIntArray()).fillLinear { f(this.getDouble(it)) as Float }
        Long::class -> NDArray.longFactory.zeros(*shape().toIntArray()).fillLinear { f(this.getDouble(it)) as Long }
        Int::class -> NDArray.intFactory.zeros(*shape().toIntArray()).fillLinear { f(this.getDouble(it)) as Int }
        Short::class -> NDArray.shortFactory.zeros(*shape().toIntArray()).fillLinear { f(this.getDouble(it)) as Short }
        Byte::class -> NDArray.byteFactory.zeros(*shape().toIntArray()).fillLinear { f(this.getDouble(it)) as Byte }
        else -> NDArray.createGeneric(*shape().toIntArray()) { f(this.getDouble(*it)) }
    } as NDArray<T>
Again running the benchmark five times, the mean time was 1951 ms (range was 1815 to 2056). Basically identical to the original version, even though in theory the numbers are getting boxed.
I'm confused. Both of your examples avoid boxing -- your second example uses crossinline and some runtime type checking, the same way NDArray.invoke does. Using a renamed copy of your .map function, the following Kotlin code:
fun testPeastmanMap(input: NDArray<Double>) = input.peastmanMap {
it.toInt() + 15
}
disassembles to
public static final NDArray testPeastmanMap(@NotNull NDArray input) {
Intrinsics.checkParameterIsNotNull(input, "input");
NDArray $receiver$iv = input;
KClass var2 = Reflection.getOrCreateKotlinClass(Integer.class);
NDArray $receiver$iv$iv;
NDArray $receiver$iv$iv;
int idx$iv$iv;
int var7;
NumericalNDArrayFactory var10000;
int[] var10001;
NDArray var28;
if (Intrinsics.areEqual(var2, Reflection.getOrCreateKotlinClass(Double.TYPE))) {
// unreachable code
} else {
double it;
if (Intrinsics.areEqual(var2, Reflection.getOrCreateKotlinClass(Float.TYPE))) {
// unreachable code
} else if (Intrinsics.areEqual(var2, Reflection.getOrCreateKotlinClass(Long.TYPE))) {
// unreachable code
} else if (Intrinsics.areEqual(var2, Reflection.getOrCreateKotlinClass(Integer.TYPE))) {
var10000 = NDArray.Companion.getIntFactory();
var10001 = CollectionsKt.toIntArray((Collection)input.shape());
$receiver$iv$iv = var10000.zeros(Arrays.copyOf(var10001, var10001.length));
$receiver$iv$iv = $receiver$iv$iv;
idx$iv$iv = 0;
for(var7 = $receiver$iv$iv.getSize(); idx$iv$iv < var7; ++idx$iv$iv) {
it = $receiver$iv.getDouble(idx$iv$iv);
int var25 = (int)it + 15; // <--- Here's the crossinline'd fn. No boxing.
$receiver$iv$iv.setInt(idx$iv$iv, var25);
}
var28 = $receiver$iv$iv;
} else if (Intrinsics.areEqual(var2, Reflection.getOrCreateKotlinClass(Short.TYPE))) {
// unreachable code
} else if (Intrinsics.areEqual(var2, Reflection.getOrCreateKotlinClass(Byte.TYPE))) {
// unreachable code
} else {
// unreachable code
}
}
return var28;
}
If you wanted to benchmark a version with boxing, you'd need to break the reified-at-runtime trick with something that forces the use of Java generics, like this:
inline fun <reified T> NDArray<Double>.badMap(noinline f: (Double) -> T) =
NDArray(*shape().toIntArray()) { f(getDouble(*it)) }
It just occurred to me we may all be overthinking this -- if T and R were both reified, we could just nest two instances of the reified-at-runtime trick. There'd be a boatload of unreachable invalid bytecode, but the JVM's branch prediction is smart enough that there'd be practically no runtime cost.
I may be the source of confusion, as I was too brief in my above comment. Kotlin does perform an optimization for lambdas that are passed into inline functions, which is something we exploit in NDArray.invoke and elsewhere to get around the boxing costs of passing in generic lambdas. This optimization inlines the lambda body into the caller and, if the caller is passing in primitives, avoids boxing; thus it will avoid creating a Function object at all in such a scenario. You can see that optimization at work in @drmoose's disassembled code, which produces no Function object and performs no boxing.
What I was referring to before is the fact that inline functions won't specialize on insertion, i.e. Kotlin's compiler won't create a type-specific version of the function for each value of T. This is the reason we have to do the when block dispatch, as otherwise it'll do generic (boxed) invocation.
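[Editor's note: a plain-Java caricature of that when-block dispatch, with illustrative names rather than Koma's API: the requested result type is tested once up front, and each branch runs a primitive-specialized loop, so no per-element call ever boxes.]

```java
import java.util.function.DoubleUnaryOperator;

public class DispatchDemo {
    // Test the result type once, then dispatch to a primitive-specialized loop
    // instead of invoking a generic (boxed) function per element.
    static Object mapDoubleArray(double[] src, Class<?> resultType, DoubleUnaryOperator fn) {
        if (resultType == double.class) {
            double[] out = new double[src.length];
            for (int i = 0; i < src.length; i++) out[i] = fn.applyAsDouble(src[i]); // no boxing
            return out;
        }
        if (resultType == int.class) {
            int[] out = new int[src.length];
            for (int i = 0; i < src.length; i++) out[i] = (int) fn.applyAsDouble(src[i]);
            return out;
        }
        throw new UnsupportedOperationException("generic fallback elided in this sketch");
    }

    public static void main(String[] args) {
        double[] xs = {1.5, 2.5};
        double[] doubled = (double[]) mapDoubleArray(xs, double.class, d -> 2 * d);
        System.out.println(doubled[0] + " " + doubled[1]); // prints 3.0 5.0
    }
}
```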
Some reading on the topic: https://medium.com/@BladeCoder/exploring-kotlins-hidden-costs-part-1-fbb9935d9b62
WRT this issue, I think using the ugly NDArray.invoke when-block dispatch hack for each primitive is a reasonable workaround for now. I'm still hopeful that Valhalla will complete some day, so these solutions will only be temporary.
we could just nest two instances of the reified-at-runtime trick
...aaand I'm wrong. getDouble(*it) as T is a boxing cast even when T is reified.
No wait, I'm wrong about being wrong. getDouble(*it) as T inside a when (T::class) branch isn't a boxing cast. I was reading one of the unreachable branches in the disassembly by mistake, so having a when(T::class) where each of its branches is a when(R::class) may be a viable approach, albeit a horribly ugly one.
For posterity, here are the test functions I used to investigate, which I think are perhaps partway down the right road. Any real solution would want to avoid the NDArray.invoke calls I have here in favor of fillLinear, because as @peastman pointed out, invoke allocates bunches of IntArrays we don't need. I'd flesh this out more, but at some point I need to go back to doing my actual job.
inline fun <reified R> NDArray<Double>.mapDouble(crossinline f: (Double) -> R) =
    NDArray(*shape().toIntArray()) { f(getDouble(*it)) } as NDArray<R>

inline fun <reified T, reified R> NDArray<T>.bruteForceMap(crossinline f: (T) -> R) =
    when (T::class) {
        Double::class -> (this as NDArray<Double>).mapDouble { f(it as T) }
        else -> NDArray(*shape().toIntArray()) { f(get(*it)) }
    } as NDArray<R>

fun testBruteForceMap(input: NDArray<Double>) = input.bruteForceMap { it.toInt() + 15 }
fun testBruteForceMap(input: NDArray<Double>) = input.bruteForceMap { it.toInt() + 15 }
Now that I've had a moment to sit down and play with some code, I'm thinking the when block isn't necessary at all. Starting with your original proposal, @peastman:
inline fun <reified T> NDArray<Double>.testMap(crossinline f: (Double) -> T)
= NDArray(*shape().toIntArray(), filler={ f(this.getDouble(*it)) } )
Then adding a generic version:
fun <T, R> NDArray<T>.testMap(f: (T) -> R)
= NDArray.createGeneric<R>(*shape().toIntArray(), filler = { f(this.getGeneric(*it)) })
All of these calls should be okay:
val a = NDArray.doubleFactory.zeros(1, 2)
val outa1 = a.testMap { it+144441 }
val outa2 = a.testMap { "greetings" }
val b = NDArray.createGeneric<String>(5,5) { "hi" }
val outb = b.testMap { "bye" }
Moreover, the boxing I thought would occur on inline fun <T> NDArray<Double>.map(crossinline f: (Double) -> T) is apparently optimized out by the compiler:
if (Intrinsics.areEqual(var5, Reflection.getOrCreateKotlinClass(Double.TYPE))) {
$receiver$iv$iv$iv = this_$iv$iv.getDoubleFactory().zeros(Arrays.copyOf(dims$iv$iv, dims$iv$iv.length));
$receiver$iv$iv$iv = $receiver$iv$iv$iv;
var9 = $receiver$iv$iv$iv.iterateIndices().iterator();
while(var9.hasNext()) {
var10 = (IndexIterator)var9.next();
nd$iv$iv$iv = var10.component1();
linear$iv$iv$iv = var10.component2();
it = $receiver$iv.getDouble(Arrays.copyOf(nd$iv$iv$iv, nd$iv$iv$iv.length));
double var16 = it + (double)144441;
$receiver$iv$iv$iv.setDouble(linear$iv$iv$iv, var16);
}
var29 = $receiver$iv$iv$iv;
} else if (Intrinsics.areEqual(var5, Reflection.getOrCreateKotlinClass(Float.TYPE))) {
...
So it looks like I was wrong about the boxing that would occur for Double->T inline functions. I was reasonably certain the compiler wouldn't optimize that away, but that seems not to be the case. Sorry for leading all of us down a rabbit hole for no reason. I think going with the original proposal might work.
How do you want me to handle the type parameters? Currently you use the macro ${genDec} which gets defined to either "" or "<T>". In this case it needs to be either "<R>" or "<T,R>". I could modify the build script to define another macro ${genDecR}, if you think that seems right.
That sounds reasonable to me.
Note that there are two different compilers involved. There's the Kotlin compiler, which converts source to Java bytecode. And then there's Hotspot, which is the JIT that converts bytecode to native machine code. Scalar replacement is an optimization done by Hotspot. If you just disassemble the class files, you might still see them working with boxed integers. But if you disassemble the native machine code, you'll find that it's optimized them away.
Anyway, it's good to know the Kotlin compiler is also smarter than I thought. I was afraid the crossinline meant it would have to implement f as an actual function that returned Any. I'll go ahead and try implementing it this way.
Note that there are two different compilers involved.
Indeed there are. Whenever possible I tend not to rely on HotSpot optimizations saving us, though: Koma also builds on Native and JavaScript, which may not be as smart; there are other JVM implementations in the wild which may not be as optimized; and HotSpot may fail to detect the optimization in complicated situations. My hope is that by expressing things as optimized Kotlin in the first place, we'll be in a better position to get good performance everywhere.
That being said, when running on OpenJDK it is useful to run with -XX:+PrintAssembly enabled to see what asm the JIT is generating, but I mostly use it as a debugging tool when things are slower than I expect.
inline fun <reified T> NDArray<Double>.testMap(crossinline f: (Double) -> T)
= NDArray(*shape().toIntArray(), filler={ f(this.getDouble(*it)) } )
This version turns out to be about 50% slower in my test. The reason is that NDArray.invoke() uses fill() instead of fillLinear(). I'd like to create an alternate version where filler is an (Int) -> T instead of an (IntArray) -> T, so it can use the more efficient fillLinear(). But that would break type inference when filler was a lambda (the compiler couldn't automatically determine what its input was meant to be). So it would have to have a different name, or go in a different class, or perhaps just be private. What do you prefer?
"Different name" makes sense to me. NDArray.fromLinear(*dims) { /* filler */ }? createLinear?
The unnecessary IntArray crops up in a bunch of other places too. In retrospect it might've made sense to have a dedicated NDArrayIndex type that supplies both (doing the linearToIdx math as a lazy evaluation), but at this point such a change would break much code indeed. It might be worth looking into as a separate issue. @kyonifer?
A new method named createLinear sounds good to me.
NDArray's map is defined as fun map(f: (T) -> T): NDArray<T>. It would be more useful if map were a generic, in the sense of fun map<R>(f: (T) -> R): NDArray<R>. Here's some sample code which I'd like to compile, but can't:
Currently I'm using the following hacky workaround, which is dangerous and insane but seems to work.