Closed drmoose closed 5 years ago
Are you interested in code contributions from other people? If so, I'd like to try implementing this.
I'm definitely interested in contributions! Feel free to open an MR.
The current map is defined here. If it's easier for you, you can just fix the map extension function directly in that file and I'll deal with updating the codegen (that file is actually generated from templates in order to optimize boxing for primitives).
Great! I'll see whether I can figure out how the codegen works. Are there any examples of existing functions that depend on two type parameters?
The changes to the individual functions are quite simple. For example,
inline fun NDArray<Double>.map(f: (Double) -> Double)
= NDArray.doubleFactory.zeros(*shape().toIntArray()).fillLinear { f(this.getDouble(it)) }
gets changed to
inline fun <T> NDArray<Double>.map(crossinline f: (Double) -> T)
= NDArray(*shape().toIntArray(), filler={ f(this.getDouble(*it)) } )
Now I just need to figure out how that gets generated.
Ok, I think I've mostly figured it out. How do you want me to handle the type parameters? Currently you use the macro ${genDec}, which gets defined to either "" or "<T>". In this case it needs to be either "<R>" or "<T,R>". I could modify the build script to define another macro ${genDecR}, if you think that seems right.
The map functions defined on each specific primitive type are designed to avoid boxing elements. For example, in your quoted function above, NDArray<Double>.map(f: (Double) -> Double), we implement the mapping without converting the primitive doubles into boxed Double classes at any point. Avoiding boxing is also why these are implemented as inline extension functions. In order to avoid the performance penalty of boxing each element in the array, we'll have to leave those as Double -> Double mappings.
The generic map function is a catchall that maps an arbitrary generic T to another arbitrary T, and is designed to work with any NDArray, including those holding non-primitive types. The generic map will out of necessity box every element, because we don't know what the element type is. However, it currently maps T to T instead of T to R. I think the goal of this issue is to overhaul the above-linked generic map, and hopefully leave the primitive->primitive optimized maps as they are.
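[Editor's note: to make the T-to-R goal concrete, here is a minimal plain-Java sketch of a map that changes element type. The names are illustrative only and none of this is Koma's API; the per-element boxed call is exactly the cost being weighed in this thread.]

```java
import java.util.Arrays;
import java.util.function.Function;

public class GenericMapSketch {
    // Map an array of T to an array of R instead of T to T.
    // Every element passes through a generic (boxed) Function call.
    static <T, R> Object[] map(T[] src, Function<T, R> fn) {
        Object[] out = new Object[src.length];
        for (int i = 0; i < src.length; i++) {
            out[i] = fn.apply(src[i]); // boxed call per element
        }
        return out;
    }

    public static void main(String[] args) {
        Double[] xs = {1.25, 2.5, 3.75};
        // A Double -> Integer mapping, which a (T) -> T signature cannot express:
        Object[] ys = map(xs, d -> (int) Math.floor(d));
        System.out.println(Arrays.toString(ys)); // prints [1, 2, 3]
    }
}
```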
Where would the values get boxed? Suppose, for example, that f maps Double to Float. For each element it calls this.getDouble(), which returns a Double. That gets passed straight to f with no need for boxing. f returns a Float, which again shouldn't need to be boxed.
I do see one way it could be optimized, though. NDArray.invoke() expects filler to have type (IntArray) -> T. It passes that to fill(), which has to create an IntArray for every element. That could be avoided by having a second version of invoke() that expects filler to be (Int) -> T, and passes it to fillLinear().
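[Editor's note: a rough plain-Java analog of the difference being described, with hypothetical helper names rather than Koma's API. The nd-index path allocates a fresh int[] for every element, while the linear path hands the filler a flat index directly.]

```java
import java.util.Arrays;
import java.util.function.Function;
import java.util.function.IntToDoubleFunction;

public class FillDemo {
    // Analog of fill(): convert each linear position to an nd-index first,
    // allocating one int[] per element.
    static double[] fillViaNdIndex(int rows, int cols, Function<int[], Double> filler) {
        double[] out = new double[rows * cols];
        for (int i = 0; i < out.length; i++) {
            int[] nd = {i / cols, i % cols}; // fresh allocation on every iteration
            out[i] = filler.apply(nd);
        }
        return out;
    }

    // Analog of fillLinear(): pass the flat index straight through, no allocation.
    static double[] fillLinear(int size, IntToDoubleFunction filler) {
        double[] out = new double[size];
        for (int i = 0; i < size; i++) {
            out[i] = filler.applyAsDouble(i);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] a = fillViaNdIndex(2, 3, nd -> (double) (nd[0] * 3 + nd[1]));
        double[] b = fillLinear(6, i -> (double) i);
        System.out.println(Arrays.equals(a, b)); // prints true
    }
}
```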
The JVM implements generics using type erasure, which means e.g. List<T> is really just List<*> but with some compile-time hinting about the type of the objects in the list. So a generic fun <T, R> NDArray.map(fn: (T) -> R): NDArray<R>, when going from Double to Float as in your example, has to pass the Doubles into fn as instances of java.lang.Double, the boxed object type, and the function has to return java.lang.Float (boxed) rather than the primitives double and float, because the platform signature is actually (Any) -> Any (or, in Java-speak, Object to Object).
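[Editor's note: the erasure point can be observed directly with reflection in plain Java: the only method the JVM actually sees on a Function<Double, Float> is apply(Object) -> Object, so each call must box its argument and result.]

```java
import java.lang.reflect.Method;
import java.util.function.Function;

public class ErasureDemo {
    public static void main(String[] args) throws Exception {
        // The erased signature of Function.apply: Object in, Object out.
        Method apply = Function.class.getMethod("apply", Object.class);
        System.out.println(apply.getReturnType().getName());        // prints java.lang.Object
        System.out.println(apply.getParameterTypes()[0].getName()); // prints java.lang.Object

        Function<Double, Float> toFloat = d -> d.floatValue();
        Object result = toFloat.apply(0.5); // 0.5 is autoboxed to java.lang.Double here
        System.out.println(result.getClass().getName()); // prints java.lang.Float
    }
}
```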
Similarly, I'm pretty sure NDArray.invoke(fn: (IntArray) -> T) and NDArray.invoke(fn: (Int) -> T) can't simultaneously exist, because the JVM signatures of the two would both be (Any) -> Any.
Because of type erasure, and because Kotlin hides the difference between double and java.lang.Double with layers of syntax sugar and magic, figuring out how to avoid boxing where possible while still achieving this API is annoyingly hard. In particular, I'm not even completely sure I know a way to make a <T, R> version of map without causing platform signature clashes with the codegen'd primitive versions (double -> double and so on). That's why I filed the issue rather than just fixing it -- I didn't have (and still don't have) time to do the experimentation and sort it out enough to make a coherent PR.
For some of my previous PRs, I've had to disassemble compiled .class files to Java to make sure I wasn't boxing by mistake.
Let me look into this. Possibly I'm thinking about this wrong based on too many years of doing C++. Note that the signature fun <T, R> NDArray.map(fn: (T) -> R): NDArray<R> is what gets used for the generic version, not for primitive types. Those use signatures like fun <T> NDArray<Double>.map(crossinline f: (Double) -> T).
If you're concerned that any type parameters at all will lead to boxing, we could create explicit versions for all combinations, e.g. fun NDArray<Double>.map(crossinline f: (Double) -> Float). I'm pretty sure that should be completely safe.
Except if we do that, the compiler considers calls to map() to be ambiguous. :( So maybe you're right -- we'll have to do this only for the generic version, and accept that mappings between two different types have to pay the cost of boxing. But I'm going to ponder this a bit more. There really ought to be a way to do it efficiently for mappings between two primitive types.
Possibly I'm thinking about this wrong based on too many years of doing C++
Indeed, C++ templates are based on specialization, and behave very differently than Java's type erasure.
If you're concerned that any type parameters at all will lead to boxing
As a general rule of thumb, if you use T and therefore don't know the type at compile time, that argument will get erased to Object, as @drmoose explained. This is regrettably true even for Kotlin inline functions, as they don't get specialized to each call site but are instead generated once at compile time and inserted unmodified into each call site (there's a Kotlin YouTrack ticket discussing the pros and cons of this approach). Hopefully when Project Valhalla is released in a future JVM we'll get proper primitive generic specialization and all of this nonsense can be put behind us. Some day...
There really ought to be a way to do it efficiently for mappings between two primitive types.
There certainly are viable approaches to add cross-primitive mappings, the most obvious and ugly one being to use different platform names than map for the cross-primitive versions of the function. However we do it, these would be in addition to the current generic map and the specialized Double->Double, Int->Int, etc. maps. I'd therefore propose our path forward to be:
1) Fix the generic map as suggested in the original issue description, yielding a T -> R map with boxing, and leaving the current same-primitive maps intact. This should be enough to close out this issue.
2) If there's enough interest and use cases to warrant it, let's open a new issue to discuss the best solution for the cross-primitive maps you're proposing. I'm guessing there'll be some intrigue there on how to optimize for each platform too, as Kotlin/JS behaves differently than Kotlin/JVM (there's only one true numerical type in JS), and Kotlin/Native specialization is still in flux.
3) Since this project is trying to implement an efficient yet generic multiplatform numerical library, it does some things that seem curious at first glance, such as these codegen'd inline extension functions. I probably need to open an issue to track writing a FAQ that explains the tradeoffs of the current approach, as Koma has no documentation on that currently.
This is even true for Kotlin inline functions regrettably
That's too bad. I was hoping I could do something with inlines, but hadn't researched them yet to find out how they get implemented.
I spent a lot of years programming in Java, but for the last ten years I've mostly been doing C++ and Python (plus CUDA and OpenCL). I just recently started getting into Kotlin in the hope it might one day become a viable alternative to Python for scientific computing. Hence my interest in Koma, which looks like a really excellent piece of software.
I haven't come up with anything really satisfactory. One option is to only change the generic version, while leaving all the others unchanged. But that would mean you could only call it on generic arrays, not on the primitive-specialized ones. To use the original example of mapping Ints to Doubles, you couldn't call it on an array of primitive Ints, only on a generic array containing boxed Ints. That doesn't seem like something people should be encouraged to create.
An alternative would be to create a new method with a new name. That could be implemented as an ordinary method on the NDArray class. It would do boxing as part of the mapping process, but at least it could be called on arrays of primitives.
Also, it's possible to do this in a reasonably clean way with the methods that already exist.
val doubles = NDArray(3, 4, 5, filler={ ints.getDouble(*it) })
That's not quite as nice as just calling map(), but it's not so bad either. If you make it a little less concise, you can even (I think) do it without any boxing.
val doubles = NDArray.doubleFactory.create(3, 4, 5, filler={ ints.getDouble(*it) })
How about making the generic version into <T, reified R> and pulling the same ugly trick I did to make NDArray(3, 4, 5) { 10.0 } use the optimized backends based on the return type of the filler function? Would that work here too?
That's basically what my original version did: it called that function. But according to what @kyonifer said, that would still involve boxing, since filler would get implemented internally as a (IntArray) -> Any. So you could call it on specialized array types, and it would produce the correct specialized array type as output, but the execution of the map() function itself would be less efficient.
I think the assumption that boxing is slow may be incorrect. The JVM performs scalar replacement. If it can prove that the boxed number doesn't escape from the local function, it will optimize it away and implement it as a primitive. (See https://shipilev.net/jvm-anatomy-park/18-scalar-replacement/.) So I wrote a benchmark to test how fast the map function really is.
fun doMapping(x: NDArray<Double>): Double {
    val y = x.map { 0.5*it }
    return y[0, 0]
}

val x = NDArray.doubleFactory.randn(1000, 1000)
var sum = 0.0
// Warm up the JIT
for (i in 1..10)
    sum += doMapping(x)
// Time map function.
val time = kotlin.system.measureTimeMillis {
    for (i in 1..10)
        sum += doMapping(x)
}
println(time)
println(sum) // Avoid dead code elimination
I first ran it with the existing implementation of map(). Over five runs, the mean time was 1957 ms (range 1932 to 2036). I then replaced the implementation for Double arrays with this:
inline fun <reified T> NDArray<Double>.map(crossinline f: (Double) -> T) =
    when (T::class) {
        Double::class -> NDArray.doubleFactory.zeros(*shape().toIntArray()).fillLinear { f(this.getDouble(it)) as Double }
        Float::class -> NDArray.floatFactory.zeros(*shape().toIntArray()).fillLinear { f(this.getDouble(it)) as Float }
        Long::class -> NDArray.longFactory.zeros(*shape().toIntArray()).fillLinear { f(this.getDouble(it)) as Long }
        Int::class -> NDArray.intFactory.zeros(*shape().toIntArray()).fillLinear { f(this.getDouble(it)) as Int }
        Short::class -> NDArray.shortFactory.zeros(*shape().toIntArray()).fillLinear { f(this.getDouble(it)) as Short }
        Byte::class -> NDArray.byteFactory.zeros(*shape().toIntArray()).fillLinear { f(this.getDouble(it)) as Byte }
        else -> NDArray.createGeneric(*shape().toIntArray()) { f(this.getDouble(*it)) }
    } as NDArray<T>
Again running the benchmark five times, the mean time was 1951 ms (range was 1815 to 2056). Basically identical to the original version, even though in theory the numbers are getting boxed.
I'm confused. Both of your examples avoid boxing -- your second example uses crossinline and some runtime type checking, the same way NDArray.invoke does. Using a renamed copy of your .map function, the following Kotlin code:
fun testPeastmanMap(input: NDArray<Double>) = input.peastmanMap {
it.toInt() + 15
}
disassembles to
public static final NDArray testPeastmanMap(@NotNull NDArray input) {
Intrinsics.checkParameterIsNotNull(input, "input");
NDArray $receiver$iv = input;
KClass var2 = Reflection.getOrCreateKotlinClass(Integer.class);
NDArray $receiver$iv$iv;
NDArray $receiver$iv$iv;
int idx$iv$iv;
int var7;
NumericalNDArrayFactory var10000;
int[] var10001;
NDArray var28;
if (Intrinsics.areEqual(var2, Reflection.getOrCreateKotlinClass(Double.TYPE))) {
// unreachable code
} else {
double it;
if (Intrinsics.areEqual(var2, Reflection.getOrCreateKotlinClass(Float.TYPE))) {
// unreachable code
} else if (Intrinsics.areEqual(var2, Reflection.getOrCreateKotlinClass(Long.TYPE))) {
// unreachable code
} else if (Intrinsics.areEqual(var2, Reflection.getOrCreateKotlinClass(Integer.TYPE))) {
var10000 = NDArray.Companion.getIntFactory();
var10001 = CollectionsKt.toIntArray((Collection)input.shape());
$receiver$iv$iv = var10000.zeros(Arrays.copyOf(var10001, var10001.length));
$receiver$iv$iv = $receiver$iv$iv;
idx$iv$iv = 0;
for(var7 = $receiver$iv$iv.getSize(); idx$iv$iv < var7; ++idx$iv$iv) {
it = $receiver$iv.getDouble(idx$iv$iv);
int var25 = (int)it + 15; // <--- Here's the crossinline'd fn. No boxing.
$receiver$iv$iv.setInt(idx$iv$iv, var25);
}
var28 = $receiver$iv$iv;
} else if (Intrinsics.areEqual(var2, Reflection.getOrCreateKotlinClass(Short.TYPE))) {
// unreachable code
} else if (Intrinsics.areEqual(var2, Reflection.getOrCreateKotlinClass(Byte.TYPE))) {
// unreachable code
} else {
// unreachable code
}
}
return var28;
}
If you wanted to benchmark a version with boxing, you'd need to break the reified-at-runtime trick with something that forces the use of Java generics, like this:
inline fun <reified T> NDArray<Double>.badMap(noinline f: (Double) -> T) =
NDArray(*shape().toIntArray()) { f(getDouble(*it)) }
It just occurred to me we may all be overthinking this -- if T and R were both reified, we could just nest two instances of the reified-at-runtime trick. There'd be a boatload of unreachable invalid bytecode, but the JVM's branch prediction is smart enough that there'd be practically no runtime cost.
I may be the source of confusion, as I was too brief in my above comment. Kotlin does perform an optimization for lambdas that are passed into inline functions, which is something we exploit in NDArray.invoke and elsewhere to get around the boxing costs of passing in generic lambdas. This optimization inlines the lambda body into the caller and, if the caller is passing in primitives, avoids boxing; thus it will avoid creating a Function object at all in such a scenario. You can see that optimization at work in @drmoose's disassembled code, which produces no Function object and performs no boxing.
What I was referring to before is the fact that inline functions won't specialize on insertion, i.e. Kotlin's compiler won't create a type-specific version of the function for each value of T. This is the reason we have to do the when block dispatch, as otherwise it'll do generic (boxed) invocation.
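[Editor's note: a plain-Java caricature of that when-block dispatch, with illustrative names rather than Koma's API: the requested result type is tested once up front, and each branch runs a primitive-specialized loop, so no per-element call ever boxes.]

```java
import java.util.function.DoubleUnaryOperator;

public class DispatchDemo {
    // Test the result type once, then dispatch to a primitive-specialized loop
    // instead of invoking a generic (boxed) function per element.
    static Object mapDoubleArray(double[] src, Class<?> resultType, DoubleUnaryOperator fn) {
        if (resultType == double.class) {
            double[] out = new double[src.length];
            for (int i = 0; i < src.length; i++) out[i] = fn.applyAsDouble(src[i]); // no boxing
            return out;
        }
        if (resultType == int.class) {
            int[] out = new int[src.length];
            for (int i = 0; i < src.length; i++) out[i] = (int) fn.applyAsDouble(src[i]);
            return out;
        }
        throw new UnsupportedOperationException("generic fallback elided in this sketch");
    }

    public static void main(String[] args) {
        double[] xs = {1.5, 2.5};
        double[] doubled = (double[]) mapDoubleArray(xs, double.class, d -> 2 * d);
        System.out.println(doubled[0] + " " + doubled[1]); // prints 3.0 5.0
    }
}
```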
Some reading on the topic: https://medium.com/@BladeCoder/exploring-kotlins-hidden-costs-part-1-fbb9935d9b62
WRT this issue, I think using the ugly NDArray.invoke when-block dispatch hack for each primitive is a reasonable workaround for now. I'm still hopeful that Valhalla will complete some day, so these solutions will only be temporary.
we could just nest two instances of the reified-at-runtime trick
...aaand I'm wrong. getDouble(*it) as T is a boxing cast even when T is reified.
No wait, I'm wrong about being wrong. getDouble(*it) as T inside a when (T::class) branch isn't a boxing cast. I was reading one of the unreachable branches in the disassembly by mistake, so having a when(T::class) where each of its branches is a when(R::class) may be a viable approach, albeit a horribly ugly one.
For posterity, here are the test functions I used to investigate, which I think are perhaps partway down the right road. Any real solution would want to avoid the NDArray.invoke calls I have here in favor of fillLinear, because as @peastman pointed out, invoke allocates bunches of IntArrays we don't need. I'd flesh this out more, but at some point I need to go back to doing my actual job.
inline fun <reified R> NDArray<Double>.mapDouble(crossinline f: (Double) -> R) =
    NDArray(*shape().toIntArray()) { f(getDouble(*it)) } as NDArray<R>

inline fun <reified T, reified R> NDArray<T>.bruteForceMap(crossinline f: (T) -> R) =
    when (T::class) {
        Double::class -> (this as NDArray<Double>).mapDouble { f(it as T) }
        else -> NDArray(*shape().toIntArray()) { f(get(*it)) }
    } as NDArray<R>

fun testBruteForceMap(input: NDArray<Double>) = input.bruteForceMap { it.toInt() + 15 }
fun testBruteForceMap(input: NDArray<Double>) = input.bruteForceMap { it.toInt() + 15 }
Now that I've had a moment to sit down and play with some code, I'm thinking the when block isn't necessary at all. Starting with your original proposal, @peastman:
inline fun <reified T> NDArray<Double>.testMap(crossinline f: (Double) -> T)
= NDArray(*shape().toIntArray(), filler={ f(this.getDouble(*it)) } )
Then adding a generic version:
fun <T, R> NDArray<T>.testMap(f: (T) -> R)
= NDArray.createGeneric<R>(*shape().toIntArray(), filler = { f(this.getGeneric(*it)) })
All of these calls should be okay:
val a = NDArray.doubleFactory.zeros(1, 2)
val outa1 = a.testMap { it+144441 }
val outa2 = a.testMap { "greetings" }
val b = NDArray.createGeneric<String>(5,5) { "hi" }
val outb = b.testMap { "bye" }
Moreover, the boxing I thought would occur on inline fun <T> NDArray<Double>.map(crossinline f: (Double) -> T) is apparently optimized out by the compiler:
if (Intrinsics.areEqual(var5, Reflection.getOrCreateKotlinClass(Double.TYPE))) {
$receiver$iv$iv$iv = this_$iv$iv.getDoubleFactory().zeros(Arrays.copyOf(dims$iv$iv, dims$iv$iv.length));
$receiver$iv$iv$iv = $receiver$iv$iv$iv;
var9 = $receiver$iv$iv$iv.iterateIndices().iterator();
while(var9.hasNext()) {
var10 = (IndexIterator)var9.next();
nd$iv$iv$iv = var10.component1();
linear$iv$iv$iv = var10.component2();
it = $receiver$iv.getDouble(Arrays.copyOf(nd$iv$iv$iv, nd$iv$iv$iv.length));
double var16 = it + (double)144441;
$receiver$iv$iv$iv.setDouble(linear$iv$iv$iv, var16);
}
var29 = $receiver$iv$iv$iv;
} else if (Intrinsics.areEqual(var5, Reflection.getOrCreateKotlinClass(Float.TYPE))) {
...
So it looks like I was wrong about the boxing that would occur for Double->T inline functions. I was reasonably certain the compiler wouldn't optimize that away, but that seems not to be the case. Sorry for leading all of us down a rabbit hole for no reason. I think going with the original proposal might work.
How do you want me to handle the type parameters? Currently you use the macro ${genDec} which gets defined to either "" or "<T>". In this case it needs to be either "<R>" or "<T,R>". I could modify the build script to define another macro ${genDecR}, if you think that seems right.
That sounds reasonable to me.
Note that there are two different compilers involved. There's the Kotlin compiler, which converts source to Java bytecode. And then there's Hotspot, which is the JIT that converts bytecode to native machine code. Scalar replacement is an optimization done by Hotspot. If you just disassemble the class files, you might still see them working with boxed integers. But if you disassemble the native machine code, you'll find that it's optimized them away.
Anyway, it's good to know the Kotlin compiler is also smarter than I thought. I was afraid the crossinline meant it would have to implement f as an actual function that returned Any. I'll go ahead and try implementing it this way.
Note that there are two different compilers involved.
Indeed there are. Whenever possible I tend not to rely on HotSpot optimizations saving us, though: Koma also builds on Native and JavaScript, which may not be as smart; there are other JVM implementations in the wild which may not be as optimized; and HotSpot may fail to detect the optimization in complicated situations. My hope is that by expressing things as optimized Kotlin in the first place, we'll be in a better position to get good performance everywhere.
That being said, when running on OpenJDK it is useful to run with -XX:+PrintAssembly enabled to see what asm the JIT is generating, but I mostly use it as a debugging tool when things are slower than I expect.
inline fun <reified T> NDArray<Double>.testMap(crossinline f: (Double) -> T)
= NDArray(*shape().toIntArray(), filler={ f(this.getDouble(*it)) } )
This version turns out to be about 50% slower in my test. The reason is that NDArray.invoke() uses fill() instead of fillLinear(). I'd like to create an alternate version where filler is an (Int) -> T instead of an (IntArray) -> T, so it can use the more efficient fillLinear(). But that would break type inference when filler was a lambda (the compiler couldn't automatically determine what its input was meant to be). So it would have to have a different name, or go in a different class, or perhaps just be private. What do you prefer?
"Different name" makes sense to me. NDArray.fromLinear(*dims) { /* filler */ }? createLinear?
The unnecessary IntArray crops up in a bunch of other places too. In retrospect it might've made sense to have a dedicated NDArrayIndex type that supplies both (doing the linearToIdx math as a lazy evaluation), but at this point such a change would break much code indeed. It might be worth looking into as a separate issue. @kyonifer?
A new method named createLinear sounds good to me.
NDArray's map is defined as fun map(f: (T) -> T): NDArray<T>. It would be more useful if map were a generic, in the sense of fun map<R>(f: (T) -> R): NDArray<R>. Here's some sample code which I'd like to compile, but can't:
Currently I'm using the following hacky workaround, which is dangerous and insane but seems to work.