pytorch / torchdynamo

A Python-level JIT compiler designed to make unmodified PyTorch programs faster.
BSD 3-Clause "New" or "Revised" License

Poor performance of generated code for operations bottlenecked on advanced indexing #1293

Closed fdrocha closed 1 year ago

fdrocha commented 1 year ago

Came across this issue when comparing performance in TorchInductor between the lowering in https://github.com/pytorch/torchdynamo/pull/1270 and the decomp in https://github.com/pytorch/pytorch/pull/85403.

It seems that most of the difference comes from the fact that in the lowering I was able to use int32s for computed indices, while in the decomp I am forced to use int64s (otherwise I get an exception IndexError: tensors used as indices must be long, byte or bool tensors). Just changing the lowering to use int64 instead of int32 made it 53% slower, which accounts for most of the performance difference versus the decomp.
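For context, a minimal sketch of the eager-mode constraint the decomp runs into: index tensors passed to advanced indexing must be int64 ("long"), byte, or bool, so the decomp cannot simply hand int32 indices to the indexing op (at the time of this issue, int32 index tensors raised the IndexError quoted above).

```python
import torch

x = torch.arange(6.0)
# The decomposition must build its computed indices as int64 tensors
# to satisfy eager advanced indexing; the lowering was free to keep
# them int32 inside the generated kernel.
idx = torch.tensor([0, 2, 4], dtype=torch.int64)
print(x[idx].tolist())  # [0.0, 2.0, 4.0]
```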

Note that all the benchmarks were run on a desktop card (GeForce RTX 2060); this is possibly much less of an issue on server cards.

Any thoughts @ezyang, @jansel, @ngimel, @Chillee?

ezyang commented 1 year ago

32-bit indexing buys us a lot on consumer cards; most of our eager kernels are written for two indexing regimes. However, I think it matters less for the V100s and A100s, and inductor is explicitly NOT targeting consumer cards right now.

That being said, I think in the long term there will continue to be hobbyist interest in inductor performance on consumer cards, so maybe we can come up with an indexing strategy that is not too disruptive for inductor in general but can automatically specialize to smaller indices. However, I do NOT believe that this should be done at the decomp level; instead, the compiler should know when it can narrow index types when the tensors are small enough.

fdrocha commented 1 year ago

That makes sense, thank you. In the meantime, do you think the lowering should explicitly check the tensor sizes and pick int32 or int64 accordingly? Or just unconditionally use int64?


ezyang commented 1 year ago

Well, let's check with the inductor folks about how easy/hard it would be to infer that a tensor is for indexing. If it's easy, no need to goop up the decomps.

soumith commented 1 year ago

I think Inductor should make a pass for automatically strength-reducing the generated code to int32 if the input tensor sizes are all less than int32 limits. That's what we do in ATen right now to get those performant kernels.

cc: @ngimel @jansel
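A plain-Python sketch of the check such a pass might perform (hypothetical helper names, not actual inductor code): narrow index arithmetic to int32 only when every input tensor's element count, and hence any flat index into it, fits in a signed 32-bit integer.

```python
INT32_MAX = 2**31 - 1

def all_fit_int32(shapes):
    """Hypothetical guard for an inductor-style strength-reduction pass:
    True iff every tensor is small enough that flat indices fit in int32."""
    def numel(shape):
        n = 1
        for d in shape:
            n *= d
        return n
    return all(numel(s) <= INT32_MAX for s in shapes)

print(all_fit_int32([(1024, 1024), (4096,)]))  # True: both well under 2**31
print(all_fit_int32([(2**16, 2**16)]))         # False: 2**32 elements overflow int32
```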

soumith commented 1 year ago

So, for the decomps, in my opinion, we should keep the int64 indexing as-is.

jansel commented 1 year ago

Yes, it should be possible to do this automatically.

fdrocha commented 1 year ago

> I think Inductor should make a pass for automatically strength-reducing the generated code to int32 if the input tensor sizes are all less than int32 limits. That's what we do in ATen right now to get those performant kernels.

It might be a little more subtle than that, actually. In the decomp in https://github.com/pytorch/pytorch/pull/85403, for instance, you need to compute x indices up to width+1 and then clamp to width-1. So to use int32 you actually need the intermediate width+1 <= int_max, not just width <= int_max.
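A tiny sketch of that tighter bound (hypothetical helper, just to make the off-by-one concrete): the pass has to check that the largest intermediate value, not just the tensor size itself, fits in int32.

```python
INT32_MAX = 2**31 - 1

def can_use_int32_indices(width):
    """Hypothetical check for the decomp above: candidate x indices go up
    to width + 1 before being clamped to width - 1, so the intermediate
    width + 1 must fit in int32, not merely width."""
    return width + 1 <= INT32_MAX

print(can_use_int32_indices(2**31 - 2))  # True:  (2**31 - 2) + 1 == INT32_MAX
print(can_use_int32_indices(2**31 - 1))  # False: width + 1 overflows int32
```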

lezcano commented 1 year ago

That's a good point. Now, for all practical applications, we can probably just check that it doesn't have more than 2^31 - 17 elements, really.

fdrocha commented 1 year ago

Related to this, I ran across the add.sat.s32 PTX instruction. It might be worth considering using this in the lowerings that might overflow int32.
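For reference, add.sat.s32 is PTX's saturating signed 32-bit add: instead of wrapping on overflow, the result is clamped to the int32 range. A plain-Python model of that semantics:

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31

def add_sat_s32(a, b):
    """Python model of the PTX add.sat.s32 semantics: compute the exact
    sum, then clamp into [INT32_MIN, INT32_MAX] rather than wrapping."""
    return max(INT32_MIN, min(INT32_MAX, a + b))

print(add_sat_s32(INT32_MAX, 1))  # 2147483647 (saturates instead of wrapping)
print(add_sat_s32(5, 7))          # 12
```

Saturation would keep an index like width+1 from wrapping to a negative value before the clamp to width-1, which is why it could make int32 safe in lowerings that might otherwise overflow.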