pytorch / benchmark

TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance.

Benchmark Channels Last #277

Open wconstab opened 3 years ago

wconstab commented 3 years ago

channels-last has an API already: https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html so it may be as simple as doing model.to(memory_format=torch.channels_last) and making sure the same happens to inputs.
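A minimal sketch of what that looks like (using torchvision's resnet50 purely as a stand-in model, not one of the benchmark's actual models):

```python
import torch
import torchvision.models as models

# Convert both the model's parameters/buffers and the input tensor to
# channels-last (NHWC) memory format; the logical shapes stay NCHW,
# only the underlying strides change.
model = models.resnet50().eval().to(memory_format=torch.channels_last)
x = torch.randn(8, 3, 224, 224).to(memory_format=torch.channels_last)

with torch.no_grad():
    out = model(x)  # convolutions can pick NHWC kernels where available
```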

@jamesr66a I want to clarify the value prop of doing this at the benchmark infra level. The downside is the runtime cost of collecting 2x the measurements and sifting through 2x the data. The upside is some new signal that's potentially useful. Another (hidden) downside may be that we miss the chance to incorporate channels-last as an optimization that we do automatically in our compiler. (Granted, we don't have a very full story for using compiler techniques on training benchmarks, so that's a gap right now.)

I'm conflicted on adding this for the above reason. Thoughts?
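For concreteness, here is a hypothetical sketch of what the "2x measurements" variant could look like at the harness level; the function and names below are illustrative, not TorchBench's actual API:

```python
import time
import torch

def measure(model, example_input, memory_format=torch.contiguous_format, iters=20):
    """Average forward-pass latency under a given memory format (illustrative only)."""
    model = model.to(memory_format=memory_format)
    example_input = example_input.to(memory_format=memory_format)
    with torch.no_grad():
        model(example_input)  # warm-up
        start = time.perf_counter()
        for _ in range(iters):
            model(example_input)
    return (time.perf_counter() - start) / iters

# Running every benchmark in both formats doubles the measurement matrix:
# results = {fmt: measure(model, x, memory_format=fmt)
#            for fmt in (torch.contiguous_format, torch.channels_last)}
```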

jamesr66a commented 3 years ago

I don't really see how benchmarking channels-last as an explicit API precludes us from doing automatic optimizations in our compiler. On the contrary, it can help expose gaps and create development targets for doing so.

Additionally, I don't think we should be siloing "The Compiler™" vs. PyTorch the whole product. One of the goals of PyTorch is to ensure that the user can get the performance they want. Whether that comes from The Compiler™ is a detail.

wconstab commented 3 years ago

Yea, I shouldn't have said 'the compiler'. What I meant by that is that we want to deliver perf enhancements to users without them changing their code. We also sometimes want to deliver perf enhancements that do require them changing their code. Whether 'the compiler' or something else delivers the former is inconsequential.

But for something like channels-last, a user could change their code to enable it today. In setting up the suite, we explicitly didn't go and hand-optimize the model code; instead, we used the models as they were in the wild. This is a proxy for finding a balance between what totally naive users might do and what our most advanced perf guides recommend.

So with that framing, do you argue for

jamesr66a commented 3 years ago

So I think my thinking comes from the fact that the API-facing layout (NCHW) was actually an arbitrary choice, iiuc, in that that's what cuDNN did at the time. However, in thinking about how to best implement these operations on a concrete machine nowadays, many (most?) machines prefer NHWC (including GPUs, ironically). I'm basically just pointing out that testing a lower-performance case due to historical baggage isn't ideal.
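For reference, the NCHW/NHWC difference in PyTorch is purely a question of strides; the logical shape stays NCHW either way:

```python
import torch

x = torch.randn(2, 3, 4, 4)                       # logical shape is always NCHW
print(x.stride())                                 # (48, 16, 4, 1): channel is a slow dim

nhwc = x.to(memory_format=torch.channels_last)
print(nhwc.stride())                              # (48, 1, 12, 3): channel is the fastest dim

print(x.shape == nhwc.shape)                                  # True, same logical shape
print(nhwc.is_contiguous(memory_format=torch.channels_last))  # True
```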

I think there's a separate conversation to be had about how aggressively we should make this optimization automatic, but I think that's orthogonal to whether we benchmark these things or not.

Let me think about which of the options would fit with my thinking here.