gussmith23 / glenside

A pure, low-level tensor program representation enabling tensor program optimization via program rewriting. See the web demo at https://gussmith23.github.io/glenside-web-demo/

Implement conv2d-im2col-fc #79

Closed gussmith23 closed 3 years ago

gussmith23 commented 3 years ago

Scott can support an im2col'ed conv2d as a fully-connected systolic array invocation taking NHWC input. This means we can fold a bunch more access patterns into a hardware op, letting us generate less code on our side (and hopefully making things faster by relying more on his hardware).

To be clear, what I mean is this: There are roughly three ways that we've discussed implementing convolutions.

  1. "normal" convolutions. See #45 for a description of these. I have more info in my notes from talking with Scott.
  2. Im2col convolutions, where the data is already in im2col format -- we call this a "fully connected" conv2d because, thanks to the im2col trick, we're computing a conv2d but it looks like a single fully-connected (i.e. matmul/dense) invocation. (There's a short sketch of the trick just after this list.) These are already implemented in Glenside.
  3. Im2col convolutions, where the data is in memory as NHWC and Scott's hardware reads it in as im2col. This is what we want to implement here.
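
To make the im2col trick concrete, here is a minimal NumPy sketch (mine, not Glenside code; stride 1 and no padding assumed). It unrolls each receptive field of an NHWC tensor into a row, so the whole conv2d becomes a single matmul:

```python
import numpy as np

def conv2d_im2col_nhwc(data, weights):
    """Conv2d via im2col: data is NHWC, weights are HWIO, stride 1, no padding."""
    n, h, w, c = data.shape
    kh, kw, ci, co = weights.shape
    assert c == ci
    oh, ow = h - kh + 1, w - kw + 1
    # im2col: one row per output pixel, one column per weight element.
    cols = np.empty((n * oh * ow, kh * kw * c))
    row = 0
    for img in range(n):
        for y in range(oh):
            for x in range(ow):
                cols[row] = data[img, y:y + kh, x:x + kw, :].reshape(-1)
                row += 1
    # The conv2d is now a single fully-connected (matmul) invocation.
    out = cols @ weights.reshape(kh * kw * c, co)
    return out.reshape(n, oh, ow, co)  # NHWC output

data = np.random.rand(1, 5, 5, 3)     # NHWC
weights = np.random.rand(3, 3, 3, 8)  # HWIO
print(conv2d_im2col_nhwc(data, weights).shape)  # (1, 3, 3, 8)
```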

gussmith23 commented 3 years ago

This is interesting for Glenside because layout comes up in a big way.

It's taken me until now to realize that all of our current systolic array usages have seemed "simple" in my mind because (maybe among other things) they're not layout-dependent. That is, when we're using a systolic array just as a matrix multiplier, there's a very simple concept of layout, because there are only two dimensions in each input. All of our systolic array usages so far use the systolic array as a matrix multiplier -- "FC" (fully connected) mode, as Scott likes to say. I've been mentally blocked on how we'll implement systolic arrays in "conv2d" mode (as opposed to FC mode), knowing that it would be more complicated, but unsure as to why. Now I realize that it's (at least partly) because the systolic array makes assumptions about layout (Scott's hardware prefers NHWC/HWIO), so when we create the rewrites that map to these pieces of hardware, we need to be able to detect layout.
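
To illustrate with plain NumPy (nothing Glenside-specific here): a matmul has only two dimensions per operand, so layout barely comes up, whereas a conv2d's operands can be stored several ways and must be transposed to whatever the hardware assumes:

```python
import numpy as np

# FC mode: two 2-D operands; layout is essentially a non-issue.
act = np.random.rand(16, 64)   # (batch, features)
wgt = np.random.rand(64, 32)   # (features, units)
out = act @ wgt

# Conv2d mode: the same data admits several layouts, and the hardware's
# assumption (here: NHWC activations, HWIO weights) has to be met.
nchw = np.random.rand(1, 3, 32, 32)   # how Glenside currently computes conv2d
nhwc = nchw.transpose(0, 2, 3, 1)     # N,C,H,W -> N,H,W,C for the systolic array
oihw = np.random.rand(8, 3, 5, 5)
hwio = oihw.transpose(2, 3, 1, 0)     # O,I,H,W -> H,W,I,O for the weights
```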

As I actively think it through, though, I don't think it will be a problem -- or at least not as big a problem as I thought. It will be kind of hacky, but not too bad.

The core thing to think about is that the Glenside implementation of conv2d already contains implicit layout information. We are doing very specific things with each dimension, and so we can tell the layout based on the computation. In the future, dimensions and layouts will be more explicit (see #6), and we shouldn't need to "infer" layout from the computation pattern as I'm describing here. But for now, we need to match on the computation that we know represents a conv2d in a specific layout. Specifically, we implement conv2ds on NCHW tensors -- NHWC tensors are first transposed to NCHW. So we can match on the NCHW convolution and turn it into a systolic-array-conv2d-nchw, which we don't actually have an implementation for, and then, from there, make a whole bunch of rewrites that turn it into invocations of other systolic arrays.
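
Here is a very rough sketch of that match-and-rewrite step, with expressions as nested tuples. The operator names are hypothetical placeholders, not Glenside's actual syntax, and Glenside's real rewrites run over an e-graph, which this doesn't attempt to model:

```python
def rewrite_conv2d_to_systolic(expr):
    """Match the (hypothetical) NCHW conv2d pattern and rewrite it to a
    (hypothetical) systolic-array invocation; recurse into subexpressions."""
    if isinstance(expr, tuple) and expr[0] == "conv2d-nchw":
        _, data, weights = expr
        return ("systolic-array-conv2d-nchw",
                rewrite_conv2d_to_systolic(data),
                rewrite_conv2d_to_systolic(weights))
    if isinstance(expr, tuple):
        return (expr[0],) + tuple(rewrite_conv2d_to_systolic(a) for a in expr[1:])
    return expr

# An NHWC input that was transposed to NCHW still matches, because the
# transpose is just another subexpression:
expr = ("conv2d-nchw", ("transpose", "activations", "NHWC->NCHW"), "weights")
print(rewrite_conv2d_to_systolic(expr))
```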

This isn't exactly ideal -- namely because it implies that Glenside has a layout preference, which it does: Glenside prefers NCHW when expressing convolutions, purely because of how the language and access patterns are currently structured. In the future, I hope that #6 will eliminate this preference.

gussmith23 commented 3 years ago

I think this is mostly done now, with #82. It seems to be working in the eval; we can find the new systolic array types in ResNet. This is great; it means we can probably start doing some interesting extraction with different sizes and types of arrays.

gussmith23 commented 3 years ago

Closing. We'll see if we end up using this, but it's in the language, at least!