gussmith23 / glenside

A pure, low-level tensor program representation enabling tensor program optimization via program rewriting. See the web demo at https://gussmith23.github.io/glenside-web-demo/

Glenside should find a blocking strategy for Mobilenet convolutions #42

Open gussmith23 opened 4 years ago

gussmith23 commented 4 years ago

This is mostly an issue of egg search performance, combined with the fact that Glenside's rewrites are blowing up the egraph. See the comment thread below for the debugging chain.

gussmith23 commented 4 years ago

I've been working on this, but I'm stuck even on the first convolution. I'm feeling stumped at the moment. Here's what I know:

The issue seems to stem from the image not getting split up into 64x64 chunks.

Searching for this pattern:

    (compute dot-product
     (access-cartesian-product
      (access-pad
       (access-pad
        (access-flatten
         (access (access-tensor conv_block_1_conv_weight) 1))
        zero-padding
        1
        0
        37)
       zero-padding
       0
       0
       32)
      ?0))

Produces a few eclasses for ?0: a slice, and two pads, one of which is shown here:

EClass { id: 55671, nodes: [AccessPad([481, 6, 22, 22, 22])], data: AccessPattern(AccessPatternData { shape: IxDynImpl(Inline(1, [12544, 0, 0, 0])), item_shape: IxDynImpl(Inline(1, [64, 0, 0, 0])), zero_regions: ...

So the shape, 12544x64, is amenable to slicing into 64x64 chunks. And I have, in the rewrites:

        slice_concatenate_accesses(0, SliceConcatenateStrategy::DivideInto { segment_size: 64 }),
        slice_concatenate_accesses(1, SliceConcatenateStrategy::DivideInto { segment_size: 64 }),

This rewrite should fire on this access, should it not? 12544 is divisible by 64.

Need to figure out why it's not. It's definitely the largest access I've tried to split up with this rewrite. Maybe that's coming into play.
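As a quick sanity check on the arithmetic, here is a minimal sketch in plain Rust (not Glenside code; `segments` is a hypothetical helper introduced only for illustration). It confirms that both dimensions of the 12544x64 access divide evenly by 64, and counts how many segments each dimension would yield.

```rust
/// Number of `segment_size`-wide segments along a dimension, or `None` if the
/// dimension does not divide evenly. Hypothetical helper, not part of Glenside.
fn segments(dim_len: usize, segment_size: usize) -> Option<usize> {
    if segment_size != 0 && dim_len % segment_size == 0 {
        Some(dim_len / segment_size)
    } else {
        None
    }
}

fn main() {
    // Shape of the access reported for ?0: 12544 x 64.
    let rows = segments(12544, 64);
    let cols = segments(64, 64);
    println!("dim 0: {:?} segments, dim 1: {:?} segments", rows, cols);
}
```

So a full split along dimension 0 would mean 196 slices of this one access, which is a lot of work for a single rewrite application.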

gussmith23 commented 4 years ago

To be specific: for ?0, I expected to see an access-concatenate, but I don't.

gussmith23 commented 4 years ago

So it seemed that the slice-concatenate rewrite wasn't firing. Profiling showed that a lot of time was being spent in the rewrite itself.

The rewrite is bulky. Specifically, it's bulky when using the DivideInto strategy, because that strategy slices up an access as many times as it can, all at once. I added a new strategy, SliceOnce, which slices an access just once. Tentatively, it's at least partly working.
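A rough sketch of the difference in plain Rust, outside of Glenside and egg. The function names are hypothetical, and the assumption that a single slice peels one segment off the front is mine; the real SliceOnce may choose its split point differently.

```rust
/// All-at-once splitting: every `segment_size`-wide range along a dimension.
/// Hypothetical model of a DivideInto-style strategy.
fn divide_into(dim_len: usize, segment_size: usize) -> Vec<(usize, usize)> {
    if segment_size == 0 || dim_len % segment_size != 0 {
        return Vec::new();
    }
    (0..dim_len / segment_size)
        .map(|i| (i * segment_size, (i + 1) * segment_size))
        .collect()
}

/// Single split: one segment plus the remainder. Hypothetical model of a
/// SliceOnce-style strategy (assumes the split happens at the front).
fn slice_once(dim_len: usize, segment_size: usize) -> Option<((usize, usize), (usize, usize))> {
    if segment_size == 0 || segment_size >= dim_len {
        return None;
    }
    Some(((0, segment_size), (segment_size, dim_len)))
}

fn main() {
    // One application on the 12544-long dimension.
    println!("all at once: {} slices", divide_into(12544, 64).len());
    println!("once: {:?}", slice_once(12544, 64));
}
```

Under this model, one all-at-once application creates 196 slices, while a single split creates two and leaves the remainder to be split again in later iterations.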

gussmith23 commented 4 years ago

Things are working a bit better, but even after an hour-long run on the RTML server, it still doesn't tensorize. Given more time, though, it keeps finding more systolic arrays.

gussmith23 commented 4 years ago

Another potential easy route is to profile the rewrites running in egg, and start banning rewrites.

gussmith23 commented 4 years ago

Experimenting with that. Egg actually uses a backoff scheduler by default, which temporarily bans rewrites that match too many times. I've made sure the important rewrites aren't getting banned.
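To illustrate the idea, here is a simplified, self-contained model of a backoff scheduler in plain Rust. It is not egg's code: `RuleState`, its fields, and the values 1000 and 5 are assumptions chosen for illustration. The model bans a rule for a while when it produces too many matches, doubles both the match threshold and the ban length each time, and lets selected rules opt out. In egg itself, I believe the opt-out is `BackoffScheduler::do_not_ban`.

```rust
/// Simplified model of per-rule backoff state. Hypothetical; not egg's types.
struct RuleState {
    match_limit: usize,  // matches allowed before the first ban
    ban_length: usize,   // length of the first ban, in iterations
    times_banned: u32,   // how often this rule has been banned so far
    banned_until: usize, // first iteration at which the rule may run again
    never_ban: bool,     // exempt "important" rules from banning
}

impl RuleState {
    fn new(match_limit: usize, ban_length: usize, never_ban: bool) -> Self {
        RuleState { match_limit, ban_length, times_banned: 0, banned_until: 0, never_ban }
    }

    /// Returns true if the rule may apply its matches in this iteration.
    fn allow(&mut self, iteration: usize, num_matches: usize) -> bool {
        if self.never_ban {
            return true;
        }
        if iteration < self.banned_until {
            return false;
        }
        // Threshold and ban length both double with every ban.
        let threshold = self.match_limit << self.times_banned;
        if num_matches > threshold {
            let ban = self.ban_length << self.times_banned;
            self.times_banned += 1;
            self.banned_until = iteration + ban;
            return false;
        }
        true
    }
}

fn main() {
    let mut bulky = RuleState::new(1000, 5, false);
    let mut important = RuleState::new(1000, 5, true);
    for iteration in 0..8 {
        let b = bulky.allow(iteration, 5000);
        let i = important.allow(iteration, 5000);
        println!("iter {}: bulky allowed = {}, important allowed = {}", iteration, b, i);
    }
}
```

In this model a rule that keeps producing thousands of matches spends most iterations banned, which is why it matters to check that the slicing and tensorization rewrites are exempt.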

gussmith23 commented 4 years ago

Just to be clear: so far, my thought has been that all the right rewrites are in place, and it's simply taking a very long time to tensorize because the blocking is insane. I should actually confirm hunches about a few things here, rather than going off assumptions:

gussmith23 commented 4 years ago

For the time being, I'm shelving this in favor of #43. I want Glenside to be able to statically block up computations, but honestly, it's not worth the effort right now.