halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org
Other
5.86k stars 1.07k forks source link

Faster vars used tracking in simplify let visitor #8205

Closed abadams closed 5 months ago

abadams commented 5 months ago

This visitor shows up as the main cost of lowering in very large pipelines.

This visitor is for tracking which lets are actually used for real inside the body of a let block (as opposed to the tracking we do when mutating, which is approximate, because we could construct an Expr that uses a Var and then discard it in a later mutation).

The old implementation made a map of all variables referenced, and then checked each let name against that map one by one. If there are a small number of lets outside a huge Stmt, this is bad, because the data structure has to hold a number of names proportional to the Stmt size instead of proportional to the number of lets.

This new implementation instead makes a hash set of the let names, and than traverses the Stmt, removing names from the set as they are encountered. This is a big speed-up.

We then make the speed-up larger by about the same factor again doing the following:

1) Only add names to the map that might be used based on the recursive mutate call. These are very very likely to be used, because we saw them at least once, and mutations that remove all uses of a Var are rare.

2) The visitor should early out when the map becomes empty. The let variables are often all used immediately, so this is frequent.

Speeds up lowering of local laplacian by 1.44x, 2.6x, and 4.8x for 20, 50, and 100 pyramid levels respectively.

Speeds up lowering of resnet50 by 1.04x.

Speeds up lowering of lens blur by 1.06x

Built on #8202

Complementary with #8204 though the total speed-up from merging both is not going to be the product of the two speed-ups. The two PRs solve the same local laplacian pathological lowering time problem in two different ways.

rootjalex commented 5 months ago

It looks like this PR also contains #8202 ? Might need to merge with main to clean it up

abadams commented 5 months ago

Merge with main done

abadams commented 5 months ago

With the last merge, when combined with #8204, this is 6.7x faster for local laplacian with 100 pyramid levels. But Amdahl's law has kicked in here. For just the pass that was slow (the simplification pass just after vectorization, because it did the first simplification after unify_duplicate_lets), this is actually over 400x faster.

rootjalex commented 5 months ago

Is it reasonable to write tests for this new functionality, or do existing tests already stress it enough?

abadams commented 5 months ago

Everything breaks if you remove a let that you should not have. I also confirmed the existing tests fail if you leave too many lets in (it messes up some instruction selection tests).

abadams commented 5 months ago

Just needs an approval