halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

Consider removing TBAA metadata #3322

Open steven-johnson opened 6 years ago

steven-johnson commented 6 years ago

While investigating a Halide pipeline inside Google that compiled very slowly, @alinas commented:

The issue does not appear to be recent. ... Here's my suggestion as an alternative for improving compile times today: remove the TBAA tags generated by Halide. These are supposed to help with alias analysis, but in the tests I've been looking at for a while now they do not help and add a lot of overhead when used. I also have fairly strong suspicions that the tbaa metadata is not generated correctly, because when I use the tbaa tags I see optimization results that are inconsistent and do not make sense - happy to elaborate, if interested. ... For [the test in question], compile time for the module drops from 85s to 55s when TBAA tags are not generated.

If TBAA tags are slowing compilation and not improving the generated code, we should obviously consider removing them. This is an area that predates my work on Halide, so I haven't looked into it. Opening an issue for discussion.
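
For anyone who hasn't looked at this part of codegen, the metadata in question gets attached to loads/stores roughly like this (a sketch of the general LLVM mechanism only, with made-up node names; not a copy of our actual codegen):

```cpp
// Illustrative only -- roughly how TBAA-style tags are attached to a load
// via LLVM's MDBuilder. The idea is to give accesses to different buffers
// (and different constant offsets) distinct tags so alias analysis can
// separate them.
#include "llvm/IR/Instructions.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/MDBuilder.h"

void tag_load(llvm::LoadInst *load, llvm::LLVMContext &ctx) {
    llvm::MDBuilder md(ctx);
    llvm::MDNode *root = md.createTBAARoot("example buffer root");
    // A node for one particular buffer, parented to the root.
    llvm::MDNode *buf = md.createTBAAScalarTypeNode("input_buf", root);
    // An access tag: base type, access type, constant offset within the buffer.
    llvm::MDNode *tag = md.createTBAAStructTagNode(buf, buf, /*Offset=*/0);
    load->setMetadata(llvm::LLVMContext::MD_tbaa, tag);
}
```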

abadams commented 6 years ago

They were absolutely necessary to get llvm to do something sane about 4-5 years ago, but llvm has probably done a lot with alias analysis since. It also used to be the case that llvm alias analysis was hopelessly slow without some kind of tbaa hinting to cut it short.

I assume this is all about the constant-offset tbaa tags. We should definitely still be telling llvm different buffer inputs do not alias - there's no way to infer that from the code.
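
(For concreteness, the non-TBAA way to state that would be something along these lines -- a sketch that assumes the generated function takes one raw pointer per buffer, which glosses over how we actually pass buffers:)

```cpp
// Sketch: mark each pointer argument noalias, assuming one pointer per
// buffer input/output. Not what Halide emits today; just the alternative
// mechanism being discussed.
#include "llvm/IR/Attributes.h"
#include "llvm/IR/Function.h"

void mark_buffers_noalias(llvm::Function *fn) {
    for (llvm::Argument &arg : fn->args()) {
        if (arg.getType()->isPointerTy()) {
            arg.addAttr(llvm::Attribute::NoAlias);
        }
    }
}
```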

I've also been wondering if we can cut down on the set of llvm passes. E.g. do we really want autovectorization of loops? For well-scheduled code it tends to just increase code size by vectorizing our scalar loop tails. I think we definitely want the SLP vectorizer.

These questions can only really be answered by turning things off and benchmarking all the important things. The app most sensitive to this in the public repo is the fft, and I don't see any difference on it with or without the constant tbaa tags.

alinas commented 6 years ago

I agree entirely with you that there may be a lot of things that can be cleaned up, and that the way to go is to turn things off, benchmark, and see if anything is affected.

My sample size is fairly small, since I've only looked at pathological cases. For these, adding tbaa tags consistently increases compile time.

For the test Steven mentioned, I only looked at the impact on compile time. Removing the tags to decrease compile time may, however, introduce missed optimizations.

But I looked at other tests from the point of view of LLVM's LICM pass, which relies heavily on alias analysis. For these I looked at both compile time and the optimizations performed in that time. To get an idea of the time increase, here are a few numbers for comparison:

| Test | Time (s) w/ basicaa | Time (s) w/ basicaa + tbaa |
| --- | --- | --- |
| Test1 | 6.055 | 54.277 |
| Test2 | 8.924 | 112.306 |
| Test3 | 7.942 | 81.472 |

Again, these are pathological test cases, and please note this is a debug build, but the time increase is still far too large.

What makes this worse is that, in the context of LICM, we are interested in taking loads and stores out of loops, when legal, in order to decrease runtime. The statistics for the above tests showed fewer instructions being taken out with TBAA (e.g. for Test1 above, the number of instructions hoisted dropped from 59814 to 45691). Considering that BasicAA is fairly conservative, I'm inclined to think the TBAA tags are either incorrect or overly conservative, and that they prevent these extra optimizations.
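
To make the LICM connection concrete, the pattern at stake is roughly the following made-up example: the load of `gain[0]` can only be hoisted out of the loop if alias analysis can prove the stores through `out` never touch `gain`, so an incorrect or overly conservative TBAA answer directly costs hoisted loads.

```cpp
// Made-up example of the kind of hoist LICM wants to do. The load of gain[0]
// is loop-invariant only if the stores through `out` provably don't write to
// `gain`; that's exactly the question alias analysis (BasicAA, TBAA, ...)
// has to answer.
void scale(float *out, const float *in, const float *gain, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = in[i] * gain[0];  // hoistable iff out and gain don't alias
    }
}
```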

I've analyzed a few loops where the optimization fails with TBAA, but I need to dig deeper: the loops in question have hundreds of instructions and are hard to follow visually given the size of the function.

My preliminary conclusion is that having the TBAA tags in their current form increases compile time and may increase runtime as well. We certainly need more benchmarking of the critical apps to determine how to move forward.

zvookin commented 6 years ago

@abadams wrote: "E.g. do we really want autovectorization of loops?" No.

abadams commented 6 years ago

With wider benchmarking one easy thing we should test is dropping to the equivalent of O2 or Os in the pass manager and seeing what happens. We'd probably still want O3 for the backends.
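
With the legacy pass manager that's roughly this knob (a sketch of the experiment, not the code we actually have; the backend codegen level is set separately on the TargetMachine):

```cpp
// Sketch: build the module/function pass pipelines at O2 with the loop
// vectorizer off and the SLP vectorizer on. Uses the legacy
// PassManagerBuilder; exact API differs across LLVM versions.
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Transforms/IPO/PassManagerBuilder.h"

void populate_passes(llvm::legacy::PassManager &mpm,
                     llvm::legacy::FunctionPassManager &fpm) {
    llvm::PassManagerBuilder b;
    b.OptLevel = 2;          // O2 instead of O3 for the middle-end passes
    b.SizeLevel = 0;         // set to 1 for the Os-like variant
    b.LoopVectorize = false; // skip loop auto-vectorization
    b.SLPVectorize = true;   // keep the SLP vectorizer
    b.populateFunctionPassManager(fpm);
    b.populateModulePassManager(mpm);
}
```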

steven-johnson commented 6 years ago

I'm working on connecting to some benchmarking tools inside Google that I think will make running these experiments a whole lot more tractable on our side. It will probably take a day or three to get things up and running.

dsharletg commented 6 years ago

I'm pretty sure we added the TBAA to make the FFT fast, so we should benchmark that to see how removing it affects things.

abadams commented 6 years ago

The public fft is the one thing I did benchmark, and it made zero difference.

dsharletg commented 6 years ago

I just benchmarked a pretty big variety of code, and I'm seeing quite a lot of 10-25% regressions when removing TBAA.

dsharletg commented 6 years ago

It would be nice if we had a simpler mechanism for communicating aliasing information. Our use of TBAA is a kludge.

I suppose another option is to figure out what optimizations are taking advantage of TBAA and do them ourselves?
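
The kind of simpler mechanism I have in mind is something like LLVM's scoped noalias metadata -- just a sketch of the mechanism, not a worked-out proposal:

```cpp
// Sketch of scoped noalias metadata: give each buffer its own alias scope,
// tag its accesses with that scope, and tag every access with the list of
// scopes it is known not to alias. Illustrative only.
#include "llvm/IR/Instructions.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/MDBuilder.h"

void tag_scoped_noalias(llvm::Instruction *access_to_a,
                        llvm::LLVMContext &ctx) {
    llvm::MDBuilder md(ctx);
    llvm::MDNode *domain = md.createAnonymousAliasScopeDomain("pipeline");
    llvm::MDNode *scope_a = md.createAnonymousAliasScope(domain, "buf_a");
    llvm::MDNode *scope_b = md.createAnonymousAliasScope(domain, "buf_b");
    // This access belongs to buffer a's scope...
    access_to_a->setMetadata(llvm::LLVMContext::MD_alias_scope,
                             llvm::MDNode::get(ctx, {scope_a}));
    // ...and is known not to alias anything in buffer b's scope.
    access_to_a->setMetadata(llvm::LLVMContext::MD_noalias,
                             llvm::MDNode::get(ctx, {scope_b}));
}
```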