microsoft / knossos-ksc

Compiler with automatic differentiation

gelu upper bounds #959

Closed toelli-msft closed 2 years ago

toelli-msft commented 2 years ago

Adding some benchmarks for gelu, along the same lines as relu3.

Duplication

There is duplication between the embedded_cpp_entry_points and flags here and in relu3.py but my inclination is to merge this first and then tidy up the duplication later. The embedded_cpp_entry_points in particular will need @dcrc2's help to fix I think.

Results

Sadly not as much luck with gelu as with relu3. The same approach (simple loop with compiler flags) is 8.9x slower than PyTorch on size 1M and 5.9x slower than PyTorch on size 65k (although 1.8x faster on size 16).

I don't have a good intuition for why we get these numbers. Any ideas for improvements @awf @dcrc2? I will play around a bit, particularly with flags.
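For context, the "simple loop" approach for gelu is essentially the following (a sketch with illustrative names, not the actual ksc-generated ksc-main.cpp):

```cpp
#include <cmath>
#include <cstddef>

// gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2))), applied elementwise.
// Illustrative sketch only; names do not match the generated code.
void vgelu(const float* in, float* out, std::size_t n) {
    const float inv_sqrt2 = 0.70710678f;  // 1 / sqrt(2)
    for (std::size_t i = 0; i < n; ++i) {
        float x = in[i];
        out[i] = 0.5f * x * (1.0f + std::erf(x * inv_sqrt2));
    }
}
```

The per-element cost is dominated by the erf call, which is relevant to the vectorisation findings further down the thread.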

[image: benchmark results comparing gelu against PyTorch]

toelli-msft commented 2 years ago

-ffp-contract=fast, -flto and -fno-semantic-interposition (see https://github.com/microsoft/knossos-ksc/issues/956) seem to hurt a bit for large problem sizes but help a bit for small problem sizes

[image: benchmark results with the additional flags]

(5c4f2f7c505da8be45bcbd2f1ddd0882bb92e764)

toelli-msft commented 2 years ago

There's a tremendously useful flag, -fopt-info-vec-all, from which I learn:

/home/toelli/knossos-ksc/build/torch_extensions/ksc_dl_activations__manual__vgelu_embedded_cpp_inlined_map_flags/ksc-main.cpp:26:54: missed: couldn't vectorize loop
/home/toelli/knossos-ksc/build/torch_extensions/ksc_dl_activations__manual__vgelu_embedded_cpp_inlined_map_flags/ksc-main.cpp:30:39: missed: not vectorized: relevant stmt not supported: _7 = erf (_6);

So it looks like the trouble is vectorising erf. I will investigate how to make that happen. Presumably PyTorch is doing it.

toelli-msft commented 2 years ago

Vectorising erf on float32s is possible, but only for up to 256 bits at a time.

toelli-msft commented 2 years ago

256 bits at a time is 8 float32s and we are (a bit more than) 8x slower than PyTorch, so this seems to be a smoking gun.

toelli-msft commented 2 years ago

Perhaps gcc just doesn't vectorise erf (it didn't vectorise trig functions nine years ago at least). One group of suggestions in that StackOverflow is to find a library that does vectorise erf. Is that our best approach? I am rather out of my depth here.

awf commented 2 years ago

I wonder about trying the intel compiler -- that's one set of flags across linux and windows.

dcrc2 commented 2 years ago

> Perhaps gcc just doesn't vectorise erf (it didn't vectorise trig functions nine years ago at least). One group of suggestions in that StackOverflow is to find a library that does vectorise erf. Is that our best approach? I am rather out of my depth here.

We could try #pragma omp simd. I've no experience with this but it looks like it could be a good fit.

Alternatively, maybe we should add a handwritten example which uses _mm256_erf_ps directly. It would be useful to know whether this is sufficient to match PyTorch, or whether there are some further tricks needed.
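A minimal sketch of the #pragma omp simd suggestion (assuming gcc with -fopenmp-simd; the gelu body and names are illustrative, not the generated code). Note the pragma only asks for vectorisation: gcc still needs a vector variant of erf (e.g. from a SIMD math library) to actually vectorise the call, so this alone may not close the gap:

```cpp
#include <cmath>
#include <cstddef>

// Illustrative sketch of the "#pragma omp simd" suggestion.
// Compile with e.g.: g++ -O3 -fopenmp-simd ...
// Without -fopenmp-simd the pragma is ignored and the loop stays scalar.
void vgelu_simd(const float* in, float* out, std::size_t n) {
    const float inv_sqrt2 = 0.70710678f;  // 1 / sqrt(2)
#pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        float x = in[i];
        out[i] = 0.5f * x * (1.0f + std::erf(x * inv_sqrt2));
    }
}
```

Whether gcc then vectorises the erf call depends on a SIMD erf implementation being available to it, which is exactly the open question above.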

toelli-msft commented 2 years ago

> I wonder about trying the intel compiler -- that's one set of flags across linux and windows.

Currently we hand over responsibility for choice of compiler to PyTorch (and thence I think to setuptools)

https://github.com/microsoft/knossos-ksc/blob/0ab1a292db2e0e4f47be6ad151ef910f3089eb35/src/python/ksc/compile.py#L334-L341

Perhaps we can configure the compiler using something like https://setuptools.readthedocs.io/en/latest/deprecated/distutils/apiref.html?highlight=Extension#module-distutils.ccompiler

toelli-msft commented 2 years ago

> Now at ~12 minutes on CI. There's plenty of time overnight, but we'll need to figure out how much to run and where if we go much bigger, e.g. we might not want to do larger sizes on PR checks.

Yes, CI time is going to become a significant problem in the near future I think. These benchmarks don't actually need to run in CI, or at least we don't need to get real numbers out of them. It would be fine to run them for a tiny number of rounds, say 5.

cgravill commented 2 years ago

We do have to be a bit careful about compiler modes and alternative compilers since we're building Python extensions. On Windows I ran into issues building with gcc and then trying to use the result in an MSVC-built Python. Very possibly doable, but it might be involved to add icc to the mix.

cgravill commented 2 years ago

> Currently we hand over responsibility for choice of compiler to PyTorch (and thence I think to setuptools)
>
> https://github.com/microsoft/knossos-ksc/blob/0ab1a292db2e0e4f47be6ad151ef910f3089eb35/src/python/ksc/compile.py#L334-L341
>
> Perhaps we can configure the compiler using something like https://setuptools.readthedocs.io/en/latest/deprecated/distutils/apiref.html?highlight=Extension#module-distutils.ccompiler

Just for information, we use the JIT extension route rather than setuptools route:

https://pytorch.org/tutorials/advanced/cpp_extension.html#jit-compiling-extensions

toelli-msft commented 2 years ago

> we use the JIT extension route rather than setuptools route

Is it not the same thing?

> All arguments are forwarded to the setuptools.Extension constructor.

https://pytorch.org/docs/stable/cpp_extension.html

cgravill commented 2 years ago

> > we use the JIT extension route rather than setuptools route
>
> Is it not the same thing?
>
> > All arguments are forwarded to the setuptools.Extension constructor.
>
> https://pytorch.org/docs/stable/cpp_extension.html

Concretely, we don't trigger setuptools itself, which has issues on Windows with venvs. Instead, PyTorch provides a more custom pipeline that writes out Ninja files, builds, and dynamically loads the extension for us.

toelli-msft commented 2 years ago

OK, in that case it's probably safest to simply stick with what PyTorch provides. Since PyTorch is (presumably) getting good performance with erf under gcc, there must be a way for us to do so too. David suggested some approaches in https://github.com/microsoft/knossos-ksc/pull/959#issuecomment-884095737