`-ffp-contract=fast`, `-flto` and `-fno-semantic-interposition` (see https://github.com/microsoft/knossos-ksc/issues/956) seem to hurt a bit for large problem sizes but help a bit for small problem sizes (5c4f2f7c505da8be45bcbd2f1ddd0882bb92e764).
There's a tremendously useful flag `-fopt-info-vec-all`, from which I learn:
/home/toelli/knossos-ksc/build/torch_extensions/ksc_dl_activations__manual__vgelu_embedded_cpp_inlined_map_flags/ksc-main.cpp:26:54: missed: couldn't vectorize loop
/home/toelli/knossos-ksc/build/torch_extensions/ksc_dl_activations__manual__vgelu_embedded_cpp_inlined_map_flags/ksc-main.cpp:30:39: missed: not vectorized: relevant stmt not supported: _7 = erf (_6);
so it looks like there's trouble vectorising `erf`. I will investigate how to make that happen. Presumably PyTorch is doing it.
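For reference, the loop being complained about presumably has roughly this shape (a hypothetical reconstruction, not the actual generated ksc-main.cpp):

```cpp
// Hypothetical sketch of a gelu-style loop whose erf call gcc refuses to
// vectorise ("relevant stmt not supported: _7 = erf (_6)").
// Diagnostics like the above come from compiling with -fopt-info-vec-all.
#include <cmath>
#include <cstddef>

void vgelu(const float* in, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = 0.5f * in[i] * (1.0f + std::erf(in[i] * 0.70710678f));  // 0.5*x*(1+erf(x/sqrt(2)))
  }
}
```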
Vectorising `erf` on `float32`s is a thing, but only for up to 256 bits at a time. 256 bits at a time is 8 `float32`s and we are (a bit more than) 8x slower than PyTorch, so this seems to be a smoking gun.
Perhaps `gcc` just doesn't vectorise `erf` (it didn't vectorise trig functions nine years ago at least). One group of suggestions in that StackOverflow is to find a library that does vectorise `erf`. Is that our best approach? I am rather out of my depth here.

I wonder about trying the Intel compiler -- that's one set of flags across Linux and Windows.
> Perhaps `gcc` just doesn't vectorise `erf` (it didn't vectorise trig functions nine years ago at least). One group of suggestions in that StackOverflow is to find a library that does vectorise `erf`. Is that our best approach? I am rather out of my depth here.
We could try `#pragma omp simd`. I've no experience with this but it looks like it could be a good fit.
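As a minimal sketch of what that might look like on the loop above: note that `#pragma omp simd` only asks the compiler to vectorise; whether `erf` actually gets a SIMD implementation still depends on the compiler and maths library, so this may or may not be enough on its own.

```cpp
// Hypothetical: the same gelu loop annotated with OpenMP SIMD.
// Compile with e.g. -O3 -march=native -fopenmp-simd (gcc/clang).
#include <cmath>
#include <cstddef>

void vgelu_omp(const float* in, float* out, std::size_t n) {
  #pragma omp simd
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = 0.5f * in[i] * (1.0f + std::erf(in[i] * 0.70710678f));
  }
}
```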
Alternatively, maybe we should add a handwritten example which uses `_mm256_erf_ps` directly. It would be useful to know whether this is sufficient to match PyTorch, or whether there are some further tricks needed.
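To make the idea concrete, a hand-vectorised version might look something like the sketch below. Caveat: `_mm256_erf_ps` is an Intel SVML intrinsic, so this assumes a compiler that ships SVML (e.g. the Intel compiler); plain gcc doesn't provide it out of the box.

```cpp
// Hypothetical hand-vectorised gelu using AVX and SVML's _mm256_erf_ps.
// Processes 8 float32s (256 bits) per iteration, with a scalar tail.
#include <immintrin.h>
#include <cmath>
#include <cstddef>

void vgelu_avx(const float* in, float* out, std::size_t n) {
  const __m256 half      = _mm256_set1_ps(0.5f);
  const __m256 one       = _mm256_set1_ps(1.0f);
  const __m256 inv_sqrt2 = _mm256_set1_ps(0.70710678f);  // 1/sqrt(2)

  std::size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    __m256 x = _mm256_loadu_ps(in + i);
    __m256 e = _mm256_erf_ps(_mm256_mul_ps(x, inv_sqrt2));   // vectorised erf
    __m256 y = _mm256_mul_ps(_mm256_mul_ps(half, x), _mm256_add_ps(one, e));
    _mm256_storeu_ps(out + i, y);
  }
  for (; i < n; ++i) {  // remainder elements
    out[i] = 0.5f * in[i] * (1.0f + std::erf(in[i] * 0.70710678f));
  }
}
```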
> I wonder about trying the Intel compiler -- that's one set of flags across Linux and Windows.
Currently we hand over responsibility for choice of compiler to PyTorch (and thence I think to `setuptools`). Perhaps we can configure the compiler using something like https://setuptools.readthedocs.io/en/latest/deprecated/distutils/apiref.html?highlight=Extension#module-distutils.ccompiler
Now at ~12 minutes on CI. There's plenty of time overnight, but we'll need to figure out how much to run and where if we go much bigger, e.g. we might not want to do larger sizes on PR checks.
Yes, CI time is going to become a significant problem in the near future I think. These benchmarks don't actually need to run in CI, or at least we don't need to get real numbers out of them. It would be fine to run them for a tiny number of rounds, say 5.
We do have to be a bit careful about compiler modes and alternative compilers as we're building Python extensions. I ran into issues on Windows building with `gcc` and then trying to use the result in an `msvc`-built Python. Very possibly doable, but it might be involved to add `icc` to the mix.
> Currently we hand over responsibility for choice of compiler to PyTorch (and thence I think to `setuptools`). Perhaps we can configure the compiler using something like https://setuptools.readthedocs.io/en/latest/deprecated/distutils/apiref.html?highlight=Extension#module-distutils.ccompiler
Just for information, we use the JIT extension route rather than the `setuptools` route: https://pytorch.org/tutorials/advanced/cpp_extension.html#jit-compiling-extensions
> we use the JIT extension route rather than setuptools route

Is it not the same thing? "All arguments are forwarded to the `setuptools.Extension` constructor."
Concretely, we don't trigger `setuptools` itself, which has issues on Windows with `venv`. Instead there's a more custom pipeline that PyTorch provides of writing out Ninja files, building, and dynamically loading for us.
OK, in that case it's probably safest to simply stick with what PyTorch provides. Since PyTorch is (presumably) getting good performance with `erf` under `gcc`, there must be a way for us to do so too. David suggested some ways: https://github.com/microsoft/knossos-ksc/pull/959#issuecomment-884095737
Adding some benchmarks for gelu, along the same lines as relu3.

Duplication

There is duplication between the `embedded_cpp_entry_points` and flags here and in `relu3.py`, but my inclination is to merge this first and then tidy up the duplication later. The `embedded_cpp_entry_points` in particular will need @dcrc2's help to fix, I think.

Results

Sadly not as much luck with gelu as with relu3. The same approach (simple loop with compiler flags) is 8.9x slower than PyTorch on size 1M and 5.9x slower than PyTorch on size 65k (although 1.8x faster on size 16).

I don't have a good intuition for why we get these numbers. Any ideas for improvements @awf @dcrc2? I will play around a bit, particularly with flags.