limo1996 opened 3 years ago
@ezyang
We're not fundamentally opposed to AVX512, but it will be difficult to determine when we can actually safely turn it on because on Intel it limits max turbo frequency and so can be a pessimization in some cases; additionally, sometimes the AVX512 FLOPS are about the same as AVX2 and so not an improvement either. That being said, I don't think it's a big problem if we start adding infrastructure for it, so yes, I think we'd accept patches for this.
cc @ngimel @malfet @VitalyFedyunin @dzhulgakov
Cool! I have two questions regarding the process: should I work on my fork or have a branch here? And should I create one huge pull request in the end, or do it incrementally? Thanks.
Incremental is better; you could use https://github.com/ezyang/ghstack/ (I may need to give you perms to push branches). Testing strategy is going to be the first and foremost question reviewers will have.
Ok. I will try to use ghstack for managing multiple dependent PRs. You mean permission to push branches into pytorch, right? Yes, can you please give me that permission so I can start working on it? Thanks. Regarding testing: unit tests are trivial since they already exist for Vec256, and they are generic enough to handle Vec512 if tweaked a bit. The more problematic part will be testing kernels that use Vec512. Here we can test that each kernel produces the same output for both Vec256 and Vec512. If there is already some test suite for kernels, then we don't need to do that.
Open a dummy PR to pytorch repo first so CLA bot can ask you for a CLA, then I can give you rights.
Here we can test that each kernel produces the same output for both Vec256 and Vec512. If there is some test suite for kernels, then we don't need to do it.
We have a test suite for kernels, but it's not going to test AVX512 by default without prodding.
Done
If you already have some code, could you please make a DRAFT PR and CC me on it, so we can provide you with early feedback? Ideally, we would love to have vec256 become vec.
My idea was to have a unifying header for vec256 and vec512 so that in the end only vec is exposed. Something like:
#if defined(CPU_CAPABILITY_AVX512)
template <typename T>
using vec = typename at::vec512::Vec512<T>;
#elif defined(CPU_CAPABILITY_AVX2)
template <typename T>
using vec = typename at::vec256::Vec256<T>;
#endif
where vec512 would implement the same functionality as vec256. What do you think?
I will push some code as soon as my company approves the CLA.
Hello @ezyang, even I wanted to work on this a few weeks back, but I searched old issues with strings such as AVX512 & vectorization, and found an issue which was closed stating PyTorch already supports AVX512, so I thought AVX512 vectorization was somehow supported implicitly. That's why I thought that, as a vectorization project, I could only enable float16 on AVX2 (and AVX512 implicitly).
Enabling AVX512 isn't very time-consuming, as in most places one can simply replace 256 with 512 & 32 with 64. For BFloat16, though, some newer intrinsics can potentially be used.
May I assist with this initiative? I'm very excited about it!
We're not fundamentally opposed to AVX512, but it will be difficult to determine when we can actually safely turn it on because on Intel it limits max turbo frequency and so can be a pessimization in some cases; additionally, sometimes the AVX512 FLOPS are about the same as AVX2 and so not an improvement either.
If workloads are mostly AVX512, then it'd indeed be faster. Also, as you already know, a complete cache line is used in AVX512 computations, so that makes computation more memory-bound than AVX2, so to speak. I think that's why Intel Ice Lake & Tiger Lake processors have 48 KB L1D caches instead of the usual 32 KB, so AVX512 is even faster on them! Starting with the 11th gen, Intel is even churning out low-end laptop processors with AVX512 support, and even these have 48 KB L1D caches! So, it's probably time to jump on the AVX-512 bandwagon.
Intel HW prefetcher can't prefetch across page boundaries, so determining a prefetch distance & using prefetch compiler builtins (or assembly) can further speed up AVX512 computation in some cases.
found an issue which was closed stating PyTorch already supports AVX512, so I thought AVX512 vectorization was somehow supported implicitly
@imaginary-person Huh, that's not my recollection, and is not consistent with #26109 and #4825 which are two older issues on the subject.
May I assist with this initiative? I'm very excited about it!
Sure! Though I suppose you and @limo1996 will have to figure out how to coordinate patches.
@imaginary-person Huh, that's not my recollection, and is not consistent with #26109 and #4825 which are two older issues on the subject.
Ah, you're right! I had misread #51466. I missed an opportunity there.
Sure! Though I suppose you and @limo1996 will have to figure out how to coordinate patches.
Thank you! :)
@imaginary-person I am working on a draft PR. Will let you know once it's ready and we can start to iterate on it. Thanks!
We do use AVX512 when OneDNN kernels are called. However, all point-wise operators (and everything else based on Vec256) are still on AVX2.
Bear in mind the fact that adding AVX512 kernels will require us to either make it a separate compile option or increase binary size to ship 4 types of kernels simultaneously (with properly updated dispatch mechanics).
I think having some option like USE_VEC512 would be a good start. Later on we can try to ship it simultaneously.
@malfet suggests that we might want to just kill AVX (no suffix), since most reasonable users will have AVX2.
Makes sense to me. I currently don't see a reason for having CPU_CAPABILITY_AVX, as it's always used in or with CPU_CAPABILITY_AVX2.
@ezyang, I remember someone having posted an issue last month, in which they mentioned they were using an old Xeon server for production that didn't have AVX2! You had also replied to that post. There might be more such users, so maybe a deprecation notice can be provided for the benefit of such users now, perhaps via a sticky post, and then a brief mention in the release notes upon release? Thanks!
TBH, the user in question didn't even care about perf, it was a reproducibility problem. But sure, I made a post to https://dev-discuss.pytorch.org/t/dropping-avx-support-avx2-only/202
Thank you, @ezyang! That's the one. They weren't concerned about performance, but they mentioned that the server without AVX2 belonged to a cloud computing platform. So, I thought such users might suddenly encounter perf problems with a newer PyTorch release if they continued using old machines.
Ideally, we would love to have vec256 become vec with the support of various architectures.
@VitalyFedyunin @limo1996
I think that a cleaner way to have a single namespace vec is:

1. Creating aten/src/ATen/cpu/vec with subdirectories vec256 and vec512.
2. Renaming the vec256 namespace to vec. The files in the vec512 directory would also have the vec namespace.
3. Renaming Vec256 (and Vec512) to a common name, say Vec, and then replacing Vec256 with Vec everywhere in the current codebase.
4. aten/src/ATen/cpu/vec would have vec.h and functional.h files, apart from the subdirectories vec256 and vec512.
For example, if AVX512 is to be enabled, vec.h can simply have the following content:
#if defined(CPU_CAPABILITY_AVX512)
#include <ATen/cpu/vec/vec512/vec512.h>
#else
#include <ATen/cpu/vec/vec256/vec256.h>
#endif
vec.h should be included wherever vec256.h has been included (apart from vec.h itself, and the files in the vec256 subdirectory, of course). A draft PR (#56992) has been created to illustrate.
Please share your opinion. Thanks!
Bear in mind the fact that adding AVX512 kernels will require us to either make it a separate compile option or increase binary size to ship 4 types of kernels simultaneously (with properly updated dispatch mechanics)
Thanks to the way compilation is performed, this option works for shipping all kernel types (no-vec, AVX2, and AVX512, since I disabled AVX in that draft PR).
Please confirm which AVX-512 instruction sets should be initially supported. For instance, Caffe2 Perfkernels seem to use the AVX512F, AVX512VL, and AVX512DQ instruction sets.
The latest Intel Xeon server I have access to is actually an old one from Q3'17 (Intel Xeon Gold 6142), which supports AVX512F, AVX512BW, AVX512CD, AVX512DQ, and AVX512VL.
However, I don't have access to newer Intel machines (AMD currently doesn't seem to provide AVX512 support, but will do so in the near future) that support newer AVX512 instruction sets, such as the AVX512-VBMI category (e.g. _mm512_permutex2var_epi8). Should we create some CPU capability flags for more recent AVX-512 instruction sets such as AVX512-VBMI as well, and use those flags to conditionally use different intrinsics, wherever appropriate? Thank you!
Hey @limo1996, I added AVX512 support in ATen via vec512 in the draft PR #56992 (it's working locally, but might take me some time to get things sorted on the CI infrastructure). It compiles with the AVX512 instruction sets AVX512F, AVX512BW, AVX512DQ, and AVX512VL. Maybe we can combine this draft PR with the one you'd submit.
Right now, I'd like to focus on correctness, before focusing on optimization.
Hey @ezyang,
Testing strategy is going to be the first and foremost question reviewers will have.
We have a test suite for kernels, but it's not going to test AVX512 by default without prodding.
Can you please clarify which test suite you're referring to? Thanks!
I'll modify vec256_test_all_types.cpp to test AVX512 as well & will rename this file to vec_test_all_types.cpp.
Also, since both AVX2 & AVX512 use FMA, I was thinking of comparing their respective results for equality.
Any advice is most welcome! :)
Thank you!
Hello @limo1996, can you please confirm if your company has permitted you to contribute to PyTorch (the ATen AVX512 implementation)?
Since it's been more than a week, is there a possibility that they might not approve? If so, do you think they'd allow you to at least contribute incremental revisions?
BTW, do you have access to machines that support AVX512-VBMI, and other newer AVX512 instruction sets? The machines available for general-access at my school don't support the latest AVX512 instruction sets.
Thanks!
@ezyang @VitalyFedyunin @limo1996
I think what might make such a PR easier to review for reviewers is annotated comments on the AVX512 intrinsics used that don't have corresponding AVX2 intrinsics.
From @apaszke's old comment, it looks like we might have to test on machines with GPUs as well. Perhaps we can run benchmarks to address this concern.
I'd complete testing the implementation today, as I'd have access to a Xeon Silver 4114 machine with an Nvidia P100 GPU only tomorrow before May 7 (on & after May 7, I can run @limo1996's code on those machines, if his company permits him to contribute). As for testing on an AVX512-supporting laptop with a GPU, I can try to get my hands on one. Getting a new laptop from Best Buy & returning it after testing seems unethical, though.
@imaginary-person I have it ready (there are todos for some intrinsics) but haven't received approval yet :((
Thanks for the update, @limo1996! Maybe your company sees it as helping the competition (Intel), and that too, by using company resources. TBH, I wouldn't blame them if they do.
@imaginary-person I am trying my best to get it..
(there are todos for some intrinsics)
@limo1996, can you please merge my implementation locally for those TODOs, if possible? Thanks!
How did you test, BTW, apart from modifying vec256_test_all_types.cpp?
Hey @limo1996, these 4 tests are failing for AVX512 after a little progress I made today:
[ PASSED ] 253 tests.
[ FAILED ] 4 tests, listed below:
[ FAILED ] QuantizationTests/0.Quantize, for qint8
[ FAILED ] QuantizationTests/0.ReQuantizeFromInt, for qint8
[ FAILED ] QuantizationTests/1.Quantize, for quint8
[ FAILED ] QuantizationTests/1.ReQuantizeFromInt, for quint8
Please let me know whether they're passing at your end. If they are, I won't have to read up on quantization in order to debug them. Thanks!
EDIT: I had used incorrect syntax for an intrinsic, due to which these tests were failing.
@imaginary-person does _mm512_div_epi64 compile for you? For me, GCC complains that no such function is defined in the intrinsic header file, but I can see it here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_div_epi64&expand=5236,2130
Regarding tests, I replaced Vec256 with Vec512 in vec256_test_all_types.cpp. I think they are good enough as unit tests. I just don't know how to test kernels.
Thanks for your response, @limo1996.
does _mm512_div_epi64 compile for you? For me GCC complains that such function is not defined in the intrinsic header file

EDIT: I'm using _mm512_div_pd, but not _mm512_div_epi64.
Regarding tests I replaced Vec256 with Vec512 in vec256_test_all_types.cpp.

Just FYI, one of the tests there has hardcoded inputs for 256-bit vectors.
Yes, it works for me. Which gcc version are you using, BTW? Some intrinsics are not present in older gcc versions.
Hmm. Ok, I will try to update my gcc version and see.
Just FYI, one of the tests there has hardcoded inputs for 256 bit vectors.
Yeah, I know. I hardcoded 512-bit vectors for now, but this test needs to be adjusted to different vector lengths.
I hardcoded 512b vectors for now
I did so too! I created another one with an #if defined(CPU_CAPABILITY_AVX512) :) (as I'm using a unified namespace vec for both vec256 & vec512, and Vec256 & Vec512 have a common name, Vectorize, for now). How about using a unified namespace, BTW, as @VitalyFedyunin suggested?
Yeah, I am using vec as the unified namespace and VecType as the currently used vector type. What is your gcc version? Because even after updating, _mm512_div_epi64 is not found for me.
What is your gcc version
10.2.0
even after update _mm512_div_epi64 is not found for me.
Sorry for the confusion, @limo1996! I was using github.com on my phone earlier. I just checked & am not using that intrinsic anywhere. Where do you need to use it, BTW?
I'm using _mm512_div_pd, though.
BTW, I just pushed updated code from yesterday.
Yeah, it doesn't work for gcc 10.2 either.
Yeah it doesn't work for gcc 10.2 either..
It wouldn't work on any available gcc version. Where are you using it, BTW?
we might want to just kill AVX (no suffix), since most reasonable users will have AVX2.
@malfet @ezyang,
We'd disable AVX support in ATen, but should caffe2_perfkernels still retain AVX support? Thanks!
@limo1996, have you tried building locally on Windows (with Visual Studio) and on macOS? I haven't, but CI builds fail for them, so I guess some additional changes would be required to build in those environments. I don't have access to a Mac with AVX512 support, though.
BTW, the existing unit tests don't test BFloat16 support. We'd have to add it.
We'd disable AVX support in ATen, but should caffe2_perfkernels still retain AVX support?
Nyeh, it doesn't matter. Kill it if it makes your life easier, leave it alone if that's easier.
All 257 tests are passing for me now! I've pushed the code if you'd like to test locally.
@ezyang, no tests for BFloat16 were added when vec256_bfloat16.h was initially added to the codebase by the Intel folks. I'll add a few (only one or two are probably fine, to check whether the conversions work correctly, as the computations happen in float anyway).
BTW, I got access to an Intel Xeon Silver 4114 machine with an Nvidia GPU (to ensure @apaszke's concerns are addressed) sooner than expected, so I'll run benchmarks tomorrow.
@imaginary-person
My company did not give me permission, so I will work on it, but after working hours. I finally had time to look over your code, and I think there are many similarities. Overall, I am having both Vec512 and Vec256 and unifying them in vec.h as follows:
#if defined(USE_VEC512)
#include <ATen/cpu/vec512/vec512.h>
namespace at { namespace vec {
using at::vec512::maximum;
using at::vec512::minimum;
using at::vec512::fmadd;
using at::vec512::clamp_min;
using at::vec512::clamp_max;
using at::vec512::clamp;
using at::vec512::convert;
using at::vec512::convert_to_int32;
using at::vec512::int_same_size_t;
using at::vec512::cast;
using at::vec512::mask_gather;
using at::vec512::gather;
using at::vec512::convert_to_int_of_same_size;
using at::vec512::interleave2;
using at::vec512::deinterleave2;
}}
template <typename T>
using VecType = typename at::vec512::Vec512<T>;
#else
#include <ATen/cpu/vec256/vec256.h>
namespace at { namespace vec {
using at::vec256::maximum;
using at::vec256::minimum;
using at::vec256::fmadd;
using at::vec256::clamp_min;
using at::vec256::clamp_max;
using at::vec256::clamp;
using at::vec256::convert;
using at::vec256::convert_to_int32;
using at::vec256::int_same_size_t;
using at::vec256::cast;
using at::vec256::mask_gather;
using at::vec256::gather;
using at::vec256::convert_to_int_of_same_size;
using at::vec256::interleave2;
using at::vec256::deinterleave2;
}}
template <typename T>
using VecType = typename at::vec256::Vec256<T>;
#endif
What do you think about that?
I compiled and ran your branch and 2 tests are failing, namely ProcessGroupMPITest and vec_test_all_types_AVX512.
These vector tests fail:
[ FAILED ] 6 tests, listed below:
[ FAILED ] Arithmetics/2.Multiplication, where TypeParam = at::vec::(anonymous namespace)::Vectorize<c10::complex<float> >
[ FAILED ] Arithmetics/2.Division, where TypeParam = at::vec::(anonymous namespace)::Vectorize<c10::complex<float> >
[ FAILED ] Arithmetics/3.Multiplication, where TypeParam = at::vec::(anonymous namespace)::Vectorize<c10::complex<double> >
[ FAILED ] Arithmetics/3.Division, where TypeParam = at::vec::(anonymous namespace)::Vectorize<c10::complex<double> >
[ FAILED ] QuantizationTests/0.Quantize, where TypeParam = at::vec::(anonymous namespace)::Vectorize<c10::qint8>
[ FAILED ] QuantizationTests/1.Quantize, where TypeParam = at::vec::(anonymous namespace)::Vectorize<c10::quint8>
For the complex tests, I see that the issue is accuracy, or rather, the lack of a tolerance in the comparison.
Can you give me permission to push to your repo so we can collaborate there? I am going to review your code again more precisely, and in case I find something, I can create a new branch and then a PR to the only_vec branch. Thank you!
I am having both Vec512 and Vec256 and unifying them in vec.h as follows
Eventually, we'd have to use USE_AVX512, just as you did, but as for the rest, it'd be great to get advice from @ezyang, @VitalyFedyunin, and @malfet. It looks good to me, though.
can you give me permision to push to your repo so we can collaborate there?
Sure! :) I just sent you an invite.
I compiled and run your branch and 2 tests are failing. Namely: ProcessGroupMPITest and vec_test_all_types_AVX512. These vector tests fail:
Thanks for letting me know, @limo1996! I just verified that those 4 complex tests in vec_test_all_types_AVX512 pass with a DEBUG build, but fail with a non-debug build. All 257 tests pass with the debug build. I was wrong about it being a regression. Do you have any idea as to why that could happen? I'm yet to debug.
I fixed a typo earlier today on GitHub.com to fix the other 2 tests.
I'll look into ProcessGroupMPITest. Thanks!
EDIT: This one's unrelated.
@ezyang
Currently, the following CI checks are failing for #56992, but I'm unable to reproduce the issues locally.
FWIW, the CircleCI machine running pytorch_macos_10_13_py3_test runs macOS on an old Xeon E5-2697 server with only AVX-1.0 support (but since the PR disables AVX, the default ATen CPU capability is used instead).
| CI check | Test name | Failure cause |
|---|---|---|
| pytorch_macos_10_13_py3_test | test_lkj_cholesky_log_prob | Expected value != Actual value |
Since this issue doesn't reproduce at my end, is it possible for you to grant me permission to debug it on CircleCI via SSH? Thank you!
UPDATE:
In #58379, I proved that with ATEN_CPU_CAPABILITY=default, this test fails anyway, so it can simply be skipped on macOS.
However, I'll create an issue to track it.
Hello @peterjc123, would you be able to help with Windows builds in #56992? Thank you!
Hello @ezyang, @malfet, @VitalyFedyunin, @limo1996
Please let me know if this quantization benchmark, based on one written by @jamesr66a, seems okay: https://gist.github.com/imaginary-person/111e9ece31a4c754bde3c41936cae496. The results are also at this link.
On a dual-socket Intel Xeon 6142, quint8 & qint8 addition on AVX512 was significantly faster than on AVX2. However, for qint32, AVX2 & AVX512 exhibit similar performance, with AVX512 being slightly faster.
For float addition, though, AVX2 was a bit faster than AVX512 when multiple threads were used. But I hadn't used taskset to ensure that distinct physical cores were used. By default, PyTorch only ensures that OMP_NUM_THREADS is equal to the number of physical cores. It doesn't, however, ensure that the threads are assigned to different physical cores. A user can do so using taskset, etc., so I think single-threaded performance numbers are more relevant here.
As an aside, if I turned off one NUMA node & hyperthreading, then float addition became faster than it was earlier. Quantized addition, however, became slower than earlier if one NUMA node & hyperthreading were turned off. I'll try to dig deeper into the reasons for these results with perf. Thanks!
The benchmark looks reasonable to me, but @jamesr66a would know best!
🚀 Feature
Introduce Vec512 into pytorch to support AVX512
Motivation
In this issue, I would like to discuss the introduction of 512-bit vectors into the ATen backend. In the private PyTorch fork of the company I am working for, I successfully introduced and tested them (though not for the x86 architecture). Therefore, if there is a general consensus that Vec512 is needed, I can try to add them.
Pitch
Mirror the implementation of Vec256, i.e. add an aten/src/ATen/cpu/vec512 folder with a base class vec512_base.h which will implement scalar versions of the ops. Then I would add classes which vectorize the base class for the specific vector element type, i.e. vec512_double etc. I would also add a unifying header file vec.h with VecType<> aliasing either Vec256 or Vec512 depending on the CPU capability. Lastly, I would replace all usages of Vec256 with the generic VecType and double-check that all kernels (and tests) using it rely on the vector size and don't have it hardcoded.

Let me know what you guys think! I am open to any ideas :)
Alternatives
As far as I know, there is currently no way to use 512-bit vectors inside the ATen backend.
Additional context