AVX512 and Vec512 #56187

Open limo1996 opened 3 years ago

limo1996 commented 3 years ago

πŸš€ Feature

Introduce Vec512 into pytorch to support AVX512

Motivation

In this issue I would like to discuss the introduction of 512-bit vectors into the ATen backend. In the private PyTorch fork of the company I work for, I successfully introduced and tested them (though not for the x86 architecture). Therefore, if there is a general consensus that Vec512 is needed, I can try to add it.

Pitch

Mirror the implementation of Vec256, i.e. add an aten/src/ATen/cpu/vec512 folder with a base class vec512_base.h that implements a scalar version of the ops. Then I would add classes that vectorize the base class for specific vector element types, i.e. vec512_double etc. I would also add a unifying header file vec.h with a VecType<> alias resolving to either Vec256 or Vec512 depending on the CPU capability. Lastly, I would replace all usages of Vec256 with the generic VecType and double-check that all kernels (and tests) using it rely on the vector size and don't have it hardcoded, as in the sketch below.
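
A minimal sketch of what a size-agnostic kernel loop could look like with such an alias (the VecType name and the Vec256-style calls are illustrative assumptions, not code from an actual patch):

#include <cstdint>

template <typename scalar_t>
void add_one(scalar_t* data, int64_t n) {
  using Vec = VecType<scalar_t>;              // resolves to Vec256<scalar_t> or Vec512<scalar_t>
  int64_t d = 0;
  for (; d + Vec::size() <= n; d += Vec::size()) {
    Vec v = Vec::loadu(data + d);             // unaligned vector load
    (v + Vec(scalar_t(1))).store(data + d);   // vector add, then store back
  }
  for (; d < n; ++d) {                        // scalar tail for the remainder
    data[d] += scalar_t(1);
  }
}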

Let me know what you guys think! I am open to any ideas :)

Alternatives

As far as I know there is no way of using 512b vectors inside ATen backend right now.

Additional context

limo1996 commented 3 years ago

@ezyang

ezyang commented 3 years ago

We're not fundamentally opposed to AVX512, but it will be difficult to determine when we can actually safely turn it on because on Intel it limits max turbo frequency and so can be a pessimization in some cases; additionally, sometimes the AVX512 FLOPS are about the same as AVX2 and so not an improvement either. That being said, I don't think it's a big problem if we start adding infrastructure for it, so yes, I think we'd accept patches for this.

cc @ngimel @malfet @VitalyFedyunin @dzhulgakov

limo1996 commented 3 years ago

Cool! I have two questions regarding the process: should I work on my fork or on a branch here? And should I create one huge pull request in the end, or do it incrementally? Thanks.

ezyang commented 3 years ago

Incremental is better; you could use https://github.com/ezyang/ghstack/ (I may need to give you perms to push branches). Testing strategy is going to be the first and foremost question reviewers will have.

limo1996 commented 3 years ago

Ok. I will try to use ghstack for managing multiple dependent PRs. You mean permission to push branches to pytorch, right? Yes, can you please give me that permission so I can start working on it? Thanks. Regarding testing: unit tests are trivial since they already exist for Vec256 and are generic enough to handle Vec512 if tweaked a bit. Testing kernels that use Vec512 will be more problematic. There we could test that each kernel produces the same output for both Vec256 and Vec512. If there is already a test suite for kernels, then we don't need to do that.

ezyang commented 3 years ago

Open a dummy PR to pytorch repo first so CLA bot can ask you for a CLA, then I can give you rights.

There we could test that each kernel produces the same output for both Vec256 and Vec512. If there is already a test suite for kernels, then we don't need to do that.

We have a test suite for kernels, but it's not going to test AVX512 by default without prodding.

limo1996 commented 3 years ago

Done

VitalyFedyunin commented 3 years ago

If you already have some code, could you please make a DRAFT PR and CC me on it, so we can provide you with early feedback? Ideally, we would love to have vec256 become vec, with support for various architectures.

limo1996 commented 3 years ago

My idea was to have a unifying header for vec256 and vec512 so that in the end only vec is exposed. Something like:

#if defined(CPU_CAPABILITY_AVX512)
template <typename T>
using vec = typename at::vec512::Vec512<T>;
#elif defined(CPU_CAPABILITY_AVX2)
template <typename T>
using vec = typename at::vec256::Vec256<T>;
#endif 

where vec512 would implement the same functionality as vec256. What do you think?

I will push some code as soon as my company approves the CLA.

imaginary-person commented 3 years ago

Hello @ezyang, I also wanted to work on this a few weeks back, but I searched old issues with strings such as AVX512 & vectorization, and found an issue which was closed stating PyTorch already supports AVX512, so I thought AVX512 vectorization was somehow supported implicitly. That's why I thought that, as a vectorization project, I could only enable float16 on AVX2 (and AVX512 implicitly). 😞

Enabling AVX512 isn't very time-consuming, as in most places, one can simply replace 256 with 512 & 32 with 64. For BFloat16, though, some newer intrinsics can potentially be used.
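
To illustrate the kind of mechanical renaming involved (a hedged example using standard intrinsics, not code from any PR), a 256-bit load/store and its 512-bit counterpart differ only in the type and intrinsic prefix, with the byte width doubling from 32 to 64:

#include <immintrin.h>

void copy8_avx2(const float* src, float* dst) {
  __m256 v = _mm256_loadu_ps(src);    // 32-byte load (8 floats)
  _mm256_storeu_ps(dst, v);
}

void copy16_avx512(const float* src, float* dst) {
  __m512 v = _mm512_loadu_ps(src);    // 64-byte load (16 floats)
  _mm512_storeu_ps(dst, v);
}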

May I assist with this initiative? I'm very excited about it!

We're not fundamentally opposed to AVX512, but it will be difficult to determine when we can actually safely turn it on because on Intel it limits max turbo frequency and so can be a pessimization in some cases; additionally, sometimes the AVX512 FLOPS are about the same as AVX2 and so not an improvement either.

If workloads are mostly AVX512, then it'd indeed be faster. Also, as you already know, a complete cache line is used in AVX512 computations, so that makes computation more memory-bound than AVX2, so to speak. I think that's why Intel Ice Lake & Tiger Lake processors have 48 KB L1D caches instead of the usual 32 KB, so AVX512 is even faster on them! Starting with its 11th gen, Intel is even churning out low-end laptop processors with AVX512 support, and even these have 48 KB L1D caches! So, it's probably time to jump on the AVX-512 bandwagon.

The Intel HW prefetcher can't prefetch across page boundaries, so determining a prefetch distance & using prefetch compiler builtins (or assembly) can further speed up AVX512 computation in some cases.
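
For instance, here is a hedged sketch of software prefetching ahead of a streaming AVX512 loop using GCC/Clang's __builtin_prefetch (the distance of 128 floats, i.e. 512 bytes, is only an illustrative guess; the right distance is workload- and hardware-dependent):

#include <immintrin.h>
#include <cstdint>

void scale_avx512(const float* src, float* dst, int64_t n, float a) {
  const __m512 va = _mm512_set1_ps(a);
  int64_t i = 0;
  for (; i + 16 <= n; i += 16) {
    __builtin_prefetch(src + i + 128);                 // prefetch ~512 bytes ahead
    __m512 v = _mm512_loadu_ps(src + i);
    _mm512_storeu_ps(dst + i, _mm512_mul_ps(v, va));
  }
  for (; i < n; ++i) {                                 // scalar tail
    dst[i] = src[i] * a;
  }
}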

ezyang commented 3 years ago

found an issue which was closed stating PyTorch already supports AVX512, so I thought AVX512 vectorization was somehow supported implicitly

@imaginary-person Huh, that's not my recollection, and is not consistent with #26109 and #4825 which are two older issues on the subject.

May I assist with this initiative? I'm very excited about it!

Sure! Though I suppose you and @limo1996 will have to figure out how to coordinate patches.

imaginary-person commented 3 years ago

@imaginary-person Huh, that's not my recollection, and is not consistent with #26109 and #4825 which are two older issues on the subject.

Ah, you're right! I had misread #51466. I missed an opportunity there 😞

Sure! Though I suppose you and @limo1996 will have to figure out how to coordinate patches.

Thank you! :)

limo1996 commented 3 years ago

@imaginary-person I am working on a draft PR. Will let you know once it's ready and we can start to iterate on it. Thanks!

VitalyFedyunin commented 3 years ago

We do use AVX512 when OneDNN kernels are called. However all point-wise operators (and everything else based on Vec256) are still on AVX2.

Bear in mind that adding AVX512 kernels will require us either to make it a separate compile option or to increase the binary size to ship 4 types of kernels simultaneously (with properly updated dispatch mechanics).
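
For context, a minimal sketch of the runtime-dispatch idea (this is not the actual ATen DispatchStub machinery, just an illustration): the same kernel source is compiled several times with different vectorization flags, and a function pointer is chosen once at runtime based on CPU capability.

#include <cstdint>

// Separate translation units built with -mavx2 / -mavx512f etc. would provide these.
void add_kernel_default(float* out, const float* a, const float* b, int64_t n);
void add_kernel_avx2(float* out, const float* a, const float* b, int64_t n);
void add_kernel_avx512(float* out, const float* a, const float* b, int64_t n);

using add_fn = void (*)(float*, const float*, const float*, int64_t);

add_fn pick_add_kernel() {
  if (__builtin_cpu_supports("avx512f")) return add_kernel_avx512;  // GCC/Clang builtin
  if (__builtin_cpu_supports("avx2"))    return add_kernel_avx2;
  return add_kernel_default;
}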

limo1996 commented 3 years ago

I think having some option like USE_VEC512 would be a good start. Later on we can try to ship it simultaneously.

ezyang commented 3 years ago

@malfet suggests that we might want to just kill AVX (no suffix), since most reasonable users will have AVX2.

limo1996 commented 3 years ago

Makes sense to me. I currently don't see a reason for having CPU_CAPABILITY_AVX as it's always used in or with CPU_CAPABILITY_AVX2.

imaginary-person commented 3 years ago

@ezyang, I remember someone posted an issue last month in which they mentioned they were using an old Xeon server for production that didn't have AVX2! You had also replied to that post. There might be more such users, so maybe a deprecation notice could be provided for their benefit now, perhaps via a sticky post, and then a brief mention in the release notes upon release? Thanks!

ezyang commented 3 years ago

54684, you mean?

TBH, the user in question didn't even care about perf, it was a reproducibility problem. But sure, I made a post to https://dev-discuss.pytorch.org/t/dropping-avx-support-avx2-only/202

imaginary-person commented 3 years ago

Thank you, @ezyang! That's the one. They weren't concerned about performance, but they mentioned that the server without AVX2 belonged to a cloud computing platform. So, I thought such users might suddenly encounter perf problems with a newer PyTorch release if they continue using old machines.

imaginary-person commented 3 years ago

Ideally, we would love to have vec256 become vec, with support for various architectures.

@VitalyFedyunin @limo1996

I think that a cleaner way to have a single namespace vec is:

  1. Creating aten/src/ATen/cpu/vec with subdirectories vec256 and vec512.
  2. Renaming the vec256 namespace to vec. The files in the vec512 directory would also use the vec namespace.
  3. Renaming Vec256 (and Vec512) to a common name, say Vec, and then replacing Vec256 with Vec everywhere in the current codebase.
  4. The directory aten/src/ATen/cpu/vec would have vec.h and functional.h files, apart from subdirectories vec256 and vec512. For example, if AVX512 is to be enabled, vec.h can simply have the following content:
    #if defined(CPU_CAPABILITY_AVX512)
    #include <ATen/cpu/vec/vec512/vec512.h>
    #else
    #include <ATen/cpu/vec/vec256/vec256.h>
    #endif
  5. vec.h should be included wherever vec256.h has been included (apart from vec.h, and the files in the vec256 subdirectory, of course).

A draft PR (#56992) has been created to illustrate.

Please share your opinion. Thanks!

Bear in mind that adding AVX512 kernels will require us either to make it a separate compile option or to increase the binary size to ship 4 types of kernels simultaneously (with properly updated dispatch mechanics)

Thanks to the way compilation is performed, this approach works for shipping all types of kernels (no-vec, AVX2, and AVX512, since I disabled AVX in that draft PR).

imaginary-person commented 3 years ago

Please confirm which AVX-512 instruction sets should be initially supported. For instance, the Caffe2 perfkernels seem to use the AVX512F, AVX512VL, and AVX512DQ instruction sets. The latest Intel Xeon server I have access to is actually an old one from Q3'17 (Intel Xeon Gold 6142), which supports AVX512F, AVX512BW, AVX512CD, AVX512DQ, and AVX512VL.

However, I don't have access to newer Intel machines (AMD currently doesn't seem to offer AVX512 support, but will do so in the near future) that support newer AVX512 instruction sets, such as the AVX512-VBMI category (e.g. _mm512_permutex2var_epi8). Should we create CPU capability flags for more recent AVX-512 instruction sets such as AVX512-VBMI as well, and use those flags to conditionally use different intrinsics wherever appropriate? Thank you!
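
A hedged sketch of what such a finer-grained flag could look like (CPU_CAPABILITY_AVX512_VBMI is a hypothetical macro here, not an existing PyTorch flag; the fallback emulates the byte permute with a scalar loop):

#include <immintrin.h>
#include <cstdint>

inline __m512i permute2var_bytes(__m512i a, __m512i idx, __m512i b) {
#if defined(CPU_CAPABILITY_AVX512_VBMI)
  return _mm512_permutex2var_epi8(a, idx, b);   // single VBMI instruction
#else
  alignas(64) int8_t av[64], bv[64], iv[64], rv[64];
  _mm512_store_si512(av, a);
  _mm512_store_si512(bv, b);
  _mm512_store_si512(iv, idx);
  for (int i = 0; i < 64; ++i) {
    int j = iv[i] & 127;                        // bit 6 selects between a (0-63) and b (64-127)
    rv[i] = j < 64 ? av[j] : bv[j - 64];
  }
  return _mm512_load_si512(rv);
#endif
}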

imaginary-person commented 3 years ago

Hey @limo1996, I added AVX512 support in ATen via vec512 in the draft PR #56992 (It's working locally but might take me some time to get things sorted on the CI infrastructure). It compiles with AVX512 instruction sets AVX512F, AVX512BW, AVX512DQ, and AVX512VL. Maybe we can combine this draft PR with the one you'd submit.

Right now, I'd like to focus on correctness, before focusing on optimization.

Hey @ezyang,

Testing strategy is going to be the first and foremost question reviewers will have.

We have a test suite for kernels, but it's not going to test AVX512 by default without prodding.

Can you please clarify which test suite you're referring to? Thanks! I'll modify vec256_test_all_types.cpp to test AVX512 as well, and will rename the file to vec_test_all_types.cpp. Also, since both AVX2 & AVX512 use FMA, I was thinking of comparing their respective results for equality. Any advice is most welcome! :) Thank you!
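
A hedged sketch of that comparison idea (illustrative only, not code from any PR): since both paths use hardware fused multiply-add, one 512-bit FMA should match two 256-bit FMAs bit-for-bit over the same 16 lanes:

#include <immintrin.h>
#include <cstring>

// a, b, and c each point to at least 16 floats.
bool fma_avx2_matches_avx512(const float* a, const float* b, const float* c) {
  __m256 lo = _mm256_fmadd_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b), _mm256_loadu_ps(c));
  __m256 hi = _mm256_fmadd_ps(_mm256_loadu_ps(a + 8), _mm256_loadu_ps(b + 8), _mm256_loadu_ps(c + 8));
  __m512 full = _mm512_fmadd_ps(_mm512_loadu_ps(a), _mm512_loadu_ps(b), _mm512_loadu_ps(c));
  float r256[16], r512[16];
  _mm256_storeu_ps(r256, lo);
  _mm256_storeu_ps(r256 + 8, hi);
  _mm512_storeu_ps(r512, full);
  return std::memcmp(r256, r512, sizeof(r256)) == 0;   // expect exact equality
}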

imaginary-person commented 3 years ago

Hello @limo1996, can you please confirm if your company has permitted you to contribute to PyTorch (the ATen AVX512 implementation)?

Since it's been more than a week, is there a possibility that they might not approve? If so, do you think they'd allow you to at least contribute on incremental revisions?

BTW, do you have access to machines that support AVX512-VBMI, and other newer AVX512 instruction sets? The machines available for general-access at my school don't support the latest AVX512 instruction sets.

Thanks!

imaginary-person commented 3 years ago

@ezyang @VitalyFedyunin @limo1996

I think what might make such a PR easier to review is annotating the AVX512 intrinsics used that don't have corresponding AVX2 counterparts.
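
For example (a hedged illustration with standard intrinsics, not code from the PR): AVX512 comparisons and blends operate on k-register masks (__mmask16) rather than vector masks, so a per-lane select that AVX2 writes with _mm256_blendv_ps looks quite different in its AVX512 form:

#include <immintrin.h>

// Select a[i] where the mask is set, b[i] otherwise.
__m256 select_avx2(__m256 mask, __m256 a, __m256 b) {
  return _mm256_blendv_ps(b, a, mask);        // mask is a vector; the sign bit of each lane decides
}

__m512 select_avx512(__mmask16 mask, __m512 a, __m512 b) {
  return _mm512_mask_blend_ps(mask, b, a);    // mask is a 16-bit k-register, one bit per lane
}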

imaginary-person commented 3 years ago

From @apaszke's old comment, it looks like we might have to test on machines with GPUs as well. Perhaps we can run benchmarks to address this concern.

I'd like to complete testing the implementation today, as I'll have access to a Xeon Silver 4114 machine with an Nvidia P100 GPU only tomorrow, before May 7 (on & after May 7, I can run @limo1996's code on those machines, if his company permits him to contribute). As for testing on an AVX512-supporting laptop with a GPU, I can try to get my hands on one. Getting a new laptop from BestBuy & returning it after testing seems unethical, though. 😄

limo1996 commented 3 years ago

@imaginary-person I have it ready (there are todos for some intrinsics) but haven't received approval yet :((

imaginary-person commented 3 years ago

Thanks for the update, @limo1996! Maybe your company sees it as helping the competition (Intel), and doing so with company resources, no less. TBH, I wouldn't blame them if they do.

limo1996 commented 3 years ago

@imaginary-person I am trying my best to get it..

imaginary-person commented 3 years ago

(there are todos for some intrinsics)

@limo1996, can you please merge my implementation locally for those TODOs, if possible? Thanks!

How did you test, BTW, apart from modifying vec256_test_all_types.cpp?

imaginary-person commented 3 years ago

Hey @limo1996, these 4 tests are failing for AVX512 after a little progress I made today:

[  PASSED  ] 253 tests.
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] QuantizationTests/0.Quantize, for qint8
[  FAILED  ] QuantizationTests/0.ReQuantizeFromInt, for qint8
[  FAILED  ] QuantizationTests/1.Quantize, for quint8
[  FAILED  ] QuantizationTests/1.ReQuantizeFromInt, for quint8

Please let me know whether they're passing at your end. In that case, I wouldn't have to read up on quantization in order to debug these. Thanks!

EDIT: I had used incorrect syntax for an intrinsic, due to which these tests were failing.

limo1996 commented 3 years ago

@imaginary-person does _mm512_div_epi64 compile for you? For me, GCC complains that no such function is defined in the intrinsics header file, but I can see it here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_div_epi64&expand=5236,2130

Regarding tests, I replaced Vec256 with Vec512 in vec256_test_all_types.cpp. I think they are good enough as unit tests. I just don't know how to test kernels.

imaginary-person commented 3 years ago

Thanks for your response, @limo1996.

does _mm512_div_epi64 compile for you? For me, GCC complains that no such function is defined in the intrinsics header file

EDIT: I'm using _mm512_div_pd, but not _mm512_div_epi64.

Regarding tests I replaced Vec256 with Vec512 in vec256_test_all_types.cpp.

Just FYI, one of the tests there has hardcoded inputs for 256 bit vectors.

limo1996 commented 3 years ago

Yes, it works for me. Which gcc version are you using, BTW? Some intrinsics are not present in older gcc versions.

Hmm. Ok I will try to update my gcc version and see..

Just FYI, one of the tests there has hardcoded inputs for 256 bit vectors.

Yeah, I know. I hardcoded 512b vectors for now, but this test needs to be adjusted to different vector lengths.
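
A small sketch of one way to make that test length-agnostic (illustrative only; it sizes the input from the vector type rather than hardcoding 8 or 16 lanes):

template <typename Vec, typename scalar_t>
void fill_sequential_input(scalar_t* buf) {
  // Works for both 256-bit and 512-bit widths, since Vec::size() gives the lane count.
  for (int i = 0; i < Vec::size(); ++i) {
    buf[i] = static_cast<scalar_t>(i);
  }
}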

imaginary-person commented 3 years ago

I hardcoded 512b vectors for now

I did so too! I created another one guarded with #if defined(CPU_CAPABILITY_AVX512) :) (as I'm using a unified namespace vec for both vec256 & vec512, and Vec256 & Vec512 have a common name, Vectorize, for now). How about using a unified namespace, BTW, as @VitalyFedyunin suggested?

limo1996 commented 3 years ago

Yeah, I am using vec as the unified namespace and VecType as the currently used vector type. What is your gcc version? Even after updating, _mm512_div_epi64 is not found for me.

imaginary-person commented 3 years ago

What is your gcc version

10.2.0

imaginary-person commented 3 years ago

Even after updating, _mm512_div_epi64 is not found for me.

Sorry for the confusion, @limo1996! I was using github.com on my phone earlier. I just checked & am not using that intrinsic anywhere. Where do you need to use it, BTW?

I'm using _mm512_div_pd, though. BTW, I just pushed updated code from yesterday.

limo1996 commented 3 years ago

Yeah it doesn't work for gcc 10.2 either..

imaginary-person commented 3 years ago

Yeah it doesn't work for gcc 10.2 either..

It wouldn't work on any available gcc version. Where are you using it, BTW?
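
For what it's worth, _mm512_div_epi64 appears to be an SVML intrinsic (a compiler-provided library function rather than a single hardware instruction), which would explain why GCC doesn't declare it. A hedged sketch of a per-lane fallback, assuming nonzero divisors:

#include <immintrin.h>
#include <cstdint>

inline __m512i div_epi64_fallback(__m512i a, __m512i b) {
  alignas(64) int64_t av[8], bv[8], rv[8];
  _mm512_store_si512(av, a);
  _mm512_store_si512(bv, b);
  for (int i = 0; i < 8; ++i) {
    rv[i] = av[i] / bv[i];          // scalar 64-bit division per lane
  }
  return _mm512_load_si512(rv);
}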

imaginary-person commented 3 years ago

we might want to just kill AVX (no suffix), since most reasonable users will have AVX2.

@malfet @ezyang,

We'd disable AVX support in ATen, but should caffe2_perfkernels still retain AVX support? Thanks!

imaginary-person commented 3 years ago

@limo1996, have you tried building locally on Windows (with Visual Studio), and on MacOS? I haven't, but CI builds fail for them, so I guess some additional changes would be required to be able to build in those environments. I don't have access to a Mac with AVX512 support, though.

BTW, the existing unit tests don't test BFloat16 support. We'd have to add it.

ezyang commented 3 years ago

We'd disable AVX support in ATen, but should caffe2_perfkernels still retain AVX support?

Nyeh, it doesn't matter. Kill it if it makes your life easier, leave it alone if that's easier.

imaginary-person commented 3 years ago

All 257 tests are passing for me now! I've pushed the code if you'd like to test locally. @ezyang, no tests for BFloat16 were added when vec256_bfloat16.h was initially added to the codebase by Intel folks. 😞 I'll add a few (only one or two are probably enough to check that conversions work, as the computations happen in float anyway).

BTW, I got access to an Intel Xeon Silver 4114 machine with an Nvidia GPU (to ensure @apaszke's concerns are addressed) sooner than expected, so I'll run benchmarks tomorrow.

limo1996 commented 3 years ago

@imaginary-person My company did not give me permission, so I will work on it, but only after working hours. I finally had time to look over your code and I think there are many similarities. Overall, I have both Vec512 and Vec256 and unify them in vec.h as follows:

#if defined(USE_VEC512)
  #include <ATen/cpu/vec512/vec512.h>
  namespace at { namespace vec {
  using at::vec512::maximum;
  using at::vec512::minimum;
  using at::vec512::fmadd;
  using at::vec512::clamp_min;
  using at::vec512::clamp_max;
  using at::vec512::clamp;
  using at::vec512::convert;
  using at::vec512::convert_to_int32;

  using at::vec512::int_same_size_t;
  using at::vec512::cast;
  using at::vec512::mask_gather;
  using at::vec512::gather;
  using at::vec512::convert_to_int_of_same_size;
  using at::vec512::interleave2;
  using at::vec512::deinterleave2;
  }}
  template <typename T>
  using VecType = typename at::vec512::Vec512<T>;
#else
  #include <ATen/cpu/vec256/vec256.h>
  namespace at { namespace vec {
  using at::vec256::maximum;
  using at::vec256::minimum;
  using at::vec256::fmadd;
  using at::vec256::clamp_min;
  using at::vec256::clamp_max;
  using at::vec256::clamp;
  using at::vec256::convert;
  using at::vec256::convert_to_int32;

  using at::vec256::int_same_size_t;
  using at::vec256::cast;
  using at::vec256::mask_gather;
  using at::vec256::gather;
  using at::vec256::convert_to_int_of_same_size;
  using at::vec256::interleave2;
  using at::vec256::deinterleave2;
  }}
  template <typename T>
  using VecType = typename at::vec256::Vec256<T>;
#endif

What do you think about that?

I compiled and ran your branch, and 2 tests are failing, namely ProcessGroupMPITest and vec_test_all_types_AVX512. These vector tests fail:

[  FAILED  ] 6 tests, listed below:
[  FAILED  ] Arithmetics/2.Multiplication, where TypeParam = at::vec::(anonymous namespace)::Vectorize<c10::complex<float> >
[  FAILED  ] Arithmetics/2.Division, where TypeParam = at::vec::(anonymous namespace)::Vectorize<c10::complex<float> >
[  FAILED  ] Arithmetics/3.Multiplication, where TypeParam = at::vec::(anonymous namespace)::Vectorize<c10::complex<double> >
[  FAILED  ] Arithmetics/3.Division, where TypeParam = at::vec::(anonymous namespace)::Vectorize<c10::complex<double> >
[  FAILED  ] QuantizationTests/0.Quantize, where TypeParam = at::vec::(anonymous namespace)::Vectorize<c10::qint8>
[  FAILED  ] QuantizationTests/1.Quantize, where TypeParam = at::vec::(anonymous namespace)::Vectorize<c10::quint8>

For the complex tests, I see that the issue is accuracy, i.e. the comparison has no tolerance.

Can you give me permission to push to your repo so we can collaborate there? I am going to review your code again more closely, and in case I find something, I can create a new branch and then a PR to the only_vec branch. Thank you!

imaginary-person commented 3 years ago

I have both Vec512 and Vec256 and unify them in vec.h as follows

Eventually, we'd have to use USE_AVX512, just as you did, but as for the rest, it'd be great to get advice from @ezyang, @VitalyFedyunin, and @malfet. It looks good to me, though.

can you give me permision to push to your repo so we can collaborate there?

Sure! :) I just sent you an invite.

I compiled and ran your branch, and 2 tests are failing, namely ProcessGroupMPITest and vec_test_all_types_AVX512. These vector tests fail:

Thanks for letting me know, @limo1996! I just verified that those 4 complex tests in vec_test_all_types_AVX512 pass with a DEBUG build but fail with a non-debug build. All 257 tests pass with the debug build. I was wrong about it being a regression. Do you have any idea why that could happen? I have yet to debug it.

I fixed a typo earlier today on GitHub.com to fix the other 2 tests.

I'll look into ProcessGroupMPITest. Thanks! EDIT: This one's unrelated.

imaginary-person commented 3 years ago

@ezyang

Currently, the following CI check is failing for #56992, but I'm unable to reproduce the issue locally. FWIW, the CircleCI machine running pytorch_macos_10_13_py3_test runs macOS on an old Xeon E5-2697 server with only AVX 1.0 support (but since the PR disables AVX, the default ATen CPU capability is used instead).

CI check: pytorch_macos_10_13_py3_test
Test name: test_lkj_cholesky_log_prob
Failure cause: Expected value != Actual value

Since this issue doesn't reproduce at my end, is it possible for you to grant me permission to debug it on CircleCI via SSH? Thank you!

UPDATE: In #58379, I proved that with ATEN_CPU_CAPABILITY=default, this test fails anyway, so it can simply be skipped on MacOS. However, I'll create an issue to track it.

imaginary-person commented 3 years ago

Hello @peterjc123, would you be able to help with Windows builds in #56992? Thank you!

imaginary-person commented 3 years ago

Hello @ezyang, @malfet, @VitalyFedyunin, @limo1996

Please let me know if this quantization benchmark, based on one written by @jamesr66a, seems okay: https://gist.github.com/imaginary-person/111e9ece31a4c754bde3c41936cae496. The results are also at this link.

On a dual-socket Intel Xeon 6142, quint8 & qint8 addition with AVX512 was significantly faster than with AVX2. However, for qint32, AVX2 & AVX512 exhibit similar performance, with AVX512 being slightly faster.

For float addition, though, AVX2 was a bit faster than AVX512 when multiple threads were used. But I hadn't used taskset to ensure that distinct physical cores were used. By default, PyTorch only ensures that OMP_NUM_THREADS is equal to the number of physical cores; it doesn't, however, ensure that the threads are assigned to different physical cores. A user can do so with taskset, etc., so I think the single-threaded performance numbers are more relevant here.

As an aside, if I turned off one NUMA node & hyperthreading, then float addition became faster than before. Quantized addition, however, became slower when one NUMA node & hyperthreading were turned off.

I'll try to dig deeper into the reasons for the results with perf. Thanks!

ezyang commented 3 years ago

The benchmark looks reasonable to me, but @jamesr66a would know best!