shibatch / sleef

SIMD Library for Evaluating Elementary Functions, vectorized libm and DFT
https://sleef.org
Boost Software License 1.0
628 stars 126 forks source link

A few API functions were (accidentally?) removed in 3.6 #534

Closed musicinmybrain closed 2 months ago

musicinmybrain commented 5 months ago

Although sleef 3.6 is supposed to be an ABI-compatible update for 3.5.1 (the SONAME version is still 3), comparing the main shared library using fedabipkgdiff from libabigail shows that a handful of symbols were removed. Of these, Sleef_sincospil_u05 and Sleef_sincospil_u35 were documented API functions.

Comparing the ABI of binaries between sleef-3.5.1-33.fc41.x86_64.rpm and sleef-3.6-1.fc41.x86_64.rpm:

================ changes of 'libsleef.so.3.5.1'===============
  Functions changes summary: 4 Removed, 67 Changed (1796 filtered out), 148 Added functions
  Variables changes summary: 0 Removed, 0 Changed, 0 Added variable

  4 Removed functions:

    [D] 'function const Sleef_longdouble2 Sleef_sincospil_u05(long double)'    {Sleef_sincospil_u05}
    [D] 'function const Sleef_longdouble2 Sleef_sincospil_u35(long double)'    {Sleef_sincospil_u35}
    [D] 'function const Sleef_quad2 Sleef_sincospiq_u05(Sleef_quad)'    {Sleef_sincospiq_u05}
    [D] 'function const Sleef_quad2 Sleef_sincospiq_u35(Sleef_quad)'    {Sleef_sincospiq_u35}

This is on x86_64; I haven’t checked other architectures.

In a quick glance at the source code, it wasn’t immediately obvious to me why these symbols disappeared. I do notice that they are all at the end of rename.h, after a comment line:

https://github.com/shibatch/sleef/blob/a99491afee2bae0b11e9ffbf3211349f43a5fd10/src/libm/rename.h#L175-L181

While nothing in rename.h has changed between 3.5.1 and 3.6, maybe this was affected by changes in mkrename.c?

musicinmybrain commented 5 months ago

I bisected this to 1d66bbdadfb4ade1f09fd290a2f535cc59dae8aa.

The changes in sleeflibm_header.h.org.in look suspicious:

https://github.com/shibatch/sleef/commit/1d66bbdadfb4ade1f09fd290a2f535cc59dae8aa#diff-6a6c4d54f858f55978b29ae41b00d4a8b484ddf7908a136abd4f9a003458d2c3L312

blapie commented 5 months ago

Good point. It seems that scalar long double/quad sincospi implementation was removed along with the long double DFT (where it was used). I agree that this should have triggered a new major version. I would be keen to re-integrate the function as it can still be useful to users.

Maybe a bit of background from @shibatch could help?

What would be the process to fix 3.6? Would we simply have to create a new patch release?

I will be officially AFK until April 22 (sabbatical), but @joanaxcruz, @joeramsay and @shibatch might be able to follow up on this one.

musicinmybrain commented 4 months ago

I’m going to hold off on updating the Fedora package past 3.5.1 for up to another month or two, in the hope that it will still be possible to avoid an ABI break (and update Sleef in existing stable releases).

blapie commented 3 months ago

Hi! Sorry about the delay in responding to that. We are planning a patch release including a fix for that in the next month. Will post more in discussions.

musicinmybrain commented 3 months ago

Thanks for the update!

blapie commented 2 months ago

@musicinmybrain We are about to merge #545 which should restore the missing symbols. Please re-open issue if needed, it should close automatically on merge.

musicinmybrain commented 2 months ago

I can confirm that applying https://github.com/shibatch/sleef/pull/545.patch to the 3.6 release restores the missing functions, at least on x86_64 – I didn’t take the time to test other architectures.

blapie commented 2 months ago

Thanks! This should be cross platform, tested it on both x86 and aarch64.

musicinmybrain commented 2 months ago

Since PyTorch still needed a const-related change for sleef 3.6 (basically https://github.com/pytorch/pytorch/commit/3c2b5d4b02c45b169a3f1e4f53235cc6d5829711, but actually implemented and merged as a much larger commit https://github.com/pytorch/pytorch/commit/56451cd49d9cf94b49197e09dec13426bb1a5370), I think I’m still going to keep stable Fedora releases (F40, F39) and EPELs (EPEL9/8/7) at 3.5.1 as a precaution – but I see no obstacle to updating Rawhide to a patched 3.6 or to the upcoming 3.6.1, for inclusion in Fedora 41 and later.

blapie commented 2 months ago

Sounds sensible! Do you know if there is anything we can/should do to solve the PyTorch issue? And should that be integrated to 3.6.1 or 3.7?

musicinmybrain commented 2 months ago

Sounds sensible! Do you know if there is anything we can/should do to solve the PyTorch issue? And should that be integrated to 3.6.1 or 3.7?

To be honest, I haven’t studied this closely enough to give really high-quality advice. However, at a glance, from the compiler errors without the PyTorch change, e.g.

In file included from /builddir/build/BUILD/python-torch-2.3.0-build/pytorch-v2.3.0/aten/src/ATen/cpu/vec/vec256/vec256.h:12,
                 from /builddir/build/BUILD/python-torch-2.3.0-build/pytorch-v2.3.0/aten/src/ATen/cpu/vec/vec.h:6,
                 from /builddir/build/BUILD/python-torch-2.3.0-build/pytorch-v2.3.0/aten/src/ATen/cpu/vec/functional_base.h:6,
                 from /builddir/build/BUILD/python-torch-2.3.0-build/pytorch-v2.3.0/aten/src/ATen/cpu/vec/functional.h:3,
                 from /builddir/build/BUILD/python-torch-2.3.0-build/pytorch-v2.3.0/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp:12,
                 from /builddir/build/BUILD/python-torch-2.3.0-build/pytorch-v2.3.0/build/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp.AVX2.cpp:1:
/builddir/build/BUILD/python-torch-2.3.0-build/pytorch-v2.3.0/aten/src/ATen/cpu/vec/vec256/vec256_bfloat16.h: In instantiation of ‘at::vec::AVX2::Vectorized<T> at::vec::AVX2::Vectorized16<T>::log() const [with T = c10::BFloat16]’:
/builddir/build/BUILD/python-torch-2.3.0-build/pytorch-v2.3.0/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp:141:3:   required from here
  156 |                 return (x_vec / (kOneVec - x_vec)).log();
      |                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
/builddir/build/BUILD/python-torch-2.3.0-build/pytorch-v2.3.0/aten/src/ATen/cpu/vec/vec256/vec256_bfloat16.h:482:16: error: invalid conversion from ‘__m256 (*)(__m256)’ to ‘const __m256 (*)(__m256)’ [-fpermissive]
  482 |     return map(Sleef_logf8_u10);
      |                ^~~~~~~~~~~~~~~
      |                |
      |                __m256 (*)(__m256)
/builddir/build/BUILD/python-torch-2.3.0-build/pytorch-v2.3.0/aten/src/ATen/cpu/vec/vec256/vec256_bfloat16.h:268:42: note:   initializing argument 1 of ‘at::vec::AVX2::Vectorized<T> at::vec::AVX2::Vectorized16<T>::map(const __m256 (*)(__m256)) const [with T = c10::BFloat16; __m256 = __m256]’
  268 |   Vectorized<T> map(const __m256 (*const vop)(__m256)) const {
      |                     ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~

it looks like using __attribute__ ((const)) instead of declaring the return values const changed the signatures of a number of functions so that function-pointer declarations in PyTorch, e.g.

https://github.com/pytorch/pytorch/blob/8dd1be49b7c17223ed6de18fac4471abc6735aae/aten/src/ATen/cpu/vec/vec256/vec256_bfloat16.h#L268

became incompatible. Now they look like:

https://github.com/pytorch/pytorch/blob/543a870943120484db547382ed9ca9538a40f284/aten/src/ATen/cpu/vec/vec256/vec256_bfloat16.h#L308

Presumably this would affect anything that’s taking pointers to sleef functions without using C++/C23 auto or similar to declare their types.

It’s unfortunate that this technically changed the API, because __attribute__ ((const)) makes much more sense.

If you try to fix this in sleef somehow, it’s probably a good idea to make sure to examine the PyTorch workaround and/or coordinate with the PyTorch people to make sure that the fix doesn’t break PyTorch again.