intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

[CUDA] [SYCL] -ffast-math causes nvptx codegen error and/or failed llvm link #7954

Open lfmeadow opened 1 year ago

lfmeadow commented 1 year ago

Describe the bug

The -ffast-math switch results in backend failures and/or llvm link failures when some double-precision std:: math functions are used. This was discovered compiling LAMMPS with Kokkos using SYCL for CUDA.

To Reproduce

login29:jack$ cat exp.cpp
#include <iostream>
#include <cstdlib>
#include <sycl/sycl.hpp>
using namespace sycl;
#include <cmath>

// Create an exception handler for asynchronous SYCL exceptions
static auto exception_handler = [](sycl::exception_list e_list) {
  for (std::exception_ptr const &e : e_list) {
    try {
      std::rethrow_exception(e);
    }
    catch (std::exception const &e) {
#if _DEBUG
      std::cout << "Failure" << std::endl;
#endif
      std::terminate();
    }
  }
};

int
main()
{
  auto d_selector{default_selector_v};
  const int N = 1024;
  std::vector<double> in(N), out(N);
  std::srand(1234);
  for (int i = 0; i < N; ++i)
    in[i] = std::rand() / (double) RAND_MAX;
  double *d_in = in.data(), *d_out = out.data();
  queue q(d_selector, exception_handler);
  range num_items{N};
  auto e =
    q.parallel_for(num_items, [=](auto i) {
      //d_out[i] = std::exp(std::sin(d_in[i]) + std::cos(d_in[i]));
      d_out[i] = std::sin(d_in[i]) + std::cos(d_in[i]);
    });
  e.wait();
  std::cout << out[0];
}
=====
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 --cuda-path=$CUDATOOLKIT_HOME exp.cpp -o exp -ffast-math
clang-16: warning: CUDA version 11.7 is only partially supported [-Wunknown-cuda-version]
fatal error: error in backend: Cannot select: t11: f64 = fsin nnan ninf nsz arcp contract afn reassoc t10
  t10: f64,ch = load<(load (s64) from %ir.arrayidx.i.i, !tbaa !65, addrspace 1)> t0, t7, undef:i64
    t7: i64 = add t2, t6
      t2: i64,ch = CopyFromReg t0, Register:i64 %1
        t1: i64 = Register %1
      t6: i64 = shl t4, Constant:i32<3>
        t4: i64,ch = CopyFromReg t0, Register:i64 %2
          t3: i64 = Register %2
        t19: i32 = Constant<3>
    t9: i64 = undef
In function: _ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZ4mainEUlT_E_EE
llvm-foreach:
clang-16: error: clang frontend command failed with exit code 70 (use -v to see invocation)
clang version 16.0.0 (https://github.com/intel/llvm d164fd946341dba28d6759aa2938161ce0e83647)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /pscratch/sd/l/lfmeadow/llvm-build/install/bin
clang-16: note: diagnostic msg: Error generating preprocessed source(s).

If std::exp is called instead, a different error message appears.

Environment:

linux nvidia A100 CUDATOOLKIT_HOME=/opt/nvidia/hpc_sdk/Linux_x86_64/22.5/cuda/11.7 on Perlmutter.


zjin-lcf commented 1 year ago

Are there errors when std:: is replaced with sycl::?

lfmeadow commented 1 year ago

It links with sycl::. I'll see if I can patch LAMMPS. This is going to be a real pain given the way Kokkos is structured; I can hack it, but I don't know how to do it properly. Isn't this just a compiler bug?
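
For reference, a minimal sketch of that swap in the reproducer above (only the kernel body changes):

  auto e =
    q.parallel_for(num_items, [=](auto i) {
      // sycl::sin/sycl::cos resolve to the SYCL built-in math functions,
      // which have a CUDA device lowering, unlike the std:: calls that
      // become @llvm.sin.f64 / @llvm.cos.f64 under -ffast-math.
      d_out[i] = sycl::sin(d_in[i]) + sycl::cos(d_in[i]);
    });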

lfmeadow commented 1 year ago

I took a page from Ruyman's https://github.com/Ruyk/lammps.git (commit d74d7cfd5f1aedf9dfad57b8b3412802fbb3263f) and just brute-forced sin, cos, pow, exp, and sqrt to use the Kokkos::Experimental namespace, roughly as sketched below. So I guess this issue has always been there. It seems very onerous for the user.
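
An illustrative sketch of that brute-force redirection (the actual change is in the commit referenced above):

  // Illustrative sketch: pull the Kokkos math wrappers into scope so that
  // unqualified sin/cos/pow/exp/sqrt calls dispatch to the SYCL built-ins
  // on the SYCL backend instead of the std:: versions.
  using Kokkos::Experimental::sin;
  using Kokkos::Experimental::cos;
  using Kokkos::Experimental::pow;
  using Kokkos::Experimental::exp;
  using Kokkos::Experimental::sqrt;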

zjin-lcf commented 1 year ago

The same issue was reported in https://github.com/intel/llvm/issues/7344

ldrumm commented 1 year ago

Reduced. I'll take a look at this:

target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

define dso_local void @foo(double* %f) {
entry:
  %0 = call double @llvm.sin.f64(double 0x7FF8000000000000)
  store double %0, double* %f, align 8
  ret void
}

declare double @llvm.sin.f64(double) #0

attributes #0 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
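
Assuming the snippet above is saved as reduced.ll, the same "Cannot select" failure should be reproducible with llc alone (the target triple is taken from the file):

  llc reduced.ll -o /dev/null
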
zjin-lcf commented 1 year ago

SYCL with HIP support sees similar issues. Thanks.

ldrumm commented 1 year ago

This is pretty interesting: CUDA doesn't generate code that would contain @llvm.sin.f64 by the time it gets to the backend. However, in normal C++ (and I'm including SYCL in that definition) most of the mathematical functions will end up being lowered to an LLVM intrinsic (for sin, it's @llvm.sin.f(32|64)). The NVPTX backend has a couple of instruction definitions for these ISD nodes, but it seems that there are significant gaps in the ISelLowering implementation, and lots of missing patterns.
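
For example, a host-side sketch of the mechanism (assuming clang with -ffast-math, which implies -fno-math-errno):

  #include <cmath>
  // With -ffast-math the libm call is treated as errno-free, so clang
  // canonicalizes it to the @llvm.sin.f64 intrinsic in IR; the backend
  // then has to be able to select an f64 fsin node.
  double f(double x) { return std::sin(x); }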

There's even this strange loop:

  for (const auto &Op :
       {ISD::FDIV, ISD::FREM, ISD::FSQRT, ISD::FSIN, ISD::FCOS, ISD::FABS}) {
...
    setOperationAction(Op, MVT::f64, Legal);
...

This loop tells the ISel lowering that f64 is supported in hardware for the sin, cos, and abs ISD opcodes (among others). However, there is no PTX instruction that matches that behaviour, because the PTX ISA only supports sin for .f32. Of course, there is also no TableGen pattern or custom lowering defined, because the PTX assembler would rightly reject the result.
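
A minimal sketch of the direction a fix could take (hypothetical, not the actual patch): stop claiming f64 is Legal for these opcodes, so that legalization expands them instead:

  // Hypothetical: PTX only has sin.approx.f32/cos.approx.f32, so mark
  // only f32 as Legal and let f64 be expanded (e.g. to a libcall).
  for (const auto &Op : {ISD::FSIN, ISD::FCOS}) {
    setOperationAction(Op, MVT::f32, Legal);
    setOperationAction(Op, MVT::f64, Expand);
  }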

I've started collating a list of the maths functions that should generate direct PTX ISA instructions, and of those that require lowering to a libcall, both depending on whether we accept the lower precision of -ffast-math. I'll post a patch to upstream LLVM soon.

Thanks for the great reproducer, @lfmeadow