intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.
Other
1.26k stars 739 forks source link

[SYCL] Passing global variable to kernel lambda compiles, produces confusing results #3737

Closed jasonsewall-intel closed 7 months ago

jasonsewall-intel commented 3 years ago

SYCL 2020 (sec 5.4) says:

Variables with static storage duration that are odr-used inside a device function, must be const or constexpr and zero-initialized or constant-initialized

It seems possible to use a variable with static storage duration inside a device function, and furthermore running it yields unexpected results.

To Reproduce

  1. global.cpp:
    
    #include <CL/sycl.hpp>
    #include <cstdio>

using namespace cl::sycl;

int i=-1; int main() { i = 11; queue Q; int *x = malloc_shared(100,Q); Q.parallel_for(10, [=](auto g) { x[g]= i; }); Q.wait(); printf("%d\n", x[0]); return 0; }


2 .compile with:
`$ dpcpp global.cpp -o global`

3. run:

$ ./global -1



Based on the SYCL 2020 spec, I would expect this to fail to compile. Compiling as it does, I would expect it to print 11, which is the last value assigned to `i` on the host before launching the kernel.

**Environment (please complete the following information):**

- OS: Linux, Ubuntu 18.04
- Target device and vendor: Intel GPU (Gen9)
- DPC++ version: dpcpp 2021.2.0.20210317
- Dependencies version: oneAPI r2

**Additional context**
jasonsewall-intel commented 3 years ago

I'll add that I can reproduce this behavior with this nightly: https://github.com/intel/llvm/releases/tag/sycl-nightly%2F20210512

erichkeane commented 3 years ago

There is likely a feature request here to implement the global limitation, as well as perhaps a bug for why 'i' evaluated that way (which would be consistent with lambdas in general). I likely don't have a chance to look at this anytime soon, but if someone else can evaluate this, it would be helpful.

A preprocessed and reduced version that provides as little of the headers as possible would also be most beneficial to figuring out why itis printing -1.

jasonsewall-intel commented 3 years ago

Any suggestions on how to reduce the preprocessed version? Looks like it's about 222k lines long.

erichkeane commented 3 years ago

I've used creduce successfully in the past.

zjin-lcf commented 3 years ago

Using SYCL buffers shows the following error message at compile time:

SYCL kernel cannot use a non-const global variable

"I would expect this to fail to compile"

jasonsewall-intel commented 3 years ago

@zjin-lcf can you elaborate on your comment? How do SYCL buffers figure into this?

jasonsewall-intel commented 3 years ago

@erichkeane My preprocessed file doesn't compile at all, presumably because of device/2-pass magic. Can you share your workflow for extracting preprocessed files that then compile?

zjin-lcf commented 3 years ago

Could you please run the following program ? I just added my observation after building the buffer-style program in my reply. Compiler people will know the cause for your USM-style program. Thanks for the example.

#include <CL/sycl.hpp>
#include <cstdio>
using namespace cl::sycl;

int i=-1;
int main() {
 //i = 11;
 queue q;
 int* x = (int*) malloc (100*sizeof(int));

 {
 buffer<int, 1> d_x(x, 100);

 q.submit([&] (handler &cgh) {
    auto x = d_x.get_access<access::mode::discard_write>(cgh);
    cgh.parallel_for<class test>(nd_range<1>(range<1>(128), range<1>(64)), [=] (nd_item<1> item) {
      int n = item.get_global_id(0);
      if (n < 100) x[n] = i;
       });
    });
}
 printf("%d %d\n", x[0]);
 free(x);
 return 0;
}
erichkeane commented 3 years ago

@erichkeane My preprocessed file doesn't compile at all, presumably because of device/2-pass magic. Can you share your workflow for extracting preprocessed files that then compile?

There IS some funny business that I think @mdtoguchi is just managing recently, but I find that if you open the preprocessed file and remove all but the 1st 'version', you can do a device-only compile that gets you everything you'd want.

mdtoguchi commented 3 years ago

there were a few updates made to preprocessing recently (couple of weeks ago). Basically, the generated preprocessed file is a fat preprocessed file (has both device and host preprocessed files). The compiler should be able to take that and will unbundle it accordingly as long as the -fsycl option is passed. Just be sure that the compiler knows it's a preprocessed file by using the .ii extension or using -x c++-cpp-output.

jasonsewall-intel commented 3 years ago

Well, I don't know if I need to file another issue or what, but here's what I'm doing:

$ clang++  -fsycl -fsycl-unnamed-lambda -P -E global.cpp  -o preprocessed.ii
$ clang++ -fsycl -fsycl-unnamed-lambda -w preprocessed.ii -o global

And the second call produces the following errors (among others):

/tmp/preprocessed-add58c.ii:35274:16: error: 'posix_memalign' is missing exception specification 'throw()'
extern "C" int posix_memalign(void **__memptr, size_t __alignment, size_t __size);
               ^
                                                                                  throw()
/tmp/preprocessed-add58c.ii:19314:12: note: previous declaration is here
extern int posix_memalign (void **__memptr, size_t __alignment, size_t __size)
           ^
/tmp/preprocessed-add58c.ii:77299:3: error: constexpr function never produces a constant expression [-Winvalid-constexpr]
  acos(float __x)
  ^
/tmp/preprocessed-add58c.ii:77300:12: note: subexpression not valid in a constant expression
  { return __builtin_acosf(__x); }
           ^
/tmp/preprocessed-add58c.ii:77303:3: error: constexpr function never produces a constant expression [-Winvalid-constexpr]
  acos(long double __x)

So what should we do to get to the bottom of this?

erichkeane commented 3 years ago

On the first error, that is strange! Apparently posix_memalign is defined in 2 different ways. Can you manually add that in your file?

For the constexpr issues, you can just add -Wno-invalid-constexpr to remove those.

That said, and perhaps @mdtoguchi can explain: It appears that the ' line markers' are missing. Both of those things look like errors that would be ignored in the 'header', but it seems to think they are in the preprocessed file directly.

jasonsewall-intel commented 3 years ago

That said, and perhaps @mdtoguchi can explain: It appears that the ' line markers' are missing. Both of those things look like errors that would be ignored in the 'header', but it seems to think they are in the preprocessed file directly.

Oho, that's the issue! I was using -P -E, because I would normally strip the line markers for this sort of thing. But clearly, they are important! When I remove the -P option from the first command, it compiles. I'll set creduce to work on it.

erichkeane commented 3 years ago

Ah! I missed that in the command line you sent. Yes, that would explain it :) The compiler suppresses all warnings in the 'system headers' as well as SOME errors (particularly with builtins, we aren't as strict at matching and assume the standard headers do what the standard says), so they are often pretty useful.

It'll slow CReduce down slightly, since it will have to keep line markers around until all the 'unused' code is removed, but that seems to be one of the faster bits to get through anyway.

jasonsewall-intel commented 3 years ago

creduce's output (finally!):

// __CLANG_OFFLOAD_BUNDLE____START__ sycl-spir64-unknown-unknown-sycldevice
namespace std {
  template<bool, typename >
    struct a
    ;
  template<typename b>
    struct a<true, b>
    { typedef b c; };
  template<bool d, typename b = void>
    using e = typename a<d, b>::c;
  }
typedef int f;
void  printf (...)
;
namespace cl {
namespace sycl {
namespace detail {
template <bool g, class h >
using e = typename std::a<g, h>::c;
}
template <int = 1> class m {
public:
  m(detail::e<1, f> )  ;
};
namespace detail {
template <typename i > struct aa {
  using ab = i;
};
}
class handler {
  template <typename j, typename k, int ac>
  void ad(m<ac> ,
                                k l) {
    using r =
        int;
    using n =
        typename detail::aa<j>::ab;
     ae<n, r>(l);
  }
  template <typename , typename , typename k>
  __attribute__((sycl_kernel)) void
  o(k l) {
    l(0);
  }
  template <typename j, typename , typename k>
  std::e<!0>
  ae(k l) {
    o<j, int>(l);
  }
public:
  template <typename j , typename k>
  void parallel_for(m<> af, k l) {
    ad<j>(af, l);
  }
};
class queue {
public:
  void wait() ;
  template <typename j , typename k>
  void  parallel_for(m<> af,
                      k l ) {
    p<j>(af, l);
  }
  template <typename j , typename k,
            int ac>
  void  p(
      m<ac> af, k l) {
        [&](handler ag) {
          ag.parallel_for<j>(af,
                                                            l);
        };
  }
};
template <typename h> h *malloc_shared(f , queue ) ;
}
}using namespace cl::sycl;

int i=-1;
int main() {
 i = 11;
 queue Q;
 int *x = malloc_shared<int>(100,Q);
 Q.parallel_for<class mykern>(10, [=](auto g) {
  x[g]= i;
 });
 Q.wait();
 printf("%d\n", x[0]);
 return 0;
}

// __CLANG_OFFLOAD_BUNDLE____END__ sycl-spir64-unknown-unknown-sycldevice

// __CLANG_OFFLOAD_BUNDLE____START__ host-x86_64-unknown-linux-gnu
# 1 "" 3
inline namespace cl {
namespace sycl {
namespace detail {
enum  {
  q = 3};
struct kernel_param_desc_t {
  int ah;
};
template <class > struct w ;
}
}
}
class mykern;
namespace cl { namespace sycl { namespace detail {
kernel_param_desc_t ai[] {
  {},
  {},
   q};
template <> struct w<mykern> {
  static char* s() { return "_ZTSZ4mainE6mykern"; }
  static unsigned t() { return 1; }
  static kernel_param_desc_t z(unsigned ) {
    return ai[2];
  }
  static bool aj() ;
};
}
}
}
namespace std
{
  inline namespace __cxx11 {}
  template< int ak>
    struct al
    {
      static constexpr int am = ak;
    };
  template<bool ak>
    using an = al< ak>;
  template<typename >
    struct aq
    :al<false> {};
  template<typename >
    struct ap
    ;
  template<typename >
    struct as
    ;
  template<typename  >
    struct ar
    ;
    template<bool, typename = void>
    struct a
    ;
  template<typename b>
    struct a<true, b>
    { typedef b c; };
  template<int  , typename , typename >
    struct ao
  ;
  template< 
     typename = void>
    struct bd : al<true>
    {};
  template<typename b>
    b*
    bg(b ) { return ; }
  template<typename >
    class allocator;
template<class >
    struct char_traits;
namespace __cxx11 {
  template<typename aw, typename = char_traits<aw>,
           typename = allocator<aw> >
    class basic_string;
}
}
typedef long unsigned f;
 void* operator new(unsigned long, void* ) ;
    struct at
    {
      using au = int*;
      using bk = f
      ;
    };
namespace std {
namespace __cxx11 {
  template<typename , typename , typename >
    class basic_string
    {
      struct  {};
      at::bk av;
      enum { u = 15 };
 char ba[u ];
    public:
      void 
      operator=(const char* )
      ;
    };
}
}
extern "C" 
void  printf (...);
namespace std {
  template< typename >
    struct ax;
  template<long , typename b>
    using az = typename ax< b>::c;
  template<typename , typename br, typename... ay>
    void 
    bs( br bb, ay&&... bj)
    { bb(bj...); }
  template<typename , typename bc, typename... ay>
    void 
    be(bc&& bx, ay&&... bj)
    {
      using bf =  f;
 bs<bf>( bx,
     bj...);
    }
  template< typename bh>
    struct cc{
      static bh&
      bp(cc& bi) { return bi.cf; }
      bh cf;
    };
  template<unsigned long , typename >
    struct cg;
  template<int bu, typename bh>
    struct cg<bu, bh>
    : cc< bh>
    {};
  template<typename... ci>
    class cj : public cg<0, ci...>
    {};
  template<typename bh>
    struct ax< cj<bh> >
    {
      typedef bh c;
    };
  template<unsigned long bo, typename bh>
    bh&
    ca(cg<bo, bh>& bl) { return cg<bo, bh>::bp(bl); }
  template<int bo, typename... ci>
    az<bo, cj<ci...>>&
    get(cj<ci...>& bl) { return ca(bl); }
 template<typename >
    struct cm
    ;
  template <typename b>
    class v
    {
      template <typename cd >
 struct co
 {
   using c = cd*;
 };
    public:
      using au = typename co<b>::c;
      au& x() { return get<0>(bv); }
      void bq(au ch) {
 x() = ch;
      }
      cj<au> bv;
    };
  template <typename b, typename = cm<b>>
    class bz
    {
      v<b> bv;
    public:
      using au = typename v<b>::au;
      au
      operator->() { return bv.x(); }
      void
      bq(au ch ) {
 bv.bq(ch);
      }
  };
    class bt
    {
      using cu = int;
      cu* x;
      int cx;
    };
    class cy : bt
    {
 cy(cy& ) ;
    };
}
namespace cl {
namespace sycl {
namespace detail {
template <bool g, class h >
using e = typename std::a<g, h>::c;
template <bool , class h, class >
using cz = h;
}
}
}
namespace sycl {
namespace detail {
struct code_location {
  static code_location
  current() {
    return ;
  }
};
}
}
namespace std {
  class ck;
  struct db
  {
    void (ck::*c)();
  };
  struct  bw
  {
    void* de() { return &_M_pod_data; }
    template<typename b>
      b&
      de() const
      {
      void *y; return *static_cast< b*>(y); }
    db _M_pod_data;
  };
  template<typename >
    class function;
  class by
  {
  public:
    template<typename bm>
      class dh
      {
      protected:
 static bm*
 cb(const bw& cn)
     {
        bm bb = cn.de<bm>();
       return bg(bb);
     }
      public:
 static void
 dl(bw& dm, bm bb)
 { new (dm.de()) bm(bb); }
      };
    bw ce;
    bool cp;
  };
  template<typename , typename >
    class cr;
  template<typename cq, typename bm, typename... bn>
    class cr<cq(bn...), bm>
    : public by::dh<bm>
    {
    public:
      static cq
      _M_invoke(const bw& dm, bn... bj)
      {
 auto __trans_tmp_1 = by::dh<bm>::cb(dm);
 be<cq>(*__trans_tmp_1,
         bj...);
      }
    };
  template<typename cq, typename... bn>
    class function<cq(bn...)>
    : by
    {
      template<typename d, typename >
 using _Requires = typename a<d::am>::c;
    public:
      function() : by() {}
      function(function&& )
      ;
      template<typename bm,
        typename = _Requires<an<!bool()> , void>,
        typename = _Requires<bd<> , void>>
 function(bm);
      using _Invoker_type = cq (*)(const bw&, bn...);
      _Invoker_type _M_invoker;
  };
  template<typename cq, typename... bn>
    template<typename bm, typename, typename>
      function<cq(bn...)>::
      function(bm bb)
      {
 typedef cr<cq(bn...), bm> _My_handler;
  _My_handler::dl(ce, bb);
     _M_invoker = _My_handler::_M_invoke;
      }
    struct da
    {
      typedef at::au
        au;
      struct _Vector_impl_data
      {
 au _M_start;
 au _M_finish;
 au _M_end_of_storage;
 _Vector_impl_data() : _M_start( _M_finish)
 {}
      }di;
    };
}
namespace cl {
namespace sycl {
template <class h, class = std::allocator<h>>
using vector_class = std::da;
using cs = std::basic_string<char>;
template <class ct> using function_class = std::function<ct>;
template <class h, class = std::cm<h>>
using unique_ptr_class = std::bz<h>;
template <class > using cv = std::cy;
namespace detail {
template <typename h>
using cw =
    cz<std::as<h>::am,
                  h,
    cz<
        std::ap<h>::am, h,
        cz<std::ar<h>::am,
                      int, h>>>;
template <int dimensions = 1> class array {
public:
  array(e<1, f> )
      {}
  f &operator[](int dimension) {
    return common_array[dimension];
  }
  f common_array[dimensions];
};
}
template <int dimensions = 1> class m : public detail::array<dimensions> {
  using base = detail::array<>;
public:
  m(detail::e<1, f> dk) : base(dk) {}
};
template <int dimensions > class id : detail::array<dimensions> {};
}
}
namespace std {
    class bitset
    {
    public:
      bitset(long ) {}
    };
}
namespace cl {
namespace sycl {
namespace detail {
class dc {
protected:
  dc(
      std::bitset DataLessProps)
      : dd(DataLessProps) {}
  std::bitset dd;
  std::da dn;
};
}
class property_list : detail::dc {
public:
  property_list() :dc(false) {}
};
class exception_list ;
using async_handler = function_class<void(exception_list)>;
class context {
  cv<int> impl;
};
class event {
  cv<int> impl;
};
namespace detail {
using df = int;
using dg = int;
using Requirement =  int;
}
class device {
  cv<int> impl;
};
class device_selector {
public:
  device select_device() const;
  virtual int operator()(const device &) const ;
};
class default_selector : public device_selector {
  int operator()(const device &) const ;
};
namespace detail {
class kernel_impl
  ;
template <typename i, typename > struct aa {
  using ab = i;
};
}
template <bool = true> class dj ;
namespace detail {
class ArgDesc ;
class NDRDescT {
  void setNDRangeLeftover() {
    for (int I ; I < 3; ++I) 
      GlobalSize[I] = 1;
  }
public:
  template <int Dims_> void set(m<Dims_> ) {
    setNDRangeLeftover();
    ac = Dims_;
  }
  m<3> GlobalSize;
  m<3> LocalSize;
  id<3> GlobalOffset;
  m<3> NumWorkGroups;
  f ac;
};
class HostKernelBase {
public:
  virtual char *getPtr() = 0;
};
template <class k, class , int , typename >
class HostKernel : public HostKernelBase {
  k MKernel;
public:
  HostKernel(k Kernel) : MKernel(Kernel) {}
  char *getPtr() { return reinterpret_cast<char *>(&MKernel); }
};
class stream_impl;
class queue_impl;
class CG {
public:
  enum CG_VERSION {
    V1 = 1};
  enum CGTYPE {
    KERNEL };
};
template <typename , typename >
using lambda_arg_type = decltype(0);
}
class handler {
  void setType(detail::CG::CGTYPE ) {
    detail::CG::CG_VERSION Version = detail::CG::V1;
    MCGType = static_cast<detail::CG::CGTYPE>(
        Version);
  }
  void
  extractArgsAndReqsFromLambda(char *, f ,
                               const detail::kernel_param_desc_t *,
                               bool );
  template <typename j, typename k, int ac,
            typename LambdaArgType>
  void StoreLambda(k l) {
    MHostKernel.bq(
        new detail::HostKernel<k, LambdaArgType, ac, j>(
            l));
    using KI = detail::w<j>;
      extractArgsAndReqsFromLambda(MHostKernel->getPtr(), KI::t(),
                                   &KI::z(0), KI::aj);
      MKernelName = KI::s();
     }
  template <typename j, typename k, int ac>
  void ad(m<ac> af,
                                k l) {
    using LambdaArgType = detail::lambda_arg_type<k, dj<>>;
    using r =
        std::ao<std::aq<LambdaArgType>::am ,
                                  dj<>, LambdaArgType>;
    using n =
        typename detail::aa<j, k>::ab;
      MNDRDesc.set(af);
      StoreLambda<n, k, ac, r>
          (l);
      setType(detail::CG::KERNEL);
  }
public:
  template <typename j , typename k>
  void parallel_for(m<> af, k l) {
    ad<j>(af, l);
  }
  cv<detail::queue_impl> MQueue;
  vector_class<vector_class<char>> MArgsStorage;
  vector_class<detail::df> MAccStorage;
  vector_class<detail::dg> MLocalAccStorage;
  vector_class<cv<detail::stream_impl>> MStreamStorage;
  vector_class<cv<void>> MSharedPtrStorage;
  vector_class<detail::ArgDesc> MArgs;
  vector_class<detail::ArgDesc> MAssociatedAccesors;
  vector_class<detail::Requirement > MRequirements;
  detail::NDRDescT MNDRDesc;
  cs MKernelName;
  cv<detail::kernel_impl> MKernel;
  detail::CG::CGTYPE MCGType ;
  unique_ptr_class<detail::HostKernelBase> MHostKernel;
};
class queue {
public:
  queue(property_list PropList = {})
      : queue(default_selector(), {}, PropList) {}
  queue(const device_selector &DeviceSelector,
        async_handler AsyncHandler, property_list PropList )
      : queue(DeviceSelector.select_device(), AsyncHandler, PropList) {}
  queue(const device &, const async_handler &,
                 const property_list &);
  context get_context() const;
  device get_device() const;
  template <typename h> event submit(h CGF , detail::code_location CodeLoc ) {
    return submit_impl(CGF, CodeLoc);
  }
  void wait(detail::code_location CodeLoc = detail::code_location()) {
    wait_proxy(CodeLoc);
  }
  void wait_proxy(const detail::code_location &);
  template <typename j , typename k>
  event parallel_for(m<> af,
                     k l ) {
    return p<j>(af, l);
  }
  event submit_impl(function_class<void(handler &)> ,
                    const detail::code_location &);
  template <typename j , typename k,
            int ac>
  event p(
      m<ac> af, k l,
       detail::code_location CodeLoc = detail::code_location::current()) {
    return submit(
        [&](handler &ag) {
          ag.parallel_for<j>(af,
                                                            l);
        },
        CodeLoc);
  }
};
void *malloc_shared(f , const device &,
                                  const context &);
template <typename >
int *malloc_shared(f ,  device Dev,  context Ctxt) {
  return static_cast<int *>(malloc_shared(sizeof(int), Dev, Ctxt));
}
template <typename h> int *malloc_shared(f Count, queue &Q) {
  return malloc_shared<h>(Count, Q.get_device(), Q.get_context());
}
}
}using namespace cl::sycl;

int i=-1;
int main() {
 i = 11;
 queue Q;
 int *x = malloc_shared<int>(100,Q);
 Q.parallel_for<class mykern>(10, [=](auto g) {
  x[g]= i;
 });
 Q.wait();
 printf("%d\n", x[0]);
 return 0;
}

// __CLANG_OFFLOAD_BUNDLE____END__ host-x86_64-unknown-linux-gnu
erichkeane commented 3 years ago

I looked at the IR and int header from the CFE and can't find a reason why the answer would be wrong. I'm guessing this is something later in the process assuming that all globals are unmodified :)

So the error here is that the global-variable-not-static/const error isn't working for this for some reason. @premanandrao I seem to remember you were working on that at one point? Do you have an idea on why that wouldn't be working here?

premanandrao commented 3 years ago

I looked at the IR and int header from the CFE and can't find a reason why the answer would be wrong. I'm guessing this is something later in the process assuming that all globals are unmodified :)

So the error here is that the global-variable-not-static/const error isn't working for this for some reason. @premanandrao I seem to remember you were working on that at one point? Do you have an idea on why that wouldn't be working here?

I think there is a general bug where because of how the deferred diagnostic mechanism works, the declaration that is associated with the diagnostic is the template declaration, where as at the point of diagnostic emission, the declaration is the instantiated version. And because they don't match, the diagnostic emission is skipped. Let me see if that is the case with this reproducer.

AlexeySachkov commented 7 months ago

Using a recent local intel/llvm build I can see that diagnostic is now in place:

$ ./bin/clang++ -fsycl t.cpp
t.cpp:12:9: error: SYCL kernel cannot use a non-const global variable
   12 |   x[g]= i;
      |         ^

The issue seems to be resolved, so I will close it. Please let us know if it is still a problem