
CUDA: unions in kernel arguments not copied completely if member contains padding #53710

Open mkuron opened 2 years ago

mkuron commented 2 years ago

Summary

I have a CUDA kernel that takes a union as one of its parameters. One of the union's member types is a struct (called Ptrs in my minimal working example) that contains padding due to the memory alignment requirements of its own members. I initialize the union as its other member type (Data in the MWE) and pass it to the kernel launch. Inside the kernel, however, some of the contents of Data appear to be zeroed out -- notably those corresponding to the padding region of Ptrs. It seems that Clang looks at the padding inside one of the union's member types and assumes that all the members have holes in the same places.

When compiling the CUDA kernel, Clang should make no assumptions about which member type is currently stored in a union it receives as an argument, but it clearly does, and in this case the assumption is wrong. We initially found this bug in conjunction with Thrust and mpark::variant, but it can be observed with plain CUDA code too, as in the MWE.
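
For reference, a minimal host-side sketch of the layout involved (my annotation, not part of the original MWE; it assumes a typical little-endian 64-bit target):

#include <cstddef>

struct Ptrs {                       // sizeof == 16
  char s1, s2, s3, s4, s5, s6, s7;  // bytes 0-6
                                    // byte 7: padding, so the pointer below is 8-byte aligned
  void const *data;                 // bytes 8-15
};

struct Data {                       // sizeof == 16, no padding
  unsigned data[4];                 // byte 7 is the most significant byte of data[1]
};

static_assert(sizeof(Ptrs) == 16, "7 chars + 1 padding byte + 8-byte pointer");
static_assert(offsetof(Ptrs, data) == 8, "the padding byte sits at offset 7");
static_assert(sizeof(Data) == 16, "4 x 4-byte unsigned, no holes");

// Byte 7 of the union is padding when viewed as Ptrs, but holds the high byte
// of Data::data[1] when viewed as Data -- exactly the kind of byte that comes
// back zeroed inside the kernel.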

Versions

I reproduced this bug with multiple Clang versions between 7.0.0 and 14.0.0-rc1, as well as with the current master branch. Nvidia's nvcc compiler does not exhibit this issue.

Starting with https://github.com/llvm/llvm-project/commit/d0615a93bb6d7aedc43323dc8957fe57e86ed8ae, my MWE required a minor modification to make sure that the argument copy wasn't optimized out. This optimization is not generally possible (e.g. when using thrust), so we cannot rely on it accidentally fixing the real issue.

Minimal working example

Compile with clang++ -O3 -std=c++14 --cuda-gpu-arch=sm_70 mwe.cu -o mwe -lcudart. The issue can also be observed without -O3. Add -DFIX_IT to get a struct without padding, in which case the issue does not occur.

#include <algorithm>
#include <iostream>
#include <vector>

#include <cuda.h>

struct Ptrs
{
  char s1, s2, s3, s4, s5, s6, s7;
#ifdef FIX_IT
  char s8;
#endif
  void const * data;
};

template <class T, unsigned N>
struct Data
{
  T data[N];
};

template <class T, unsigned N>
union DataType
{
  Data<T,N> d;
  Ptrs p;
};

__global__ void transform(DataType<unsigned, 4> umap, unsigned n, unsigned * result)
{
  const unsigned i = threadIdx.x;

  if(n > 0xffff) {
    /* This condition is never true, but the next line ensures that `umap` is copied into writable memory.
     * Clang 13 (https://github.com/llvm/llvm-project/commit/d0615a93bb6d7aedc43323dc8957fe57e86ed8ae)
     * introduced an optimization that will otherwise mask the issue we are trying to demonstrate.
     */
    umap.d.data[i] = 0;
  }

  if(i < n)
  {
    result[i] = umap.d.data[i];
  }
}

bool test()
{
  std::vector<unsigned> map{23U, 0xffffffff, 42U, 13U};
  DataType<unsigned, 4> umap;
  Data<unsigned, 4> d;
  std::copy(map.begin(), map.end(), &d.data[0]);
  umap.d = d;

  const unsigned mod = map.size();

  std::vector<unsigned> h_values(16);

  unsigned * values = nullptr;
  cudaMalloc(&values, h_values.size() * sizeof(unsigned));
  transform<<<1, h_values.size()>>>(umap, h_values.size(), values);
  cudaMemcpy(h_values.data(), values, h_values.size() * sizeof(unsigned), cudaMemcpyDeviceToHost);
  cudaFree(values);

  bool good = true;
  for(unsigned i = 0; i < mod; ++i)
  {
    std::cout << h_values[i];
    if (h_values[i] != map[i])
    {
      std::cout << " wrong";
      good = false;
    }
    std::cout << std::endl;
  }

  return good;
}

int main()
{
    bool good = test();
    return (good ? 0 : 1);
}

Assembly

To see what is happening, compile the MWE with and without -DFIX_IT and compare the NVPTX assembly: https://godbolt.org/z/W7PY9q3sj. You can see that with -DFIX_IT, the generated ld.param and st.local instructions cover the entire size of the union, while without it the bytes that are padding in Ptrs are skipped.

Artem-B commented 2 years ago

Interesting. The generated IR does not have Data in the DataType: https://godbolt.org/z/YhbPbecjx

%union.DataType = type { %struct.Ptrs }
%struct.Ptrs = type { i8, i8, i8, i8, i8, i8, i8, i8* }

When it's time to lower the byval arguments, LLVM only sees Ptrs and that's what it ends up copying. It has no idea that the parameter is actually a union. FIX_IT just fills the gap in the union, but does not really fix the root cause.

This is indeed problematic. There's a fundamental disconnect here -- union-ness is handled by the front-end, but when LLVM needs to generate IR to copy the argument deep down in LLVM, we have no idea that there's any other data out there other than the single union field picked by the front-end.

I guess one thing we could try is to convince the front-end to generate union types in LLVM in a way that the type covers the whole union storage.

Artem-B commented 2 years ago

Actually, we don't need the front-end changes. A memcpy of the byval parameter instead of a load/store should work, as the front-end-generated type should always have the right size, and that's all we care about.

mkuron commented 2 years ago

Thanks for your quick reply, @Artem-B.

when LLVM needs to generate IR to copy the argument deep down in LLVM, we have no idea that there's any other data out there other than the single union field picked by the front-end.

That seems dangerous in general, not only in a CUDA context. If information about one (the largest?) of the union's members is available to a backend, it might make optimizations based on that, which would be wrong whenever the union currently holds one of its other member types, as in the present example. Can we be sure that neither other parts of the NVPTX backend nor other backends ever look inside a %union type and use any of its properties beyond the overall size -- like member size, member alignment, etc.?

I guess one thing we could try is to convince the front-end to generate union types in LLVM in a way that the type covers the whole union storage.

Something like %union.DataType = type { [16 x i8] }?

Artem-B commented 2 years ago

Can we be sure that neither other parts of the NVPTX backend nor other backends ever look inside a %union type and use any of its properties beyond the overall size -- like member size, member alignment, etc.?

I believe the answer is yes. The front-end is expected to generate semantically valid IR, which for unions would include appropriate bitcasts, data transfers, etc. If that were broken, I would expect to see fairly widespread failures in regular C++ code.

This particular issue is a special case where the NVPTX back-end has to add additional code without the information available to the front-end. If that copy were to be generated by clang, it would copy all the data. E.g. https://godbolt.org/z/aa3r7nqvj -- regardless of whether we copy the union as one field or the whole union, clang always generates a memcpy of the whole union. If it needs to pass a union initialized with a field that has a padding gap, it still copies over the whole union: https://godbolt.org/z/jxbxsqMP6

Doing the same during lowering of byval arguments appears to be the right fix for this issue.

mkuron commented 2 years ago

Thanks for your detailed explanation, @Artem-B. Did you get a chance yet to look at how to fix this? I was actually able to put together a simple patch (see below) that resolves the issue for my minimal example code, but it leads to CUDA illegal memory access errors when I try it on my actual code, which involves Thrust and is generally more complex. So I will need to rely on you for a complete patch plus an LLVM unit test.

diff --git a/llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp b/llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp
index 6183019de43d..46d6109b6bb0 100644
--- a/llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp
@@ -92,12 +92,14 @@
 #include "NVPTXTargetMachine.h"
 #include "NVPTXUtilities.h"
 #include "MCTargetDesc/NVPTXBaseInfo.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
 #include "llvm/Analysis/ValueTracking.h"
 #include "llvm/IR/Function.h"
 #include "llvm/IR/Instructions.h"
 #include "llvm/IR/Module.h"
 #include "llvm/IR/Type.h"
 #include "llvm/Pass.h"
+#include "llvm/Transforms/Utils/LowerMemIntrinsics.h"

 #define DEBUG_TYPE "nvptx-lower-args"

@@ -130,6 +132,10 @@ public:
     return "Lower pointer arguments of CUDA kernels";
   }

+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<TargetTransformInfoWrapperPass>();
+  }
+
 private:
   const NVPTXTargetMachine *TM;
 };
@@ -290,10 +296,14 @@ void NVPTXLowerArgs::handleByValParam(Argument *Arg) {
   // Be sure to propagate alignment to this load; LLVM doesn't know that NVPTX
   // addrspacecast preserves alignment.  Since params are constant, this load is
   // definitely not volatile.
-  LoadInst *LI =
-      new LoadInst(StructType, ArgInParam, Arg->getName(),
-                   /*isVolatile=*/false, AllocA->getAlign(), FirstInst);
-  new StoreInst(LI, AllocA, FirstInst);
+  const uint64_t Size = DL.getTypeSizeInBits(Arg->getType());
+  LLVMContext &Ctx = Func->getParent()->getContext();
+  const TargetTransformInfo &TTI =
+      getAnalysis<TargetTransformInfoWrapperPass>().getTTI(*Func);
+  createMemCpyLoopKnownSize(FirstInst, ArgInParam, AllocA,
+                            ConstantInt::get(Type::getInt64Ty(Ctx), Size),
+                            AllocA->getAlign(), AllocA->getAlign(),
+                            false, false, TTI);
 }

 void NVPTXLowerArgs::markPointerAsGlobal(Value *Ptr) {
Artem-B commented 2 years ago

I was planning to incorporate this into a work-in-progress patch along with other improvements in NVPTXLowerArgs.cpp, but it got pushed back by other work.

I think the crash is due to an attempt to copy 8x the amount we want, since the Size we've got is in bits:

const uint64_t Size = DL.getTypeSizeInBits(Arg->getType());

You should verify that the memcpy produces ld.param instructions and not the regular ld.

Also, memcpy will be sub-optimal performance-wise as it would not be able to use 128-bit loads/stores. In the end we may want to use load/store on an opaque array type of the size that we need to copy.

Artem-B commented 2 years ago

Looks like argument lowering is only part of the problem. We also need to address the same issue with function call lowering, so that the arguments we pass and return are also copied completely.

E.g.: https://godbolt.org/z/d111d6asP. You can see that we copy only the struct fields, but not the gaps between them.

mkuron commented 1 year ago

I was planning to incorporate this into a work-in-progress patch along with other improvements in NVPTXLowerArgs.cpp

@Artem-B, what's the status of this work-in-progress patch? We're still running into this bug, so I was wondering whether there might have been some progress in the past year.

Artem-B commented 1 year ago

Unfortunately, none. It got preempted by other work and then slipped through the cracks. :-(

I probably would not have much time to deal with it till August.

mkuron commented 1 year ago

@Artem-B, no problem. August or even September aligns well with our own plans, so I'm looking forward to any solution you might come up with.

mkuron commented 9 months ago

@Artem-B, any update? I suspect this bug might become more relevant to other users besides myself, now that libcu++ added cuda::variant in https://github.com/NVIDIA/cccl/pull/1076.

Artem-B commented 8 months ago

Unfortunately, no. It keeps moving down my todo list. :-(

It's just one of the bugs that happens to need nontrivial effort to investigate and fix, but has no impact on our actual use cases for my day job, so it does get preempted by more immediate issues. I'll try to find someone willing to take it on as a side project, but no promises.

Sorry about this.

mkuron commented 3 weeks ago

We just ran into this bug again, and I was finally able to come up with a small patch that fixes the bug to the extent that we observe it:

--- a/llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp
@@ -383,8 +383,14 @@ void NVPTXLowerArgs::handleByValParam(const NVPTXTargetMachine &TM,
   // Be sure to propagate alignment to this load; LLVM doesn't know that NVPTX
   // addrspacecast preserves alignment.  Since params are constant, this load is
   // definitely not volatile.
+  const auto StructBytes = *AllocA->getAllocationSize(DL);
+  const auto ChunkBytes = (StructBytes % 8 == 0) ? 8 :
+                          (StructBytes % 4 == 0) ? 4 :
+                          (StructBytes % 2 == 0) ? 2 : 1;
+  Type *ChunkType = Type::getIntNTy(Func->getContext(), 8*ChunkBytes);
+  Type *OpaqueType = ArrayType::get(ChunkType, StructBytes / ChunkBytes);
   LoadInst *LI =
-      new LoadInst(StructType, ArgInParam, Arg->getName(),
+      new LoadInst(OpaqueType, ArgInParam, Arg->getName(),
                    /*isVolatile=*/false, AllocA->getAlign(), FirstInst);
   new StoreInst(LI, AllocA, FirstInst);
 }

My patch is not perfect as it doesn't deal with some of @Artem-B's comments above:

memcpy will be sub-optimal performance-wise as it would not be able to use 128-bit loads/stores. In the end we may want to use load/store on an opaque array type of the size that we need to copy.

It should probably generate StructBytes / 16 128-bit loads/stores, plus at most one each of 64-, 32-, 16-, and 8-bit loads/stores for the remainder, instead of simply using the largest size that evenly divides the union's size.
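
For illustration, that decomposition could look something like this (a hypothetical standalone helper, not part of the patch; alignment constraints are ignored):

#include <cstdint>
#include <vector>

// Split a copy of `bytes` bytes into the widest chunks first: as many
// 16-byte (128-bit) pieces as fit, then at most one each of 8-, 4-, 2-,
// and 1-byte pieces for the remainder.
std::vector<std::uint64_t> chunkSizes(std::uint64_t bytes) {
  std::vector<std::uint64_t> chunks;
  for (; bytes >= 16; bytes -= 16)
    chunks.push_back(16);
  for (std::uint64_t c : {8, 4, 2, 1})  // remainder is < 16, so each size
    if (bytes & c)                      // appears at most once
      chunks.push_back(c);
  return chunks;
}

A 22-byte union, for example, would then be copied as one 16-byte chunk plus one 4-byte and one 2-byte chunk.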

Looks like argument lowering is only part of the problem. We also need to address the same issue with function call lowering so the arguments we pass and return are also copied completely

We don't return unions in our own code, so we haven't encountered that variation of the bug ourselves. I can't find the corresponding bit in the LLVM NVPTX backend code, so I haven't looked into it.

Do you have any comments on my patch, @Artem-B? I am admittedly no LLVM expert, but it seems like my patch might be able to bridge the time until someone with more LLVM experience gets a chance to look at this bug.

Artem-B commented 3 weeks ago

This could work. Converting the type to a plain array should block SROA's attempt to optimize the copy into a field-by-field copy of the struct type representing the union, and we should still be able to copy it all effectively. If alignment allows it, the loads/stores will be coalesced into vectorized loads/stores, so there should be no performance regression. It should actually help a bit, as by-field copying would likely prevent such load/store vectorization.

Please make your proposed patch a pull request and we'll discuss the details there.

mkuron commented 3 weeks ago

Please make your proposed patch a pull request

Thanks, @Artem-B. Before I do that, how certain are we that this kind of problem only affects the NVPTX backend? For example, #76017 reports something that looks similar on x86 and suggests a more general solution around CGRecordLowering::lowerUnion: https://github.com/llvm/llvm-project/blob/4497ec293a6e745be817dc88027169bd5e4f7246/clang/lib/CodeGen/CGRecordLayoutBuilder.cpp#L318-L369

If alignment allows it, the loads/stores will be coalesced into vectorized loads/stores

Not sure if that would be different, but in my initial test with ChunkBytes = 1, it never coalesced into wider loads/stores.

Artem-B commented 3 weeks ago

how certain are we that this kind of problem only affects the NVPTX backend? For example, https://github.com/llvm/llvm-project/issues/76017 reports something that looks similar on x86

The root cause appears to be the same -- the SROA pass assumes that the struct fields are all that matter, and it does not copy padding, which may be part of the other union fields. NVPTX may just be unlucky enough to run into this more often, as it requires copying byval args.

and suggests a more general solution around

I assume you mean these:

Fix idea 1: lower C/C++ unions into byte arrays of size of the maximum union member.

This is effectively a simpler version of what your patch is doing -- convert to a flat array, so that all data, including padding, is important to SROA.

Fix idea 2: lower C/C++ unions into intermediate llvm::StructType which is the union member of maximum size with padding being filled by extra members (byte arrays or i1 / i2 / i4 / i16 / i32 types)

That's another way to ensure that the padding is copied. It does have a potential negative side effect in that separately copying actual fields and padding will likely be harder to load/store efficiently. E.g. struct { short s; char c; } could be loaded/stored as an i32 with your approach, but would need three loads/stores: an i16 for s, an i8 for c, and an i8 for the padding. While they could be coalesced in theory, we mostly vectorize only uniform series of same-size scalar loads/stores, so it will probably end up as an i16 and a v2i8 load/store.
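
To make the trade-off concrete, here is my reading of how the two ideas would lay out that example (the IR types are illustrative, assuming a 2-byte short):

struct SC { short s; char c; };  // sizeof == 4: bytes 0-1 for s, byte 2 for c,
                                 // byte 3 is trailing padding

// Fix idea 1 would lower SC as [4 x i8]        -> copyable as one i32 load/store
// Fix idea 2 would lower SC as { i16, i8, i8 } (padding as an explicit member)
//             -> three scalar copies (i16 for s, i8 for c, i8 for padding),
//                likely coalesced to i16 + v2i8 at best, but not to a single i32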

mkuron commented 3 weeks ago

Fix idea 1: lower C/C++ unions into byte arrays of size of the maximum union member.

This is effectively, a simpler version of what your patch is doing -- convert to a flat array, to make all data, including padding, to be important to SROA.

Wouldn't that be the preferable solution? Convert the union into an opaque byte array in the very place where we discard the knowledge of whether it is a struct or a union, instead of pretending the union is a struct and having to be careful in the target backends whenever a struct that may be a union is copied.

Artem-B commented 3 weeks ago

I think the root cause here is clang.

LLVM has no notion of unions and is under no obligation to preserve anything in the gaps. Alignment is a target data layout property, and I do not think LLVM provides any guarantees about the content of the resulting gaps. Clang currently generates a struct type to represent a union, and that struct does not represent all union variants. It should have used a type that tells LLVM about all the parts of the data that may be used.

I'm sure I'm missing some nuances here. I suspect that if it were that easy, it would have been done already.

mkuron commented 3 weeks ago

I think the root cause here is clang.

Okay, in that case it probably doesn't make sense to take my patch to master. It really only fixes one symptom. It would be a stopgap solution at best.

* Clang 13 (https://github.com/llvm/llvm-project/commit/d0615a93bb6d7aedc43323dc8957fe57e86ed8ae)
* introduced an optimization that will otherwise mask the issue we are trying to demonstrate.

The fact that I don't see the problem occur consistently across our large GPU codebase left me wondering. In fact, all our kernel arguments are const (unlike in the sample code in my initial post), so none of these temporary copies should even exist (50c7504a93fdb90c26870db8c8ea7add895c7725, d0615a93bb6d7aedc43323dc8957fe57e86ed8ae). However, https://godbolt.org/z/orEb5b5qq shows that they are only skipped when the size of the struct argument is 1, 2, 4, or 8 bytes (-mllvm -debug-only=nvptx-lower-args will helpfully print a message whenever a copy is introduced). For any other size, the InstCombine pass does not convert the call void @llvm.memcpy into a load/store pair, and that memcpy is lowered to loads/stores only after the temporary copies have already been inserted. I should probably open a separate issue for that -- it's not a bug, but certainly a missed optimization opportunity (and not even limited to something as exotic as unions).

Edit: It looks like this other issue might be partially fixed by #106423, though it still has a TODO about unnecessarily forced copies on sm_70 and earlier.

Artem-B commented 3 weeks ago

The fact that I don't see the problem occur consistently across our large GPU codebase left me wondering.

This is expected, I think. We're trying to eliminate as many copies as we can. The cases where I see this particular issue triggered are when we have to create a copy very late in the pipeline (so most generic IR optimizations will not see it) and then feed that IR to SROA, which is the pass that relies on struct fields to optimize copying.

Changing argument lowering to lower the argument as an array may not fix all instances, but it will fix a known instance of this issue. It will improve things, even if it may not be a complete fix. If/when clang starts generating flat arrays for unions, this change will effectively become a no-op, so I do not see much of a downside here.

all our kernel arguments are const (unlike in the sample code in my initial post), so none of these temporary copies should even exist

It's more complicated. The key issue here is whether we ever take a pointer to an argument. Taking a pointer to a byval parameter requires making a copy. On a newer GPU we do have a way to take a pointer, but then, in order to avoid a copy, we must prove that the argument is never written to, either explicitly or possibly somewhere else if it escapes. const alone does not guarantee no writes (it's easy enough to drop the const by casting), and all bets are off if the pointer escapes. So, whether we produce the copy or not depends on LLVM's ability to analyze all uses of the argument.
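
To illustrate the escape part (a hypothetical kernel of my own, not from the MWE; it reuses the MWE's DataType):

// Once the parameter's address escapes into code LLVM cannot analyze, LLVM
// can no longer prove that the byval argument is never written through that
// pointer, so it has to materialize a local copy conservatively.
__device__ void opaque(const void *p);  // defined in some other translation unit

__global__ void kernel(const DataType<unsigned, 4> umap, unsigned *out) {
  opaque(&umap);            // the pointer escapes -- all bets are off
  out[0] = umap.d.data[0];  // LLVM must assume `opaque` may have written here
}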

mkuron commented 3 weeks ago

Thanks for the detailed answer, @Artem-B.

If/when clang will start generating flat arrays for unions

Would you like me to create a separate follow-up issue for considering changing how Clang lowers unions?

this change will effectively become a no-op, so I do not see much of a downside here.

Okay, I'll submit a pull request shortly.

So, whether we produce the copy or not depends on LLVM's ability to analyze all uses of the argument.

Would you like me to create separate issues when I find more unnecessary copies? I assume that #106423 would eliminate the copies from https://godbolt.org/z/orEb5b5qq, at least on sm_70 and newer. I haven't tested #106423 against our codebase yet, but it seemed like the bulk of the unnecessary copies we were getting was for the same reason as in the sample. There may be others, and it seems like it should be possible to also eliminate them from the memcpy on sm_60 and older by unrolling before copying.

Artem-B commented 3 weeks ago

Would you like me to create a separate follow-up issue for considering changing how Clang lowers unions?

I think so. I did some digging in the LLVM spec, and AFAICT SROA is correct here. If it sees an aggregate, it assumes that loading/storing all fields of the aggregate is equivalent to loading/storing the complete aggregate.

From the LLVM docs for load (https://llvm.org/docs/LangRef.html#id211):

If the value being loaded is of aggregate type, the bytes that correspond to padding may be accessed but are ignored, because it is impossible to observe padding from the loaded aggregate value.

For store (https://llvm.org/docs/LangRef.html#id216) it says:

If <value> is of aggregate type, padding is filled with undef.

AFAICT, it's pretty clear that padding is not preserved.
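
A C++-level analogy of those rules (my illustration, not from the thread): an aggregate copy only has to preserve the members, not the bytes in between.

struct S {
  char c;  // byte 0
           // bytes 1-3: padding
  int i;   // bytes 4-7
};

void copy(S &dst, const S &src) {
  dst = src;  // memberwise copy: c and i are preserved, but the compiler is
              // free to leave dst's padding bytes unspecified
}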