Closed by wence- 1 month ago
Investigating this, I found the problem is reported through thrust::exclusive_scan, which is implemented using CUB's DeviceScan::ExclusiveScan. Exclusive-scan is unusual in that the last input element never contributes to the output. For example,
int data[6] = {1, 0, 2, 2, 1, 3};
thrust::exclusive_scan(data, data + 6, data); // in-place scan
// data is now {0, 1, 1, 3, 5, 6}
Note that the last element in data is never actually used; any value can be there without affecting the result.
int data[6] = {1, 0, 2, 2, 1, X};
thrust::exclusive_scan(data, data + 6, data); // in-place scan
// data is now {0, 1, 1, 3, 5, 6}
The normal pattern for libcudf APIs that return strings columns is to make two passes. The first pass computes the output sizes, which are used to build the output offsets and chars columns. The second pass fills in the chars column. The number of output offsets is usually known up front, so the offsets column is created first and the first pass writes the sizes into it temporarily. Since the sizes are only needed to build the offsets, there is normally no need for a separate temporary buffer/vector: the offsets can be computed directly in place.
Although the number of output sizes equals the number of output rows, the offsets count is always one more than the number of rows: the first element is 0 and the final element is the total size in bytes. The exclusive-scan produces this for us in a single step:
int output_rows = 5;
int offsets[] = {1, 2, 0, 2, 4, X}; // populated with sizes, the last entry is garbage
thrust::exclusive_scan(offsets, offsets + 6, offsets);
// offsets is now {0, 1, 3, 3, 5, 9}
Although the last element may be read by thrust/CUB, it is never actually used. This is a limitation of the initcheck tool in compute-sanitizer: it only knows that uninitialized memory has been read; it has no idea that the value plays no part in producing the output. So I don't think we can reliably use initcheck to verify invalid usage of uninitialized data.
I would not recommend fixing this by artificially setting the last entry, since it would unnecessarily impact performance in the many, many places this pattern is used.
Is this something we should take to the compute-sanitizer team?
I would have thought that a write to the final entry would not provoke an uninitialised read warning (it wouldn't in host code with valgrind I think)
Is this something we should take to the compute-sanitizer team?
I doubt they can trace the read down to where it is eventually used or not used -- at what level in the stack trace does it actually get discarded? I'm guessing the values are block-loaded into an array, then into a local register, and only then finally discarded.
I would have thought that a write to the final entry would not provoke an uninitialised read warning (it wouldn't in host code with valgrind I think)
I believe thrust/CUB is doing a read of all available data in some low-level utility probably common to both types of scans or other CUB functions. The fact the last value eventually is not used is specific to exclusive-scan. It may be worth opening an issue with thrust but I fear the solution would be non-performant checks.
In an ideal world (not saying this is in any way possible right now) this seems like a case where compute sanitizer would offer something like pragmas to allow code to say "this block is going to include unsafe accesses, ignore them because it's actually OK". I don't know that there's any other general fix, since IIUC effectively what is happening is that thrust is performing an unsafe access to potentially garbage data but it's OK because the value that is read will never be used.
I believe thrust/CUB is doing a read of all available data in some low-level utility probably common to both types of scans or other CUB functions. The fact the last value eventually is not used is specific to exclusive-scan. It may be worth opening an issue with thrust but I fear the solution would be non-performant checks.
FWIW, I think the cudf usage here is UB (if we're going by thrust modelling the std:: algorithms): the first two arguments to thrust::exclusive_scan must model InputIterator, which requires that the full range of the iterator be dereferenceable. Since it is UB to read an uninitialised value, dereferencing the last entry of the input range is bad (because it is uninitialised).
Arguably an exclusive_scan implementation doesn't need to dereference the last entry of the input iterator, but I can't see anything in the standard that requires that of an implementation.
Related request to guarantee that the final entry of the input iterator is never read by CUB https://github.com/NVIDIA/cccl/issues/876
In terms of general uninitialized memory: perhaps there should be a compile-time option (off by default) for libcudf to zero-initialize device allocations (this is probably much harder than it sounds), to be used as a debugging tool.
I agree, it seems like another case of technically UB where we're taking advantage of knowing the implementation detail that in practice this is safe. I doubt the std::exclusive_scan specification will be modified to account for the desire not to read the final value, but I suppose thrust could promise that anyway. If there's a CUB request then a thrust request also seems appropriate, @jrhemstad WDYT?
technically UB but we're taking advantage of knowing the implementation detail that in practice this is safe.
Although this works right now, this is not how UB works. A sufficiently smart compiler™ will treat UB as an impossible situation, and is free to (for example) eliminate any code related to it. In this circumstance, that might be eliminating the call to exclusive_scan completely (admittedly the reasoning to get there is pretty convoluted, so I don't think this is very likely to occur, since the iteration bounds of the iterator that partially initialises the input to exclusive_scan are not known at compile time). So it would be totally valid to erase any calls to sizes_to_offsets that occur in these contexts, even if the implementation of exclusive_scan remains unchanged.
I agree with @davidwendt that running an extra kernel to memset the final entry in the allocated offsets array to zero is not the solution.
I would prefer to try and fix this in CUB, and ideally the C++ standard, though I don't know how amenable they'd be to changing the definition of exclusive_scan.
I mean, that's the definition of an implementation detail, right? UB says an implementation can do whatever it wants; one of the options is to do something we think is sensible. We know empirically that the code thrust produces is not being elided in the way you describe by the CUDA compiler, even though it could be. I'm not saying we should rely on the current behavior, of course. We need thrust to promise not to read the last value, which in turn means we need CUB to promise not to read the last value. If we can get that upstreamed to the C++ standard for std::exclusive_scan, that's even better.
I believe the summary here results in no recommended changes to libcudf.
Would https://github.com/NVIDIA/cccl/issues/876 address this upstream, or do we need more changes? I'm inclined to agree with @wence- that ultimately this can only really be fixed by the C++ standard specifying that exclusive_scan will not read the last value, but neither that nor any of the other proposed changes above (e.g. making thrust do this safely) seems to be forthcoming, since they are either inherently slow processes or have potentially deleterious performance consequences that we're unlikely to stomach.
Would https://github.com/NVIDIA/cccl/issues/876 address this upstream, or do we need more changes?
I believe it would, yes.
Conversion from (at least) integer to string columns (done a lot for printing on the cudf-python side of things) appears to trigger uninitialized device memory accesses. Related: #8873.
Consider the following:
When run as:
This looks like an off-by-one, but my tracking through the libcudf side of things didn't spot anything, so perhaps it is a bug in thrust or CUB.