scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License

feat: add reduce kernels #3136

Closed: ManasviGoyal closed this 5 days ago

ManasviGoyal commented 1 month ago

Kernels tested for different block sizes

lgray commented 3 weeks ago

@ManasviGoyal in trying to implement query 4 from the analysis benchmarks I found some nasty memory scaling:

If you scroll to the bottom of the trace below, you'll see what happens when attempting to execute:

[screenshot of the failing notebook cell]

This processes ~53M rows from the input file all at once; the data itself fits on the GPU with no problem, and so does the histogram being filled.

For this test I have merged #3123, #3142, and this PR on top of awkward main; this PR was merged last.

However, in the ak.sum step, where the calculation fails, it attempts to allocate 71 terabytes of RAM on the device. This seems excessive and indicates poor memory scaling in the implementation. You'll see that it fails in the ak.sum step and nowhere else.
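
For a rough sense of where a number like that comes from: the allocation that fails in the trace below is `partial = cupy.zeros(outlength * grid_size, dtype=toptr.dtype)`, so the temporary buffer grows with the number of lists times the number of blocks, not with the number of lists alone. A back-of-the-envelope sketch (the grid_size value here is back-solved from the reported byte count, not read out of the code):

```python
# Back-of-the-envelope only; grid_size is an assumed value chosen to reproduce the
# reported allocation, not a number taken from the actual kernel launch.
outlength = 53_446_198      # number of lists, from the array type in the error message
grid_size = 166_947         # assumed number of blocks
nbytes = outlength * grid_size * 8   # int64 partial sums
print(f"{nbytes:,}")        # 71,381,459,340,048 -> ~71 TB (CuPy rounds the request up to 71,381,459,340,288)
```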

Here's the full stack trace:

---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
Cell In[3], line 16
     11 MET_pt = ak.to_backend(jetmet.MET_pt, "cuda")
     12 q4_hist = hist.Hist(
     13     "Counts",
     14     hist.Bin("met", "$E_{T}^{miss}$ [GeV]", 100, 0, 200),
     15 )
---> 16 has2jets = ak.sum(Jet_pt > 40, axis=1) >= 2
     17 q4_hist.fill(met=MET_pt[has2jets])
     19 q4_hist.to_hist().plot1d(flow="none");

File [~/coffea-gpu/awkward/src/awkward/_dispatch.py:64](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/_dispatch.py#line=63), in named_high_level_function.<locals>.dispatch(*args, **kwargs)
     62 # Failed to find a custom overload, so resume the original function
     63 try:
---> 64     next(gen_or_result)
     65 except StopIteration as err:
     66     return err.value

File [~/coffea-gpu/awkward/src/awkward/operations/ak_sum.py:210](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/operations/ak_sum.py#line=209), in sum(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)
    207 yield (array,)
    209 # Implementation
--> 210 return _impl(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)

File [~/coffea-gpu/awkward/src/awkward/operations/ak_sum.py:277](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/operations/ak_sum.py#line=276), in _impl(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)
    274     layout = ctx.unwrap(array, allow_record=False, primitive_policy="error")
    275 reducer = ak._reducers.Sum()
--> 277 out = ak._do.reduce(
    278     layout,
    279     reducer,
    280     axis=axis,
    281     mask=mask_identity,
    282     keepdims=keepdims,
    283     behavior=ctx.behavior,
    284 )
    285 return ctx.wrap(out, highlevel=highlevel, allow_other=True)

File [~/coffea-gpu/awkward/src/awkward/_do.py:333](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/_do.py#line=332), in reduce(layout, reducer, axis, mask, keepdims, behavior)
    331 parents = ak.index.Index64.zeros(layout.length, layout.backend.index_nplike)
    332 shifts = None
--> 333 next = layout._reduce_next(
    334     reducer,
    335     negaxis,
    336     starts,
    337     shifts,
    338     parents,
    339     1,
    340     mask,
    341     keepdims,
    342     behavior,
    343 )
    345 return next[0]

File [~/coffea-gpu/awkward/src/awkward/contents/listoffsetarray.py:1612](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/contents/listoffsetarray.py#line=1611), in ListOffsetArray._reduce_next(self, reducer, negaxis, starts, shifts, parents, outlength, mask, keepdims, behavior)
   1609 trimmed = self._content[self.offsets[0] : self.offsets[-1]]
   1610 nextstarts = self.offsets[:-1]
-> 1612 outcontent = trimmed._reduce_next(
   1613     reducer,
   1614     negaxis,
   1615     nextstarts,
   1616     shifts,
   1617     nextparents,
   1618     globalstarts_length,
   1619     mask,
   1620     keepdims,
   1621     behavior,
   1622 )
   1624 outoffsets = Index64.empty(outlength + 1, index_nplike)
   1625 assert outoffsets.nplike is index_nplike and parents.nplike is index_nplike

File [~/coffea-gpu/awkward/src/awkward/contents/numpyarray.py:1122](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/contents/numpyarray.py#line=1121), in NumpyArray._reduce_next(self, reducer, negaxis, starts, shifts, parents, outlength, mask, keepdims, behavior)
   1119 assert self.is_contiguous
   1120 assert self._data.ndim == 1
-> 1122 out = reducer.apply(self, parents, starts, shifts, outlength)
   1124 if mask:
   1125     outmask = ak.index.Index8.empty(outlength, self._backend.index_nplike)

File [~/coffea-gpu/awkward/src/awkward/_reducers.py:358](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/_reducers.py#line=357), in Sum.apply(self, array, parents, starts, shifts, outlength)
    355 if result.dtype in (np.int64, np.uint64):
    356     assert parents.nplike is array.backend.index_nplike
    357     array.backend.maybe_kernel_error(
--> 358         array.backend[
    359             "awkward_reduce_sum_int64_bool_64",
    360             np.int64,
    361             array.dtype.type,
    362             parents.dtype.type,
    363         ](
    364             result,
    365             array.data,
    366             parents.data,
    367             parents.length,
    368             outlength,
    369         )
    370     )
    371 elif result.dtype in (np.int32, np.uint32):
    372     assert parents.nplike is array.backend.index_nplike

File [~/coffea-gpu/awkward/src/awkward/_kernels.py:169](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/_kernels.py#line=168), in CupyKernel.__call__(self, *args)
    157 args = (
    158     *args,
    159     len(ak_cuda.cuda_streamptr_to_contexts[cupy_stream_ptr][1]),
    160     ak_cuda.cuda_streamptr_to_contexts[cupy_stream_ptr][0],
    161 )
    162 ak_cuda.cuda_streamptr_to_contexts[cupy_stream_ptr][1].append(
    163     ak_cuda.Invocation(
    164         name=self.key[0],
    165         error_context=ak._errors.ErrorContext.primary(),
    166     )
    167 )
--> 169 self._impl(grid, blocks, args)

File [~/coffea-gpu/awkward/src/awkward/_connect/cuda/_kernel_signatures.py:4337](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/_connect/cuda/_kernel_signatures.py#line=4336), in by_signature.<locals>.f(grid, block, args)
   4335     segment = 0
   4336     grid_size = 1
-> 4337 partial = cupy.zeros(outlength * grid_size, dtype=toptr.dtype)
   4338 temp = cupy.zeros(lenparents, dtype=toptr.dtype)
   4339 cuda_kernel_templates.get_function(fetch_specialization(["awkward_reduce_sum_int64_bool_64_a", int64, bool_, parents.dtype]))((grid_size,), block, (toptr, fromptr, parents, lenparents, outlength, partial, temp, invocation_index, err_code))

File ~/.conda/envs/coffea-gpu/lib/python3.12/site-packages/cupy/_creation/basic.py:248, in zeros(shape, dtype, order)
    229 def zeros(
    230         shape: _ShapeLike,
    231         dtype: DTypeLike = float,
    232         order: _OrderCF = 'C',
    233 ) -> NDArray[Any]:
    234     """Returns a new array of given shape and dtype, filled with zeros.
    235 
    236     Args:
   (...)
    246 
    247     """
--> 248     a = cupy.ndarray(shape, dtype, order=order)
    249     a.data.memset_async(0, a.nbytes)
    250     return a

File cupy[/_core/core.pyx:132](https://analytics-hub.fnal.gov/_core/core.pyx#line=131), in cupy._core.core.ndarray.__new__()

File cupy[/_core/core.pyx:220](https://analytics-hub.fnal.gov/_core/core.pyx#line=219), in cupy._core.core._ndarray_base._init()

File cupy[/cuda/memory.pyx:738](https://analytics-hub.fnal.gov/cuda/memory.pyx#line=737), in cupy.cuda.memory.alloc()

File cupy[/cuda/memory.pyx:1424](https://analytics-hub.fnal.gov/cuda/memory.pyx#line=1423), in cupy.cuda.memory.MemoryPool.malloc()

File cupy[/cuda/memory.pyx:1445](https://analytics-hub.fnal.gov/cuda/memory.pyx#line=1444), in cupy.cuda.memory.MemoryPool.malloc()

File cupy[/cuda/memory.pyx:1116](https://analytics-hub.fnal.gov/cuda/memory.pyx#line=1115), in cupy.cuda.memory.SingleDeviceMemoryPool.malloc()

File cupy[/cuda/memory.pyx:1137](https://analytics-hub.fnal.gov/cuda/memory.pyx#line=1136), in cupy.cuda.memory.SingleDeviceMemoryPool._malloc()

File cupy[/cuda/memory.pyx:1382](https://analytics-hub.fnal.gov/cuda/memory.pyx#line=1381), in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc()

File cupy[/cuda/memory.pyx:1385](https://analytics-hub.fnal.gov/cuda/memory.pyx#line=1384), in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc()

OutOfMemoryError: Out of memory allocating 71,381,459,340,288 bytes (allocated so far: 3,718,910,464 bytes).

This error occurred while calling

    ak.sum(
        <Array [[True, False], [...], ..., [False]] type='53446198 * var * ...'>
        axis = 1
    )
ManasviGoyal commented 3 weeks ago

> @ManasviGoyal in trying to implement query 4 from the analysis benchmarks I found some nasty memory scaling: [...]
>
> [full comment and stack trace quoted above]

Hi, I am still working on these kernels and need to fix a few things. I will update once I am done with this PR. The issue is most likely due to the use of partial; I plan to remove it. PR #3123 is just for experimenting in separate Python scripts; it is not the implementation.

lgray commented 3 weeks ago

No worries - just reporting what I'm finding with things as they are. Thanks!

ManasviGoyal commented 3 weeks ago

> No worries - just reporting what I'm finding with things as they are. Thanks!

Yes. It's very helpful, since I can only test a limited number of cases, so knowing how it behaves on actual data helps in identifying the issues. Thanks! I will keep you updated.

lgray commented 3 weeks ago

The change to an accumulator with atomics certainly fixed the memory issue, though it's a bit slower than I expected for a sum: ~250 MHz throughput for summing bools into int64.

As an optimization for sums on the last dimension, couldn't you write this without atomics or any race conditions by having each thread sum over the last dimension into an array of one less dimension? Or is the thread divergence too bad and atomics are still faster?

With the atomic implementation you're guaranteed to have access contention because each element is going to be hitting the same output position to make the sum. I don't have good intuition if that's going to be better or worse than thread divergence.

@jpivarski maybe?
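
For concreteness, here is roughly the shape I have in mind, as a sketch only: one thread per list, looping over that list's elements, so no atomics are needed. The offsets/values layout and names are stand-ins for illustration; the actual awkward reducer kernels receive a parents array rather than offsets.

```python
import cupy as cp

# Sketch of a no-atomics segmented sum: each thread owns one output list and loops
# over that list's elements, so no two threads ever write the same output slot.
row_sum = cp.RawKernel(r"""
extern "C" __global__
void row_sum_bool_int64(const long long* offsets,  // length n_rows + 1
                        const bool* values,        // flattened inner values
                        long long* out,            // length n_rows
                        long long n_rows) {
    long long row = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (row >= n_rows) return;
    long long acc = 0;
    for (long long i = offsets[row]; i < offsets[row + 1]; i++) {
        acc += values[i] ? 1 : 0;   // sum the booleans of this row only
    }
    out[row] = acc;                 // written by exactly one thread, no atomics
}
""", "row_sum_bool_int64")

offsets = cp.asarray([0, 2, 2, 5], dtype=cp.int64)   # three lists: [T,F], [], [F,T,T]
values = cp.asarray([True, False, False, True, True])
out = cp.zeros(offsets.size - 1, dtype=cp.int64)

threads = 128
blocks = (int(out.size) + threads - 1) // threads
row_sum((blocks,), (threads,), (offsets, values, out, cp.int64(out.size)))
print(out)  # [1 0 2]
```

The divergence here comes only from lists of unequal length within a warp, whereas the atomic version additionally serializes writes that land on the same output element.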

lgray commented 3 weeks ago

In any case - with this latest change I've now got query 4 done on the ADL benchmarks. The rest seem to require combinations, so I'll wait for that!

lgray commented 2 weeks ago

Ah, indeed, looking at how you've implemented atomics, there's already effectively thread divergence built into your kernel. So I think we can get less thread divergence if we drop atomics for sum along the last axis, which is a very common case worth optimizing.

ManasviGoyal commented 2 weeks ago

> Ah, indeed, looking at how you've implemented atomics, there's already effectively thread divergence built into your kernel. So I think we can get less thread divergence if we drop atomics for sum along the last axis, which is a very common case worth optimizing.

Thanks for the feedback. You are right: the checks for the block-boundary condition are causing divergence. I will try some other approaches to handle the block-boundary case, since we need to make sure the sum is carried over across blocks for each parent. I was thinking of using another reduction step to replace the atomics. I'll update once I have an implementation.

lgray commented 2 weeks ago

That being said I wouldn't throw the baby out with the bath water and delay merging this if you are otherwise happy. Performance can be won over time.

ManasviGoyal commented 2 weeks ago

> That being said I wouldn't throw the baby out with the bath water and delay merging this if you are otherwise happy. Performance can be won over time.

Okay, I agree. Then it would be better to finalize and merge this with the current implementation for now. I can work on implementing other important kernels and get back to these once I am done.

ManasviGoyal commented 2 weeks ago

@ianna Should I close #3123 or can this be merged? This just contains some studies I did for reducers.

ManasviGoyal commented 2 weeks ago

> @ManasviGoyal - I'm not sure if it's a regression or my configuration. Do you see these tests pass? Thanks!

> tests-cuda-kernels-explicit 5895 passed in 8.95s
> tests-cuda-kernels 6448 passed, 193 skipped in 8.06s
> tests-cuda 48 failed, 522 passed in 12.89s

========================================= short test summary info ==========================================
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_bitmaskedarray2b - assert [[0.0, 1.1, 2...e, None, None] == [[0.0, 1.1, 2....7, 8.8, 9.9]]
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_bytemaskedarray2b - assert [[], [], [], []] == [[1.1, 1.1], ...], [6.6, 9.9]]
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_emptyarray - assert [[], [], [], []] == [[], [None], [], []]
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_indexedarray - cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 45,035,996,273,704,960 bytes (allocated so ...
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_indexedarray2 - AssertionError: broadcast_tooffsets64 can only be used with offsets that start at 0, not 257
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_indexedarray2b - cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 34,224,996,864 bytes (allocated so far: 21,...
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_indexedarray3 - assert [4.4, None, 1.1] == [4.4, None, 2.2]
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_jagged - assert [[], [], [], [], []] == [[1.1, 3.3], ....8, 8.8, 7.7]]
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_double_jagged - assert [[], []] == [[[2, 1, 0], ... 10, 10, 12]]]
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_masked_jagged - assert [[], [], [], [], []] == [[3.3, 2.2], ...e, [8.8, 7.7]]
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_jagged_masked - cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 8,796,227,240,448 bytes (allocated so far: ...
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_array_boolean_to_int - assert [[3, -1, 2], ... [0, 0, 0, 0]] == [[0, 1, 2], [... [0, 1, 2, 3]]
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_array_slice - assert [0.0, 2.2, 2....5.5, 9.9, ...] == [5.5, 2.2, 2....9.9, 0.0, ...]
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_jagged_mask - assert [[], [], [], [], []] == [[1.1, 2.2, 3....7, 8.8, 9.9]]
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_jagged_missing_mask - assert [[], [], [1.1, 1.1, 1.1]] == [[1.1, 2.2, 3...], [4.4, 5.5]]
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_masked_of_jagged_of_whatever - ValueError: Negative dimensions are not allowed
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_missing - assert [3.3, 3.3, 3.3, 3.3, 3.3, 3.3] == [3.3, 6.6, No...one, 8.8, 6.6]
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_new_slices - assert [1, 0, None, 9, 3, None, ...] == [5, 2, None, 3, 9, None, ...]
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_record - ValueError: Negative dimensions are not allowed
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_records_missing - AssertionError: assert [{'x': 3, 'y'... 0, 'y': 0.0}] == [{'x': 3, 'y'... 7, 'y': 7.7}]
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_regular_regular - assert [[[], [], [], [], []], []] == [[[2], [6, 8]...[25, 27, 29]]]
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_sequential - assert [[], []] == [[[10, 11, 12... 17, 18, 19]]]
FAILED tests-cuda/test_3140_cuda_jagged_and_masked_getitem.py::test_0111_jagged_and_masked_getitem_union_2 - assert [[], [], [], [], [], [], ...] == [[1.1, 3.3], ...8.8], [], ...]
FAILED tests-cuda/test_3140_cuda_slicing.py::test_0315_integerindex_null_more - ValueError: Negative dimensions are not allowed
FAILED tests-cuda/test_3140_cuda_slicing.py::test_0315_integerindex_null_more_2 - ValueError: Negative dimensions are not allowed
FAILED tests-cuda/test_3140_cuda_slicing.py::test_1405_slicing_untested_cases_list_option_list - IndexError: cannot slice ListArray (of length 0) with [[], [0]]: cannot fit jagged slice with length 2 ...
FAILED tests-cuda/test_3140_cuda_slicing.py::test_1405_slicing_untested_cases_list_option_list_offset - IndexError: cannot slice ListArray (of length 0) with [[], [0]]: cannot fit jagged slice with length 2 ...
FAILED tests-cuda/test_3140_cuda_slicing.py::test_1502_getitem_jagged_issue1406 - IndexError: cannot slice ListArray (of length 0) with [[], [0]]: cannot fit jagged slice with length 2 ...
FAILED tests-cuda/test_3140_cuda_slicing.py::test_1502_getitem_jagged_issue1406_success_start_offset0 - IndexError: cannot slice ListArray (of length 0) with [[], [0]]: cannot fit jagged slice with length 2 ...
FAILED tests-cuda/test_3140_cuda_slicing.py::test_1502_getitem_jagged_issue1406_success_remove_option_type - assert [[[], []]] == [[[], [2]]]
FAILED tests-cuda/test_3140_cuda_slicing.py::test_1502_getitem_jagged_issue1406_success_nonempty_list - IndexError: cannot slice ListArray (of length 128) with [[0], [0]]: cannot fit jagged slice with length...
FAILED tests-cuda/test_3140_cuda_slicing.py::test_1904_drop_none_ListArray_and_axis_None - AssertionError: broadcast_tooffsets64 can only be used with offsets that start at 0, not 1
FAILED tests-cuda/test_3140_cuda_slicing.py::test_1904_drop_none_ListOffsetArray_IndexedOptionArray_NumpyArray_outoforder - assert [[6.6, None, ... [4.4], [4.4]] == [[0.0, None, ... [4.4], [2.2]]
FAILED tests-cuda/test_3140_cuda_slicing.py::test_1904_drop_none_from_iter - assert [[], []] == [[1], [2]]
FAILED tests-cuda/test_3140_cuda_slicing.py::test_1904_drop_none_List_ByteMaskedArray_NumpyArray - ValueError: Negative dimensions are not allowed
FAILED tests-cuda/test_3140_cuda_slicing.py::test_1904_drop_none_RegularArray_RecordArray_NumpyArray - assert [[[3.3, None,....6, 7.7], []]] == [[[0.0, None,....7, 8.8], []]]
FAILED tests-cuda/test_3140_cuda_slicing.py::test_1904_drop_none_RecordArray - ValueError: Negative dimensions are not allowed
FAILED tests-cuda/test_3140_cuda_slicing.py::test_2246_slice_not_packed - assert [[], []] == [[0], [3, 4]]
FAILED tests-cuda/test_3140_cuda_slicing.py::test_0127_tomask_operation_ByteMaskedArray_jaggedslice0 - assert [[], [], [], []] == [[0.0, 1.1, 2....7, 8.8, 9.9]]
FAILED tests-cuda/test_3140_cuda_slicing.py::test_0127_tomask_operation_ByteMaskedArray_jaggedslice1 - assert [[], [], [], [], []] == [[2.2, None, ..., [7.7, None]]
FAILED tests-cuda/test_3140_cuda_slicing.py::test_0127_tomask_operation_ByteMaskedArray_jaggedslice2 - ValueError: Negative dimensions are not allowed
FAILED tests-cuda/test_3140_cuda_slicing.py::test_0127_tomask_operation_ByteMaskedArray_jaggedslice3 - assert [[[[]], [[]], [[]]], [[[]]]] == [[[[2.2, None...[7.7, None]]]]
FAILED tests-cuda/test_3140_cuda_slicing.py::test_0127_tomask_operation - assert [[3.3, 0.0, 6...0], [], [0.0]] == [[0.0, 1.1, 2....7, 8.8, 9.9]]
FAILED tests-cuda/test_3141_cuda_misc.py::test_0150_ByteMaskedArray_flatten - assert [[1.1, 12.2, ...3], [], [4.4]] == [[0.0, 1.1, 2..., 11.1, 12.2]]
FAILED tests-cuda/test_3141_cuda_misc.py::test_1586_should_preserve_regulararray_numpy_regular_axis1 - assert [[0.0, 0.0, 0....7, 8.8, 9.9]] == [[0.0, 1.1, 4....7, 8.8, 9.9]]
FAILED tests-cuda/test_3141_cuda_misc.py::test_1586_should_preserve_regulararray_regular_numpy_axis1 - assert [[5.5, 4.4, 4....4, 8.8, 6.6]] == [[0.0, 1.1, 4....7, 8.8, 9.9]]
FAILED tests-cuda/test_3141_cuda_misc.py::test_1586_should_preserve_regulararray_regular_regular_axis1 - assert [] == [[0.0, 1.1, 4....7, 8.8, 9.9]]
FAILED tests-cuda/test_3141_cuda_misc.py::test_0590_allow_regulararray_size_zero_ListOffsetArray_rpad_and_clip - assert [[], [], []] == [[1, 2, 3], [], [4, 5]]
===================================== 48 failed, 522 passed in 12.89s ======================================

I think I see the potential issue. It's possibly related to a memory access in one (or more) of the reducers, which is causing all the subsequent tests to fail. I'll investigate and fix it. Thanks!

ianna commented 1 week ago

I'm checking it with https://github.com/scikit-hep/awkward/pull/3158

ianna commented 1 week ago

I'm checking it with #3158 and https://github.com/scikit-hep/awkward/pull/3159

I've opened an issue: https://github.com/scikit-hep/awkward/issues/3160

ManasviGoyal commented 1 week ago

> @ManasviGoyal - all tests pass on my local computer with the updated branch with the import of numpy! If it works fine on yours the PR is good to be merged. Thanks.

@ianna All macOS tests are cancelled in the CI, so I am unable to merge.

ianna commented 1 week ago

@jpivarski - I think we need a different macOS node:

> Run Tests (macos-11, 3.8, x64, full)
> This is a scheduled macOS-11 brownout. The macOS-11 environment is deprecated and will be removed on June 28th, 2024.

 

ManasviGoyal commented 6 days ago

> This is excellent!!! As I understand it, this enables all axis=-1 reducers, with tests for crossing block boundaries. As we talked about in our meeting, it could have more tests of block boundary crossing and integration tests (converted from tests to tests-cuda, particularly test_0115_generic_reducer_operation.py).

@jpivarski #3162 adds all the axis=-1 tests from test_0115_generic_reducer_operation.py for CUDA. I have also added tests that check block-boundary crossing with array size 3000 (primes for ak.prod), as we discussed in the last meeting. Thanks!
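
For anyone following along, the block-boundary checks are roughly of this shape (a simplified sketch under assumed names, not the actual test code in #3162): a single long list forces one reduction segment to span several thread blocks, and the CUDA result is compared against the CPU backend.

```python
import numpy as np
import awkward as ak

def test_sum_crosses_block_boundary():  # hypothetical test name
    # One 3000-element list plus short ones, so a single reduction segment
    # spans multiple CUDA thread blocks.
    content = np.arange(3005) % 3 == 0
    offsets = np.array([0, 3000, 3001, 3005], dtype=np.int64)
    layout = ak.contents.ListOffsetArray(
        ak.index.Index64(offsets), ak.contents.NumpyArray(content)
    )
    cpu = ak.Array(layout)
    gpu = ak.to_backend(cpu, "cuda")

    assert ak.sum(gpu, axis=-1).tolist() == ak.sum(cpu, axis=-1).tolist()
```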