scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License

chore: trying atomics and tree reduction for CUDA reducer kernels #3123

Closed. ManasviGoyal closed this 5 days ago.

ManasviGoyal commented 1 month ago

Kernels tested for different block sizes
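
For context, a minimal standalone sketch (not this PR's actual kernels) of the two strategies being compared: one atomicAdd per element versus a shared-memory tree reduction with one atomicAdd per block, written as CuPy raw kernels. The tree reduction trades most of the atomic contention for shared-memory adds, which is why the block size matters and is worth scanning.

import cupy as cp

n, block = 1_000_000, 256          # block must be a power of two for the tree
grid = (n + block - 1) // block
x = cp.ones(n, dtype=cp.int64)

# strategy 1: one atomicAdd per element (simple, but contends on one counter)
atomic_sum = cp.RawKernel(r'''
extern "C" __global__
void atomic_sum(const long long* x, unsigned long long* out, long long n) {
    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, (unsigned long long)x[i]);
}
''', 'atomic_sum')

# strategy 2: shared-memory tree reduction, then one atomicAdd per block
tree_sum = cp.RawKernel(r'''
extern "C" __global__
void tree_sum(const long long* x, unsigned long long* out, long long n) {
    extern __shared__ long long shm[];
    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    shm[threadIdx.x] = (i < n) ? x[i] : 0;
    __syncthreads();
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {  // halve active threads
        if (threadIdx.x < s) shm[threadIdx.x] += shm[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(out, (unsigned long long)shm[0]);
}
''', 'tree_sum')

out = cp.zeros(1, dtype=cp.uint64)
atomic_sum((grid,), (block,), (x, out, cp.int64(n)))
print(int(out[0]))  # 1000000

out = cp.zeros(1, dtype=cp.uint64)
tree_sum((grid,), (block,), (x, out, cp.int64(n)), shared_mem=block * 8)
print(int(out[0]))  # 1000000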

lgray commented 3 weeks ago

@ManasviGoyal I'm working on implementing https://github.com/CoffeaTeam/coffea-benchmarks/blob/master/coffea-adl-benchmarks.ipynb as much as possible in cuda kernels to start benchmarking realistic throughput.

We've already put together cupy based histograms that conform to HEP expectations, so we can nominally do full analysis workflows on the GPU.
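
(As a rough sketch of what GPU-resident histogramming means here, using plain CuPy rather than our actual histogram code: the fill and the counts never leave the device.)

>>> import cupy as cp
>>> values = cp.random.normal(0.0, 1.0, size=1_000_000)
>>> counts, edges = cp.histogram(values, bins=50, range=(-5.0, 5.0))  # stays on the GPU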

@nsmith- will be working on a first-try at uproot-on-GPU using DMA over PCI-express.

I'll be working on a mock-up using parquet and cudf so we can understand the full workload's performance.

The first thing we're missing is the ability to slice arrays, which I understand from talking to Jim is intertwined with the reducer implementation. I'm happy to help test things in realistic use cases when you have implementations ready. Keep us in the loop and we'll be responsive!

ManasviGoyal commented 3 weeks ago

> @ManasviGoyal I'm working on implementing https://github.com/CoffeaTeam/coffea-benchmarks/blob/master/coffea-adl-benchmarks.ipynb as much as possible in cuda kernels to start benchmarking realistic throughput. [...] Keep us in the loop and we'll be responsive!

Sure, I'll keep you updated. I am still figuring out how to handle some cases for reducers. Are there any specific kernels that you need first for slicing? I can prioritize them. The best way to test would be to write a test with arrays in the cuda backend and see what error message it gives: the error names the missing kernel that the function needs.
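
For example, something like this prints the name of the missing kernel (assuming, hypothetically, that ak.min is one of the reducers not yet ported):

>>> import awkward as ak
>>> array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]], backend="cuda")
>>> try:
...     ak.min(array, axis=1)  # hypothetical: assumes this reducer is unported
... except AssertionError as err:
...     print(err)  # "CuPyKernel not found: ('awkward_reduce_min', ...)" or similar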

lgray commented 3 weeks ago

I only have access to virtualized GPUs (they are MIG-partitioned A100s at Fermilab), and for some reason, instead of giving me an error, it hangs forever! So that's a bit of a showstopper on my side.

As highest priority we would need boolean slicing, and as next highest priority index-based slicing. After that we'll need argmin and argmax on the reducer side!
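
Concretely, in priority order (a sketch; each of these should raise the "CuPyKernel not found" error until its kernels exist):

>>> import awkward as ak
>>> events = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]], backend="cuda")
>>> cut = events[events > 3.0]                                     # 1. boolean slicing
>>> picked = events[ak.Array([[2, 0], [], [1]], backend="cuda")]   # 2. index-based slicing
>>> best = ak.argmax(events, axis=1)                               # 3. argmin/argmax reducers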

lgray commented 3 weeks ago

If you have a FNAL computing account I can help you reproduce the failure mode I am seeing.

ManasviGoyal commented 3 weeks ago

> If you have a FNAL computing account I can help you reproduce the failure mode I am seeing.

I don't have a FNAL computing account. But in the current state it should give you a "kernel not implemented" error; if you get any other error, it might be due to a different reason. Maybe you can open an issue explaining the steps to reproduce and the error, and I can check it on my GPU.

lgray commented 3 weeks ago

The major problem blocking a simple reproducer is that it involves setting up kubernetes and mounting a MIG-partitioned virtualized GPU into a container in order to get the faulty behavior. Some of these configuration options are not possible with a consumer GPU (particularly MIG partitioning), and I have no idea which component is causing the problem.

Do you have access to a cluster with such a setup through other means?

jpivarski commented 3 weeks ago

I thought the error we were talking about was just slicing ragged arrays:

>>> import awkward as ak
>>> array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]], backend="cuda")
>>> array > 3
<Array [[False, False, True], [], [True, True]] type='3 * var * bool'>
>>> array[array > 3]

although this does give the expected "kernel not found" error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward/src/awkward/highlevel.py", line 1065, in __getitem__
    prepare_layout(self._layout[where]),
                   ~~~~~~~~~~~~^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 512, in __getitem__
    return self._getitem(where)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 565, in _getitem
    return self._getitem(where.layout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 640, in _getitem
    return self._getitem((where,))
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 557, in _getitem
    out = next._getitem_next(nextwhere[0], nextwhere[1:], None)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/regulararray.py", line 698, in _getitem_next
    down = self._content._getitem_next_jagged(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/listoffsetarray.py", line 423, in _getitem_next_jagged
    return out._getitem_next_jagged(slicestarts, slicestops, slicecontent, tail)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/listarray.py", line 546, in _getitem_next_jagged
    self._backend[
  File "/home/jpivarski/irishep/awkward/src/awkward/_backends/cupy.py", line 43, in __getitem__
    raise AssertionError(f"CuPyKernel not found: {index!r}")
AssertionError: CuPyKernel not found: ('awkward_ListArray_getitem_jagged_apply', <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>)

I can ssh into pivarski@cmslpc-el8.fnal.gov and pivarski@cmslpc-el9.fnal.gov. Is this something that I could run and at least find the relevant parts for @ManasviGoyal to investigate?

ManasviGoyal commented 3 weeks ago

@lgray I have started working on slicing kernels along with the reducers so that you can start testing. I tested the example @jpivarski gave in #3140 and it works. You can try this simple example and see if it works on your GPU. There is one more slicing kernel left now (I still need to test the ones I have added more extensively). I will add the rest of the kernels you mentioned as soon as possible.

>>> import awkward as ak
>>> array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]], backend="cuda")
>>> array > 3
<Array [[False, False, True], [], [True, True]] type='3 * var * bool'>
>>> array[array > 3]
<Array [[3.3], [], [4.4, 5.5]] type='3 * var * float64'>

lgray commented 3 weeks ago

awesome!

lgray commented 3 weeks ago

Ah - also - combinations / argcombinations is relatively high priority as well.
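
(For concreteness, a sketch of the per-event pair building these are used for:)

>>> import awkward as ak
>>> jets = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]], backend="cuda")
>>> pairs = ak.combinations(jets, 2)       # all unordered pairs within each event
>>> indices = ak.argcombinations(jets, 2)  # the same pairs, as index tuples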

lgray commented 3 weeks ago

@jpivarski the error I was talking about should be that error, but instead the process hangs indefinitely with no error emitted.

lgray commented 3 weeks ago

@ManasviGoyal to make the prioritization a little bit more clear, you can use this set of analysis functionality benchmarks: https://github.com/CoffeaTeam/coffea-benchmarks/blob/master/coffea-adl-benchmarks.ipynb

These are what I am currently using to see what's possible on GPU.

Since this PR isn't merged in yet, and we found some other issues today, I've currently just finished Query 3. Query 4 requires this PR, since it contains a reduction that you've already implemented.

You can more or less look for the various awkward operations in these functionality tests and prioritize what's needed in that order!

ManasviGoyal commented 3 weeks ago

> @ManasviGoyal to make the prioritization a little bit more clear, you can use this set of analysis functionality benchmarks: https://github.com/CoffeaTeam/coffea-benchmarks/blob/master/coffea-adl-benchmarks.ipynb [...]

Thanks! This helps a lot in prioritizing the kernels. I'll finish up all the reducers soon and then start on combinations.

lgray commented 3 weeks ago

I was also trying to check just this PR against the sum memory usage I brought up over in #3136, but it seems that isn't actually implemented here yet. Looking in the files, awkward_reduce_sum_int32_bool_64 and awkward_reduce_sum_int64_bool_64 seem to only reimplement awkward_reduce_countnonzero, probably just a simple mistake.

But I can't really proceed with checking due to that.

ianna commented 1 week ago

@ManasviGoyal - what is the status of this PR? Are you working on it? Thanks!

ManasviGoyal commented 1 week ago

> @ManasviGoyal - what is the status of this PR? Are you working on it? Thanks!

This PR just includes some studies I did for reducers. It can either be merged into the studies directory or I can just close it. But I will still check on Monday whether there is anything pending to be pushed.

jpivarski commented 1 week ago

> But I will still check on Monday whether there is anything pending to be pushed.

I'll wait until you're done with that.

It has to be merged by me, because it won't pass otherwise: merging requires "merge without waiting for requirements to be met", since the tests don't run if there is no change to the code, which is the case here because it only affects studies.

Meanwhile, though, I'll bring this up to the present.

jpivarski commented 5 days ago

It looks like you didn't have any updates on Monday (and if any turn up, they can go in a new PR), so I'll merge this into studies now.

lgray commented 5 days ago

🎉 🚀

ManasviGoyal commented 5 days ago

> It looks like you didn't have any updates on Monday (and if any turn up, they can go in a new PR), so I'll merge this into studies now.

Yes. There were no other commits to be pushed. Thanks!