Closed — jaycedowell closed this 1 year ago
I think it works, but I'm not super happy with the whole `raise bf.device.GraphCreatedError` part. I'm open to suggestions on how to restructure things.
Merging #183 (e359f33) into master (a4be8b5) will increase coverage by 0.49%. The diff coverage is 78.57%.
@@ Coverage Diff @@
## master #183 +/- ##
==========================================
+ Coverage 66.81% 67.30% +0.49%
==========================================
Files 69 69
Lines 7410 7543 +133
==========================================
+ Hits 4951 5077 +126
- Misses 2459 2466 +7
Impacted Files | Coverage Δ |
---|---|
python/bifrost/libbifrost.py | 71.11% <23.07%> (-8.11%) :arrow_down: |
python/bifrost/device.py | 87.14% <91.22%> (+15.71%) :arrow_up: |
python/bifrost/libbifrost_generated.py | 73.28% <0.00%> (+0.43%) :arrow_up: |
python/bifrost/fft.py | 100.00% <0.00%> (+4.54%) :arrow_up: |
python/bifrost/ndarray.py | 89.60% <0.00%> (+6.09%) :arrow_up: |
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c866fd2...e359f33.
I might be happy with 000fc53. I'll update the description.
Maybe it's not ready with the latest set of changes...
Even though e359f33 seems like it will work, it has problems. You can still end up in a situation where the host callback data structures are overwritten by a subsequent `bifrost.fft.Fft` call if the same object is reused; think a forward transform followed by an inverse transform. There isn't a synchronization-free way I can think of to ensure that all of the first call has been processed before the second call is scheduled.
Graphs are starting to look like the wrong solution to the problem I was trying to solve.
cuFFTDx might make a difference for this.
This PR adds support for using CUDA graphs inside a Bifrost block via a new `bifrost.device.Graph` context manager. `Graph` is used as follows, for a loop over `i`:

* `i=0`: Entering the `graph` context manager does nothing, so that `bifrost.map` and other things that need initialization can run without interacting with graph creation.
* `i=1`: Entering the `graph` context manager launches a `cudaStreamBeginCapture` call. When `graph.__exit__()` is called on leaving the context, the stream capture ends and an executable CUDA graph is created.
* `i>=2`: Subsequent iterations over `i` jump over the block body with the `if` statement, and the context manager executes the CUDA graph at `__exit__()`.

This should help reduce execution overheads on subsequent calls of a loop body. The only caveat I've found is that stream synchronization calls are not allowed inside a graph. This means that the standard `bifrost.ndarray.copy_array` function cannot be used when the two arrays are in different memory spaces. To work around that, `Graph` has a `copy_array` method which does the same thing as its `bifrost.ndarray` counterpart but does not try to synchronize the stream. This is part of the reason for the call to `stream_synchronize()` after exiting the `graph` context.

This PR should be merged after #167.
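To make the three-phase flow concrete, here is a pure-Python sketch that mimics the control flow described above with a toy context manager. `ToyGraph`, its attributes, and its methods are hypothetical stand-ins invented for illustration; they are not the actual `bifrost.device.Graph` API and no CUDA calls are made:

```python
class ToyGraph:
    """Toy stand-in for bifrost.device.Graph, mimicking the pattern:
    iteration 0 is a warm-up no-op, iteration 1 'captures' the body,
    and iterations >= 2 skip the body and replay the captured graph."""

    def __init__(self):
        self.iteration = -1    # number of times the context has been entered
        self.captured = None   # the "graph" recorded on iteration 1
        self.replays = 0       # how many times the graph was replayed

    def created(self):
        # True once a graph exists; the loop body should then be skipped.
        return self.captured is not None

    def __enter__(self):
        self.iteration += 1
        return self

    def record(self, work):
        # Stand-in for work issued on the stream during capture;
        # only the capture iteration (i=1) actually records anything.
        if self.iteration == 1:
            self.captured = work

    def __exit__(self, exc_type, exc, tb):
        if self.iteration >= 2 and self.captured is not None:
            self.replays += 1  # replay the captured "graph" at exit
        return False


log = []
graph = ToyGraph()
for i in range(5):
    with graph:
        if not graph.created():
            # Body runs normally on i=0 (warm-up) and i=1 (capture)...
            log.append(f"body ran on i={i}")
            graph.record("work")
        # ...and is skipped on i>=2, where __exit__ replays the graph.
```

After the loop, `log` holds entries only for `i=0` and `i=1`, and the graph has been replayed three times (for `i=2,3,4`), matching the behavior the PR description attributes to the real context manager.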