Closed — jaycedowell closed this 1 year ago
I think it works, but I'm not super happy with the whole `raise bf.device.GraphCreatedError` part. I'm open to suggestions on how to restructure things.
Merging #183 (e359f33) into master (a4be8b5) will increase coverage by 0.49%. The diff coverage is 78.57%.
@@ Coverage Diff @@
## master #183 +/- ##
==========================================
+ Coverage 66.81% 67.30% +0.49%
==========================================
Files 69 69
Lines 7410 7543 +133
==========================================
+ Hits 4951 5077 +126
- Misses 2459 2466 +7
Impacted Files | Coverage Δ |
---|---|
python/bifrost/libbifrost.py | 71.11% <23.07%> (-8.11%) :arrow_down: |
python/bifrost/device.py | 87.14% <91.22%> (+15.71%) :arrow_up: |
python/bifrost/libbifrost_generated.py | 73.28% <0.00%> (+0.43%) :arrow_up: |
python/bifrost/fft.py | 100.00% <0.00%> (+4.54%) :arrow_up: |
python/bifrost/ndarray.py | 89.60% <0.00%> (+6.09%) :arrow_up: |
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c866fd2...e359f33.
I might be happy with 000fc53. I'll update the description.
Maybe it's not ready with the latest set of changes...
Even though e359f33 seems like it will work, it has problems. You can still end up in a situation where the host callback data structures are overwritten by a subsequent `bifrost.fft.Fft` call if the same object is reused; think a forward transform followed by an inverse transform. There isn't a synchronization-free way I can think of to ensure that all of the first call has been processed before the second call is scheduled.
Graphs are starting to look like the wrong solution to the problem I was trying to solve.
cuFFTDx might make a difference for this.
This PR adds support for using CUDA graphs inside a Bifrost block via a new `bifrost.device.Graph` context manager. `Graph` is used as follows, for a loop over `i`:

* `i=0`: Entering the `graph` context manager does nothing, so that `bifrost.map` and other things that need initialization can run without interacting with graph creation.
* `i=1`: Entering the `graph` context manager launches a `cudaStreamBeginCapture` call. When `graph.__exit__()` is called on leaving the context, the stream capture ends and an executable CUDA graph is created.
* `i>=2`: Subsequent iterations over `i` jump over the block body with the `if` statement, and the context manager executes the CUDA graph at `__exit__()`.

This should help reduce execution overheads on subsequent calls of a loop body. The only caveat I've found is that stream synchronization calls are not allowed inside a graph. This means that the standard `bifrost.ndarray.copy_array` function cannot be used when the two arrays are in different memory spaces. To work around that, `Graph` has a `copy_array` method which does the same thing as its `bifrost.ndarray` counterpart but does not try to synchronize the stream. This is part of the reason for the call to `stream_synchronize()` after exiting the `graph` context.

This PR should be merged after #167.
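To make the three-phase flow concrete, here is a pure-Python sketch that mimics the control flow described above with a toy context manager. `ToyGraph`, its attributes, and its methods are hypothetical stand-ins invented for illustration; they are not the actual `bifrost.device.Graph` API and no CUDA calls are made:

```python
class ToyGraph:
    """Toy stand-in for bifrost.device.Graph, mimicking the pattern:
    iteration 0 is a warm-up no-op, iteration 1 'captures' the body,
    and iterations >= 2 skip the body and replay the captured graph."""

    def __init__(self):
        self.iteration = -1    # number of times the context has been entered
        self.captured = None   # the "graph" recorded on iteration 1
        self.replays = 0       # how many times the graph was replayed

    def created(self):
        # True once a graph exists; the loop body should then be skipped.
        return self.captured is not None

    def __enter__(self):
        self.iteration += 1
        return self

    def record(self, work):
        # Stand-in for work issued on the stream during capture;
        # only the capture iteration (i=1) actually records anything.
        if self.iteration == 1:
            self.captured = work

    def __exit__(self, exc_type, exc, tb):
        if self.iteration >= 2 and self.captured is not None:
            self.replays += 1  # replay the captured "graph" at exit
        return False


log = []
graph = ToyGraph()
for i in range(5):
    with graph:
        if not graph.created():
            # Body runs normally on i=0 (warm-up) and i=1 (capture)...
            log.append(f"body ran on i={i}")
            graph.record("work")
        # ...and is skipped on i>=2, where __exit__ replays the graph.
```

After the loop, `log` holds entries only for `i=0` and `i=1`, and the graph has been replayed three times (for `i=2,3,4`), matching the behavior the PR description attributes to the real context manager.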