Merging #167 (96d99ac) into master (f82ae97) will increase coverage by 0.24%. The diff coverage is 68.75%.
```diff
@@            Coverage Diff             @@
##           master     #167      +/-   ##
==========================================
+ Coverage   66.76%   67.00%   +0.24%
==========================================
  Files          69       69
  Lines        7416     7464      +48
==========================================
+ Hits         4951     5001      +50
+ Misses       2465     2463       -2
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| python/bifrost/libbifrost.py | 71.11% <23.07%> (-8.11%) | :arrow_down: |
| python/bifrost/device.py | 81.63% <85.71%> (+10.20%) | :arrow_up: |
| python/bifrost/ndarray.py | 89.60% <0.00%> (+6.09%) | :arrow_up: |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update f82ae97...96d99ac.
Interesting. Is it worthwhile to code something like the `cupy` example shown above as a test? Not that we need to add cupy as a test dependency… maybe we can mock something. Just so as not to cede the coverage score. :) I'll play a little.
I was thinking about adding some cupy tests yesterday. It would be good from a coverage perspective to be able to test that we can make `bifrost.ndarray`s from `cupy.ndarray`s and vice versa. This bifrost/cupy interoperability would also be an interesting test case since it would double as an example of how to do this; see the sketch below. The biggest thing that I see with the interoperability test is checking that the stream binding actually works. I was testing with `nvprof`/`nvvp` yesterday but that is way too heavy to put in. The other thing to keep in mind is that our coverage should improve once we have the dedicated test machine and can run (and report) all of the CUDA-specific tests and paths.
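A minimal sketch of that round trip, for illustration only; the `bf.ndarray(..., space='cuda')` construction is an assumption about the Python API, and the hand-off back to cupy assumes `bifrost.ndarray` exposes `__cuda_array_interface__`:

```python
import cupy as cp
import bifrost as bf

cp_arr = cp.arange(10, dtype=cp.float32)   # GPU array created by cupy
bf_arr = bf.ndarray(cp_arr, space='cuda')  # assumed constructor; wrap as a bifrost.ndarray
cp_back = cp.asarray(bf_arr)               # assumes __cuda_array_interface__ support
assert bool((cp_arr == cp_back).all())     # data survived the round trip
```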
Sort of related to this, if you look in `bifrost.ndarray` we also have support for pycuda in there. I used to use pycuda but I've really shifted to cupy recently. It might be worth thinking about how important it is to provide interoperability with both, or if one is better/easier/sufficient.
I did some reading on pycuda vs. cupy today and I think the answer to my previous interoperability question is "yes, keep both". They are targeted at two different audiences: cupy is an easy way to move onto the GPU if you have a lot of numpy-based code, while pycuda is more about building things from scratch in a very CUDA way. In that regard, how you can interface pycuda and Bifrost is limited to moving data between the array formats, since I can't find an interface in pycuda that allows you to use an external stream. cupy, on the other hand, does have that feature and I use it in the example.
Assuming the tests pass, this all looks great to me. (I can probably try it later today.) And it makes a great case for interoperability with other frameworks. Will maybe want to tout that in release notes and documentation next time.
Could it be worthwhile to put `BifrostStreamManager` into the library, rather than just the test file? You're providing `get_stream` and `set_stream`, which can be thought of as a lower-level interface, but in most cases of Python programming it seems like calling them within a context manager is The Right Thing; see the sketch below.

(Maybe if it's in the library, `BifrostStreamManager` wants a more precise name? Stream substituter? Stream replacer? StreamRedirect?)
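Concretely, something like this sketch, assuming the `get_stream`/`set_stream` pair from the diff above lives in `bifrost.device` and traffics in raw stream handles:

```python
from contextlib import contextmanager

from bifrost import device  # assumed home of get_stream/set_stream

@contextmanager
def use_stream(new_stream):
    """Temporarily make `new_stream` Bifrost's stream, then restore."""
    orig = device.get_stream()   # remember the original Bifrost stream
    device.set_stream(new_stream)
    try:
        yield
    finally:
        device.set_stream(orig)  # restore even if the body raised
```

A test could then write `with use_stream(some_cupy_stream.ptr): ...` and everything after the block would still see the original stream.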
`orig_stream` and `BifrostStreamManager` were reactions to problems I was having in `test_stream`. When another framework creates a stream, Bifrost can use it, but we are subject to whatever lifetime was chosen for that stream. When I wasn't being careful I would forget to save the original Bifrost stream, and running this test would make all of the other tests after it fail because the stream was gone.
Maybe the thing to do is use the terminology of cupy and call it something like `ExternalStream` or `UseExternalStream`?
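For comparison, cupy's own spelling of the idea is `cupy.cuda.ExternalStream`, which wraps a raw stream pointer and doubles as a context manager. A quick sketch of driving cupy with Bifrost's stream (the `bifrost.device` location, and `get_stream()` returning a raw handle, are assumptions):

```python
import cupy as cp
from bifrost import device  # assumed location of the new get_stream

# Make cupy launch its kernels on whatever stream Bifrost is using,
# so the two frameworks' work is ordered on a single stream.
with cp.cuda.ExternalStream(device.get_stream()):
    ...  # cupy calls here share Bifrost's stream
```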
I'm holding off on this to see if I can get a self-hosted runner going on the test machine. It sort of works but there are many mysteries.
Those really slow builds on self-hosted/Python3.[68] are all for this.
Yeah I noticed that. It's just the build, not especially the tests that use it, right? Possible/desirable to pip-install it into the docker container ahead of time? Or is the runner python a totally different environment from the docker-image python?
I think it's the PyCUDA/CuPy install (mainly CuPy) that is the problem. I don't know that there is a way to pre-install it since we are using `actions/setup-python@v2` to set up the Python environment that we will use for the build.
One thing we could do is run two self-hosted runners at once, one per GPU.
Also maybe it's not necessary to include the CuPy tests for both 3.6 and 3.8?
The cupy-cuda112 wheel is helping. It dropped the "Software Install - Python, part 2" time down to ~1 min from ~12 min. If we keep using the wheel then we need to make sure we are using the correct one for whatever version of CUDA we have in the Dockerfile.
We should be running two runners now, one on each GPU. I should test that.
Two runners get it down to ~30 minutes. Another way to save time would be to drop the Py2.7 self-hosted test. That would mean giving up Py2.7 testing for the GPU-related code.
I think I'm happy with what is in here now that there is the "per-thread default stream" flag in `configure`. The only thing I'm tempted to continue working on is this seemingly random `test_matmul_aa_ci8` failure that we sometimes get. The full error is:
```
ERROR: test_matmul_aa_ci8 (test_linalg.TestLinAlg)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_linalg.py", line 302, in test_matmul_aa_ci8
    self.run_test_matmul_aa_ci8_transpose(True)
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_linalg.py", line 292, in run_test_matmul_aa_ci8_transpose
    self.run_test_matmul_aa_ci8_shape((99+kp,3+kp), transpose=transpose)
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_linalg.py", line 72, in run_test_matmul_aa_ci8_shape
    self.linalg.matmul(1, a, None, 0, c)
  File "/home/docker/actions-runner/_work/_tool/Python/2.7.18/x64/lib/python2.7/site-packages/bifrost/linalg.py", line 66, in matmul
    c_array))
  File "/home/docker/actions-runner/_work/_tool/Python/2.7.18/x64/lib/python2.7/site-packages/bifrost/libbifrost.py", line 139, in _check
    raise RuntimeError(status_str)
RuntimeError: BF_STATUS_UNSUPPORTED_SHAPE
```
I haven't been able to reproduce that failure. Oh well.
@league do you think this one is ready to merge? I can't think of anything else I'd like to see as part of it.
With bbd4b67, I thought it might be useful for the config summary to show what happened with the stream-model setting, so it's not just buried in the scroll.
I thought about adding it to `python -m bifrost.version --config` too, but for the moment it doesn't seem that it's directly passed into Python (it's not in `config.h`). But maybe that's fine.
Trying to test on cetus, I discovered another thing: permission-denied errors on `/dev/shm/bifrost` because it's owned by you. The top level of it is globally writable, so it may work for my process just to create new stuff underneath, but it gets stuck on the permissions before trying that, I think. I wonder if we should use a UUID under there rather than the same global name for every user?
This happens on every CUDA test:
```
test_serialize_with_name_no_ringlets (test_serialize.SerializeTest) ... proclog.cpp:176 internal error:
  filesystem error: cannot set permissions: Operation not permitted [/dev/shm/bifrost]
ERROR
```

```
▶ ls -ld /dev/shm/bifrost
drwxrwxrwx 3 jdowell jdowell 60 Aug  3 09:59 /dev/shm/bifrost/
```
This problem seems familiar (as in something I've fixed in the past), but looking back through the history the only thing I see is f2691ff. That's a different problem, cleaning up stale contents, but it has similar "stuff from other users" roots. I'm kind of surprised that it is even trying to change the permissions, since world-writable is what we want. Maybe we need to only try to make `/dev/shm/bifrost` world-writable if it is not already.
The UUID thing is interesting. What are you thinking with this? That there would be a per-user `/dev/shm/bifrost-XXXXXX` directory?
Regarding `/dev/shm/bifrost-UUID`, yeah, I think so. At first I was thinking we could even reuse the telemetry ID if it's always loaded anyway. But I think any routines to load the telemetry state file live on the Python side, and this might be initiated from the C++ side.
What's the number right beneath `/dev/shm/bifrost`, is it a PID? What's left there now is 1329696. If that's a PID and it's meant to be cleaned up (except when it isn't), then it's probably best to just keep `/dev/shm/bifrost` world-writable (and not choke on chmod if it's already world-writable but owned by someone else), but beneath that continue to use user-owned PID subdirs?
Ooh! Exciting security-hole opportunity, haha. `proclog.cpp` uses `make_dir` from `fileutils.cpp`, so I reread my comment near the top of that file:
```cpp
/* NOTE: For convenience on systems with compilers without C++17 support, these
   functions build a shell command and pass it to system(). The PATH argument is
   not shell-quoted or otherwise sanitized, so only use with program constants,
   not with data from command line or config files. */
```
Combine that with the fact that I was just proposing pulling an identifier out of a config file and using it in the filename passed to `make_dir`. 🧨 🔥
Yeah, that number is the PID. Having everything in one place makes it easier to at least attempt to clean up after a process that died before it could do so itself. I guess that would also be true of `/dev/shm/bifrost-UUID`, but then you are limited to cleaning up only your own dead processes.
Looks like the permission-denied issue is limited to `HAVE_CXX_FILESYSTEM`. On the command line, this succeeds:

```
▶ mkdir -p -m 777 /dev/shm/bifrost
```

but `std::filesystem::permissions` throws an exception.
Trying this, which is probably closer to what `mkdir -p -m` does anyway:

```cpp
// Only set the permissions on a directory we actually created; if it
// already exists (possibly owned by another user), leave it alone.
bool created = std::filesystem::create_directories(path);
if( created ) {
    std::filesystem::permissions(path, (std::filesystem::perms) perms,
                                 std::filesystem::perm_options::replace);
}
```
Using `sudo` to nuke your `/dev/shm/bifrost` so I can make sure the fix creates it correctly for me.
This PR exposes the `bfStreamGet` and `bfStreamSet` methods through the Python API. This can help with interfacing Bifrost with other CUDA-aware packages, such as `cupy`; see the example below. This PR also adds `set_stream` and `get_stream` methods to the `BifrostObject`s that support them.
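A sketch of the kind of usage this enables (hedged: `bifrost.device` as the module location, and a `set_stream` that accepts a raw stream handle, are assumptions rather than confirmed signatures):

```python
import cupy as cp
from bifrost import device  # assumed home of the new get_stream/set_stream

s = cp.cuda.Stream()        # a stream created by another framework
orig = device.get_stream()  # remember Bifrost's current stream
device.set_stream(s.ptr)    # point Bifrost at cupy's stream (assumed signature)
# ... enqueue Bifrost and cupy work, now ordered on the same stream ...
device.set_stream(orig)     # restore before cupy's stream is destroyed
```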