Closed jaycedowell closed 2 years ago
I thought this was working, but I'm adding additional tests to `test_ndarray` and now it doesn't look like it.
Everything seems to check out now with 5990366. The only thing would be if there was a more efficient way to implement `copy`.
It looks like the skip-duplicate-actions is working out. They still show up as jobs, but I guess it's easy enough to click onto the real ones. (From what I was reading, it's hard to eliminate the job entirely as long as you want to run for push-to-branch by maintainers but also new-pull-request from outsiders.)
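For context, the gating pattern with skip-duplicate-actions usually looks something like this in a workflow file (a sketch using the `fkirc/skip-duplicate-actions` action; the job names and the final step are illustrative, not the actual workflow):

```yaml
# Sketch: gate the real job on a pre-job that detects duplicate runs.
jobs:
  pre_job:
    runs-on: ubuntu-latest
    outputs:
      should_skip: ${{ steps.skip_check.outputs.should_skip }}
    steps:
      - id: skip_check
        uses: fkirc/skip-duplicate-actions@v5
  build_and_test:
    needs: pre_job
    # The duplicate job still appears in the job list, but exits immediately.
    if: needs.pre_job.outputs.should_skip != 'true'
    runs-on: ubuntu-latest
    steps:
      - run: echo "real work here"
```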
I was thinking it might also help to simplify the build matrix, to be more judicious or selective about what each configuration "brings to the table" rather than just multiplying everything by everything. We want coverage of the `--enable` flags or `#ifdef`s, but every possible combination is probably not productive. If we push something that works with Python 3.8 but causes an issue with 3.6, it's helpful to know that – but the issue is probably consistent across most of the other permutations.
If I had to guess what combinations would be useful:
I don't know that I've come up with a good way to roll different CUDA versions in with the self-hosted runner. The best I've been able to come up with is to alternate the container that is used between self-hosted runs. It's like testing multiple versions of CUDA, but with unknown combinations of other parameters.
> different CUDA versions in with the self hosted runner
Might actually be something that nix can help with. Through nixpkgs, I can install cuda-X.Y and build with it – even on my GPU-less thinkpad. You just can't query the hardware (so configure needs its `--with-gpu-archs`). Hypothetically one could even do those builds on GitHub CI machines. (The built artifacts won't run without a GPU, however.) I haven't tried doing GPU builds on GitHub yet, mainly because the download from nvidia can take forever; it's a many-GB package for each one. The workflow feature to cache stuff from one build to the next could help, but I'm not sure what the size limitations are.
But on self-hosted server, maybe it would work. Can you describe more about how self-hosted works? It's building on a standard ubuntu docker image but with GPU pass-through? Can we select which image to use, and is that per-build or per-repo? I can create a branch to try out some small experiments with self-hosted.
Yes, it's based on a standard Ubuntu image with CUDA 11 installed. It's set up as an ephemeral runner, so you get only one workflow/test per container launch. I have a cron job that makes sure there is always a container running. Everything that makes it work is on the test machine in `~jdowell/selfhosted/`. The biggest thing that I think it is missing right now, apart from the possibility of different CUDA versions, is limiting its access to only a single GPU.
> based on a standard Ubuntu image with CUDA 11 installed
I see it now. So this means that installation of dependencies can be sort of partitioned between those specified in this Dockerfile (which would transcend different builds) and those in `main.yml` (which are repeated with each build).
And further, if we wanted any additional data/packages to be cached or imported locally, we could do that via COPY or even VOLUME in the Dockerfile.
That's right. The Dockerfile gives us the base system+CUDA that is ~constant between tests and then main.yml provides the test environment (at least the Python portion/dependencies) as well as the tests themselves. I will say that I'm a little nervous about using too much COPY or VOLUME to expose things to Docker. This is partially because of wanting to isolate the container as much as possible and partially because I had some problems using long-lived containers running multiple tests when I was first setting this up.
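As a concrete illustration of that split, the runner's Dockerfile might look roughly like this (a sketch only; the CUDA base image tag and package list are assumptions, not the actual file):

```dockerfile
# Base system + CUDA: the ~constant part that transcends individual builds.
FROM nvidia/cuda:11.4.2-devel-ubuntu20.04

# System-level dependencies shared by every test run.
RUN apt-get update && apt-get install -y python3 python3-pip git

# Python dependencies and the tests themselves are installed per-build
# by main.yml, not baked into the image.
```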
Everything works (again)!
Merging #174 (53cb1a0) into master (317d653) will increase coverage by 8.44%. The diff coverage is 90.90%.

:exclamation: Current head 53cb1a0 differs from pull request most recent head a4be8b5. Consider uploading reports for the commit a4be8b5 to get more accurate results.
```diff
@@            Coverage Diff             @@
##           master     #174      +/-   ##
==========================================
+ Coverage   58.37%   66.81%   +8.44%
==========================================
  Files          67       69       +2
  Lines        5744     7410    +1666
==========================================
+ Hits         3353     4951    +1598
- Misses       2391     2459      +68
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| python/bifrost/ndarray.py | 83.51% <90.90%> (+10.36%) | :arrow_up: |
| python/bifrost/version/__init__.py | 100.00% <0.00%> (ø) | |
| python/bifrost/libbifrost_generated.py | 72.85% <0.00%> (ø) | |
| python/bifrost/ring2.py | 85.71% <0.00%> (+0.25%) | :arrow_up: |
| python/bifrost/pipeline.py | 84.87% <0.00%> (+1.17%) | :arrow_up: |
| python/bifrost/DataType.py | 66.89% <0.00%> (+2.02%) | :arrow_up: |
| python/bifrost/blocks/transpose.py | 96.87% <0.00%> (+3.12%) | :arrow_up: |
| python/bifrost/memory.py | 78.94% <0.00%> (+3.50%) | :arrow_up: |
| python/bifrost/libbifrost.py | 79.22% <0.00%> (+23.37%) | :arrow_up: |
| ... and 15 more | | |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 317d653...a4be8b5. Read the comment docs.
It'd be really nice if we could get the codecov report off the self hosted runner but I have yet to figure out how to do that.
Definitely, we want to see the coverage reported by a GPU runner. I'll toy with that a little today.
It looks like it is uploading, but from multiple platforms so the self-hosted one probably doesn't "win". I'll try some stuff, but apologies if it revisits what you already figured out. :)
Ah, now I see how `env_vars` works and where to find the reports per OS: https://codecov.io/gh/ledatelescope/bifrost/commit/35d9f702414de9feff2f234a22267f25d90f069c/build
So it's getting all three, I'm just not sure how/whether it's merging them together for the front-page numbers that it shows.
I downloaded the per-OS files on the build page. It does show that self-hosted has an increment higher than the other two. I don't know why they don't exactly match the final numbers in the interface:
All of those report files consider `lines-valid` to be 7410. Then the interface shows the project totals ratio to be 3364/5766 = 0.5834 (close to, but not exactly the same as, the ubuntu/macos number).
```
ubuntu.xml:298:<coverage version="6.3.2" timestamp="1650589912156" lines-valid="7410" lines-covered="4472" line-rate="0.6035" branches-covered="0" branches-valid="0" branch-rate="0" complexity="0">
selfhost.xml:483:<coverage version="6.3.2" timestamp="1650591784082" lines-valid="7410" lines-covered="4951" line-rate="0.6682" branches-covered="0" branches-valid="0" branch-rate="0" complexity="0">
macos.xml:298:<coverage version="6.3.2" timestamp="1650590279525" lines-valid="7410" lines-covered="4472" line-rate="0.6035" branches-covered="0" branches-valid="0" branch-rate="0" complexity="0">
```
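The quoted `line-rate` values can be checked directly against `lines-covered / lines-valid` (figures taken from the report excerpts and the interface numbers above):

```python
# Check each reported line-rate against lines-covered / lines-valid.
for name, covered, valid in [("ubuntu/macos", 4472, 7410),
                             ("self-hosted",  4951, 7410),
                             ("interface",    3364, 5766)]:
    print(f"{name}: {covered}/{valid} = {covered / valid:.4f}")
# ubuntu/macos -> 0.6035, self-hosted -> 0.6682, interface -> 0.5834
```

The per-report ratios match the XML exactly; the interface's 3364/5766 is the outlier, consistent with the mismatch described above.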
I had started out with only trying to have one coverage report, from self-hosted and Py3.8, but every time I did that I ended up with some kind of "missing commit"-style message from codecov. That's why I switched to trying multiple reports. If multiple reports are the only answer, then maybe we need a codecov YAML file so that we can set the `notify` `after_n_builds` option. Maybe that would force some kind of aggregation.
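A minimal `codecov.yml` sketch of the `after_n_builds` idea (the count of 3 matches the three uploads discussed here; treat the exact keys as something to verify against Codecov's docs):

```yaml
# Sketch: hold notification/aggregation until all three runners have uploaded.
codecov:
  notify:
    after_n_builds: 3
```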
I see the error now. Pretty non-specific, but there's a good chance it does have to do with the source code paths. I noticed in the XML-ish files I downloaded that paths were pretty different:
Ubuntu:
```
<class name="fdmt.py" filename="/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/bifrost/fdmt.py" complexity="0" line-rate="0.5" branch-rate="0">
```
Mac:
```
<class name="fdmt.py" filename="/Users/runner/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/bifrost/fdmt.py" complexity="0" line-rate="0.5" branch-rate="0">
```
Self-hosted:
```
<class name="fdmt.py" filename="/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/bifrost/fdmt.py" complexity="0" line-rate="1" branch-rate="0">
```
They're all different, but maybe it knows how to deal with the first two. Will try their advice about fixing up the paths in the codecov yml.
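Codecov's path-fixing feature in `codecov.yml` works roughly like this (a sketch; the prefixes come from the filenames above, and mapping them to `python/` assumes the repo's source layout):

```yaml
# Sketch: rewrite per-runner site-packages prefixes to the repo's python/ tree
# so reports from different machines refer to the same files.
fixes:
  - "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/::python/"
  - "/Users/runner/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/::python/"
  - "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/::python/"
```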
That worked!
This PR has two parts: 1) Add a new "build and test" workflow target that uses the new Bifrost test machine to run GPU testing. 2) Make `bifrost.ndarray.copy` more robust with arrays that are not C contiguous, i.e., arrays that have been transposed with a stride manipulation.

(1) is largely self-explanatory, but (2) has a few implementation details. For (2) there is now a check for this particular non-contiguous condition in `copy`, and Bifrost tries to do the right thing: `bfTranspose` is used to reorder the data. This reordered data goes into an intermediate Bifrost array before being copied to a Bifrost array in the correct space. This implementation seems to work, but it is copy heavy with all of the intermediates. There is also a new test in `test_ndarray` that checks this particular condition for system-accessible arrays.
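The shape of that copy path can be sketched with NumPy as a stand-in (this is an illustration of the approach, not the actual Bifrost implementation; the function name is hypothetical):

```python
import numpy as np

def copy_noncontiguous(dst, src):
    """Sketch of the copy path described above: a non-C-contiguous source
    (e.g. a transposed, strided view) is first reordered into a contiguous
    intermediate, then copied into the destination buffer."""
    if not src.flags['C_CONTIGUOUS']:
        # Stands in for the bfTranspose reorder into an intermediate array;
        # this extra materialization is the "copy heavy" part noted above.
        src = np.ascontiguousarray(src)
    np.copyto(dst, src)
    return dst

# Usage: copy a transposed (non-contiguous) view into a fresh C-ordered buffer.
a = np.arange(6).reshape(2, 3).T            # shape (3, 2), not C-contiguous
out = copy_noncontiguous(np.empty_like(a, order='C'), a)
```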