ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License

Self-hosted GPU testing, plus make `bifrost.ndarray.copy` more robust #174

Closed. jaycedowell closed this pull request 2 years ago.

jaycedowell commented 2 years ago

This PR has two parts: 1) add a new "build and test" workflow target that uses the new Bifrost test machine to run GPU testing, and 2) make bifrost.ndarray.copy more robust for arrays that are not C contiguous, i.e., arrays that have been transposed via a stride manipulation.

(1) is largely self-explanatory, but (2) has a few implementation details. For (2) there is now a check in copy for this particular non-contiguous condition, and Bifrost tries to do the right thing by staging through intermediate copies.

This implementation seems to work, but it is copy-heavy with all of the intermediates. There is also a new test in test_ndarray that checks this particular condition for system-accessible arrays.
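
For context, a minimal sketch of the general approach in plain NumPy. This is only an illustration of the staging idea, not the actual bifrost implementation (which also has to handle the different memory spaces), and copy_via_contiguous is a hypothetical name:

```python
import numpy as np

def copy_via_contiguous(dst, src):
    # Low-level byte copies assume a contiguous buffer, so a transposed or
    # otherwise stride-manipulated view is first staged through a
    # C-contiguous intermediate before the flat copy into the destination.
    if not src.flags['C_CONTIGUOUS']:
        src = np.ascontiguousarray(src)  # the extra intermediate copy
    dst.ravel()[:] = src.ravel()         # stand-in for the raw buffer copy
    return dst

a = np.arange(12, dtype=np.float32).reshape(3, 4).T  # transposed: not C contiguous
b = np.empty(a.shape, dtype=a.dtype)                 # C-contiguous destination
copy_via_contiguous(b, a)
assert np.array_equal(a, b)
```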

jaycedowell commented 2 years ago

I thought this was working, but I'm adding additional tests to test_ndarray and now it doesn't look like it.

jaycedowell commented 2 years ago

Everything seems to check out now with 5990366. The only open question is whether there is a more efficient way to implement copy.

league commented 2 years ago

It looks like skip-duplicate-actions is working out. They still show up as jobs, but I guess it's easy enough to click through to the real ones. (From what I was reading, it's hard to eliminate the job entirely as long as you want runs both for pushes to branches by maintainers and for new pull requests from outsiders.)

I was thinking it might also help to simplify the build matrix, to be more judicious or selective about what each configuration "brings to the table" rather than just multiplying everything by everything. We want:

But every possible combination is probably not productive. If we push something that works with Python 3.8 but causes an issue with 3.6, it's helpful to know that, but the issue is probably consistent across most of the other permutations.

jaycedowell commented 2 years ago

If I had to guess what combinations would be useful:

jaycedowell commented 2 years ago

I don't know that I've come up with a good way to roll different CUDA versions in with the self-hosted runner. The best I have so far is to alternate the container that is used between self-hosted runs. That amounts to testing multiple versions of CUDA, but with unknown combinations of the other parameters.

league commented 2 years ago

"different CUDA versions in with the self-hosted runner"

This might actually be something that nix can help with. Through nixpkgs, I can install cuda-X.Y and build with it, even on my GPU-less ThinkPad. You just can't query the hardware (so configure needs its --with-gpu-archs). Hypothetically one could even do those builds on GitHub CI machines. (The built artifacts won't run without a GPU, however.) I haven't tried doing GPU builds on GitHub yet, mainly because the download from nvidia can take forever; it's a multi-GB package for each version. The workflow feature to cache stuff from one build to the next could help, but I'm not sure what the size limitations are.

But on the self-hosted server, maybe it would work. Can you describe more about how the self-hosted setup works? Is it building on a standard Ubuntu Docker image but with GPU pass-through? Can we select which image to use, and is that per-build or per-repo? I can create a branch to try out some small experiments with self-hosted.

jaycedowell commented 2 years ago

Yes, it's based on a standard Ubuntu image with CUDA 11 installed. It's set up as an ephemeral runner, so you get only one workflow/test per container launch. I have a cron job that makes sure there is always a container running. Everything that makes it work is on the test machine in ~jdowell/selfhosted/. The biggest thing I think it is missing right now, apart from the possibility of different CUDA versions, is limiting its access to only a single GPU.

league commented 2 years ago

"based on a standard Ubuntu image with CUDA 11 installed"

I see it now. So this means that the installation of dependencies can be roughly partitioned between those specified in this Dockerfile (which persist across builds) and those in main.yml (which are repeated with each build).

And further, if we wanted any additional data/packages to be cached or imported locally, we could do that via COPY or even VOLUME in the Dockerfile.

jaycedowell commented 2 years ago

That's right. The Dockerfile gives us the base system plus CUDA, which is roughly constant between tests, and then main.yml provides the test environment (at least the Python portion and dependencies) as well as the tests themselves. I will say that I'm a little nervous about using too much COPY or VOLUME to expose things to Docker, partly because I want to isolate the container as much as possible and partly because I had some problems with long-lived containers running multiple tests when I was first setting this up.

jaycedowell commented 2 years ago

Everything works (again)!

codecov-commenter commented 2 years ago

Codecov Report

Merging #174 (53cb1a0) into master (317d653) will increase coverage by 8.44%. The diff coverage is 90.90%.

Note: Current head 53cb1a0 differs from pull request most recent head a4be8b5. Consider uploading reports for the commit a4be8b5 to get more accurate results.

@@            Coverage Diff             @@
##           master     #174      +/-   ##
==========================================
+ Coverage   58.37%   66.81%   +8.44%     
==========================================
  Files          67       69       +2     
  Lines        5744     7410    +1666     
==========================================
+ Hits         3353     4951    +1598     
- Misses       2391     2459      +68     
Impacted Files                            Coverage Δ
python/bifrost/ndarray.py                 83.51% <90.90%> (+10.36%) ↑
python/bifrost/version/__init__.py        100.00% <0.00%> (ø)
python/bifrost/libbifrost_generated.py    72.85% <0.00%> (ø)
python/bifrost/ring2.py                   85.71% <0.00%> (+0.25%) ↑
python/bifrost/pipeline.py                84.87% <0.00%> (+1.17%) ↑
python/bifrost/DataType.py                66.89% <0.00%> (+2.02%) ↑
python/bifrost/blocks/transpose.py        96.87% <0.00%> (+3.12%) ↑
python/bifrost/memory.py                  78.94% <0.00%> (+3.50%) ↑
python/bifrost/libbifrost.py              79.22% <0.00%> (+23.37%) ↑
... and 15 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 317d653...a4be8b5.

jaycedowell commented 2 years ago

It'd be really nice if we could get the codecov report from the self-hosted runner, but I have yet to figure out how to do that.

league commented 2 years ago

Definitely, we want to see the coverage reported by a GPU runner. I'll toy with that a little today.

league commented 2 years ago

It looks like it is uploading, but from multiple platforms, so the self-hosted one probably doesn't "win". I'll try some stuff, but apologies if it revisits what you already figured out. :)

league commented 2 years ago

Ah, now I see how env_vars works and where to find the reports per OS: https://codecov.io/gh/ledatelescope/bifrost/commit/35d9f702414de9feff2f234a22267f25d90f069c/build

So it's getting all three; I'm just not sure how, or whether, it merges them together for the front-page numbers it shows.

league commented 2 years ago

I downloaded the per-OS files on the build page. They do show that the self-hosted coverage is higher than the other two. I don't know why they don't exactly match the final numbers in the interface:

All of those report files consider lines-valid to be 7410. Then the interface shows the project totals ratio to be 3364/5766 = 0.5834 (close to, but not exactly the same as the ubuntu/macos number).

ubuntu.xml:298:<coverage version="6.3.2" timestamp="1650589912156" lines-valid="7410" lines-covered="4472" line-rate="0.6035" branches-covered="0" branches-valid="0" branch-rate="0" complexity="0">
selfhost.xml:483:<coverage version="6.3.2" timestamp="1650591784082" lines-valid="7410" lines-covered="4951" line-rate="0.6682" branches-covered="0" branches-valid="0" branch-rate="0" complexity="0">
macos.xml:298:<coverage version="6.3.2" timestamp="1650590279525" lines-valid="7410" lines-covered="4472" line-rate="0.6035" branches-covered="0" branches-valid="0" branch-rate="0" complexity="0">
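
A quick arithmetic check, using only the figures quoted above, shows where the mismatch sits: the per-report rates follow from lines-covered divided by lines-valid (denominator 7410), while the front-page total uses the 5766-line denominator:

```python
# Arithmetic check on the figures quoted above.
print(f"ubuntu/macos: {4472 / 7410:.4f}")  # 0.6035
print(f"self-hosted:  {4951 / 7410:.4f}")  # 0.6682
print(f"interface:    {3364 / 5766:.4f}")  # 0.5834
```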

jaycedowell commented 2 years ago

I had started out trying to have only one coverage report, from self-hosted and Py3.8, but every time I did that I ended up with some kind of "missing commit"-style message from codecov. That's why I switched to trying multiple reports. If multiple reports are the only answer, then maybe we need a codecov YAML file so that we can set the notify after_n_builds option. Maybe that would force some kind of aggregation.

league commented 2 years ago

I see the error now. It's pretty non-specific, but there's a good chance it does have to do with the source code paths. I noticed in the XML-ish files I downloaded that the paths were quite different:

Ubuntu:

<class name="fdmt.py" filename="/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/bifrost/fdmt.py" complexity="0" line-rate="0.5" branch-rate="0">

Mac:

<class name="fdmt.py" filename="/Users/runner/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/bifrost/fdmt.py" complexity="0" line-rate="0.5" branch-rate="0">

Self-hosted:

<class name="fdmt.py" filename="/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/bifrost/fdmt.py" complexity="0" line-rate="1" branch-rate="0">

They're all different, but maybe it knows how to deal with the first two. I'll try their advice about fixing up the paths in the codecov yml.
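
Purely to illustrate the kind of rewriting involved (the real fix belongs in codecov's configuration, not in Python; the prefixes below are taken from the report excerpts above, and normalize is a hypothetical helper):

```python
import re

# Hypothetical path normalization: map each runner's site-packages install
# path back onto the repository layout so that the three reports line up.
RUNNER_PREFIXES = [
    r"^/opt/hostedtoolcache/.*/site-packages/bifrost/",                     # ubuntu
    r"^/Users/runner/hostedtoolcache/.*/site-packages/bifrost/",            # macos
    r"^/home/docker/actions-runner/_work/_tool/.*/site-packages/bifrost/",  # self-hosted
]

def normalize(filename):
    for prefix in RUNNER_PREFIXES:
        filename = re.sub(prefix, "python/bifrost/", filename)
    return filename

print(normalize("/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/"
                "site-packages/bifrost/fdmt.py"))
# -> python/bifrost/fdmt.py
```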

jaycedowell commented 2 years ago

That worked!