Closed league closed 2 years ago
Just don't squash-merge, then the carefully-constructed lies in the history would be lost. :)
Merging #179 (114369b) into master (eb5a8c5) will decrease coverage by
0.05%
. The diff coverage is0.00%
.
@@ Coverage Diff @@
## master #179 +/- ##
==========================================
- Coverage 66.81% 66.76% -0.06%
==========================================
Files 69 69
Lines 7410 7416 +6
==========================================
Hits 4951 4951
- Misses 2459 2465 +6
Impacted Files | Coverage Δ | |
---|---|---|
python/bifrost/telemetry/__init__.py | 34.63% <0.00%> (-1.21%) |
:arrow_down: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update eb5a8c5...114369b. Read the comment docs.
Just to play devil's advocate: what about using https://github.community/t/triggering-by-other-repository/16163/12 to trigger testing the tutorial when master is updated?
Sure, I think that would be fine. Or go back just to scheduled but add a manual "run this workflow" button so project members can trigger it if there are doubts.
Part of the rationale for merging tutorials here is also to help with your runners-per-repo problem... maybe that has a better solution too.
I'm actually leaning towards moving the tutorials into the main repo since I think it will make things easier for new users to get up and running. Plus, it could be helpful from a testing perspective to have more examples of how Bifrost is used in additional real world-like contexts.
I threw a poll on the slack channel to see if anyone else has an opinion.
Eh, it doesn't seem to do anything to the coverage.
For tutorial 6 do we want to update that to point to the ibverb-support
branch here on ledatelescope?
I'm giggling at c01ab
– elegant enough to be a “haque” rather than a hack.
I wasn't sure whether ibverb-support
branch would execute this test yet, but I may have just been grepping for definitions and got it wrong.
So I tried it just now, and the writer chunk is reliably causing a runtime crash reported as:
python3: hw_locality.hpp:60: BoundThread::BoundThread(int):
Assertion `_hwloc.bind_memory_to_core(core) == 0' failed.
So that looks like fun.
I saved the in-progress work at league/bifrost/tutorial-merge including the install log. It shows:
configure: cuda: yes - v11.1 - 70 75
configure: numa: yes
configure: hwloc: yes
configure: libvma: no
configure: libverbs: yes
configure: librdmacm: yes
configure: python bindings: yes
configure: memory alignment: 4096
configure: logging directory: /dev/shm/bifrost
configure: options: native map_cache
I couldn't get libvma easily, but assumed that should be okay.
hwloc shouldn't be required for these tests. What if you throw a --disable-hwloc
to configure?
Yes, went through with --disable-hwloc
and also reduce bifrost.quantize.quantize
to just bifrost.quantize
.
Excellent. Now I just need to fix the testing on cetus.
Things should start running now.
I'm revising tutorial/README — should the Docker/Amazon instructions be really minimized/barebones/deleted? The issue is that both of them reference images that we would need to manually keep up to date. For now they could continue to exist, but conceivably the tutorials in ledatelescope/bifrost/master will eventually deviate substantively. If we didn't have the colab option, then a workflow to (re)generate tutorial images somehow might be useful, but it seems like too much work for too little benefit.
Personally, I would like to minimize how many external/supplementation things that we need to maintain. I'm be happy with the AWS stuff being deleted. Docker is harder but I'm not seeing a lot of added value there now that building Bifrost is a lot easier. So maybe "yes, delete" for AWS and "yes, barebones" for Docker?
I think I agree. The value of Amazon was only to try it out from a machine with no GPU. And I just noticed that you updated the Dockerfile in eadcf24d45517b0a7365b255bb5300b810739584 so it should be relatively easy to run and push to dockerhub once in a while. Keeping AWS images updated would be more work than that.
I'll try rerun of tutorial Dockerfile on cetus.
I think I'm going to merge tutorial stuff shortly.
I think that one out-of-memory report on self-hosted 3.6 could be because I was simultaneously messing with a docker build. I think the Dockerfile works again. I could run jupyter through it and it had Bifrost installed and could run anything non-GPU. When running something using GPU it hanged, but I guess because both GPUs are being used by the runners. (Or maybe I need to do something more than just --runtime=nvidia
.)
I've been doing some testing on the north arm pipelines and there are a couple of those running. That uses some GPU memory.
Would make for easier CI on them. They serve both as documentation and test. Making this into a PR so it may be easier to navigate the commits involved, etc.
This branch was rewritten from the bifrost_tutorial repo. All dates and authorships should be the same, but we're pretending that everything was done in a
tutorial/
subdir all along.