league commented 2 years ago

Would make for easier CI on them. They serve both as documentation and test. Making this into a PR so it may be easier to navigate the commits involved, etc.

This branch was rewritten from the bifrost_tutorial repo. All dates and authorships should be the same, but we're pretending that everything was done in a tutorial/ subdir all along.

league commented 2 years ago

Just don't squash-merge, then the carefully-constructed lies in the history would be lost. :)

codecov-commenter commented 2 years ago

Codecov Report

Merging #179 (114369b) into master (eb5a8c5) will decrease coverage by 0.05%. The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master     #179      +/-   ##
==========================================
- Coverage   66.81%   66.76%   -0.06%     
==========================================
  Files          69       69              
  Lines        7410     7416       +6     
==========================================
  Hits         4951     4951              
- Misses       2459     2465       +6

Impacted Files	Coverage Δ
python/bifrost/telemetry/__init__.py	`34.63% <0.00%> (-1.21%)`	:arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update eb5a8c5...114369b. Read the comment docs.

jaycedowell commented 2 years ago

Just to play devil's advocate: what about using https://github.community/t/triggering-by-other-repository/16163/12 to trigger testing the tutorial when master is updated?

league commented 2 years ago

Sure, I think that would be fine. Or go back just to scheduled but add a manual "run this workflow" button so project members can trigger it if there are doubts.

Part of the rationale for merging tutorials here is also to help with your runners-per-repo problem... maybe that has a better solution too.

jaycedowell commented 2 years ago

I'm actually leaning towards moving the tutorials into the main repo since I think it will make things easier for new users to get up and running. Plus, it could be helpful from a testing perspective to have more examples of how Bifrost is used in additional real world-like contexts.

I threw a poll on the slack channel to see if anyone else has an opinion.

jaycedowell commented 2 years ago

Eh, it doesn't seem to do anything to the coverage.

jaycedowell commented 2 years ago

For tutorial 6 do we want to update that to point to the ibverb-support branch here on ledatelescope?

league commented 2 years ago

I'm giggling at c01ab – elegant enough to be a “haque” rather than a hack.

I wasn't sure whether ibverb-support branch would execute this test yet, but I may have just been grepping for definitions and got it wrong.

So I tried it just now, and the writer chunk is reliably causing a runtime crash reported as:

python3: hw_locality.hpp:60: BoundThread::BoundThread(int):
Assertion `_hwloc.bind_memory_to_core(core) == 0' failed.

So that looks like fun.

2022-07-07_18-28

I saved the in-progress work at league/bifrost/tutorial-merge including the install log. It shows:

configure: cuda: yes - v11.1 - 70 75
configure: numa: yes
configure: hwloc: yes
configure: libvma: no
configure: libverbs: yes
configure: librdmacm: yes
configure: python bindings: yes
configure: memory alignment: 4096
configure: logging directory: /dev/shm/bifrost
configure: options: native map_cache

I couldn't get libvma easily, but assumed that should be okay.

jaycedowell commented 2 years ago

hwloc shouldn't be required for these tests. What if you throw a --disable-hwloc to configure?

league commented 2 years ago

Yes, went through with --disable-hwloc and also reduce bifrost.quantize.quantize to just bifrost.quantize.

jaycedowell commented 2 years ago

Excellent. Now I just need to fix the testing on cetus.

jaycedowell commented 2 years ago

Things should start running now.

league commented 2 years ago

I'm revising tutorial/README — should the Docker/Amazon instructions be really minimized/barebones/deleted? The issue is that both of them reference images that we would need to manually keep up to date. For now they could continue to exist, but conceivably the tutorials in ledatelescope/bifrost/master will eventually deviate substantively. If we didn't have the colab option, then a workflow to (re)generate tutorial images somehow might be useful, but it seems like too much work for too little benefit.

jaycedowell commented 2 years ago

Personally, I would like to minimize how many external/supplementation things that we need to maintain. I'm be happy with the AWS stuff being deleted. Docker is harder but I'm not seeing a lot of added value there now that building Bifrost is a lot easier. So maybe "yes, delete" for AWS and "yes, barebones" for Docker?

league commented 2 years ago

I think I agree. The value of Amazon was only to try it out from a machine with no GPU. And I just noticed that you updated the Dockerfile in eadcf24d45517b0a7365b255bb5300b810739584 so it should be relatively easy to run and push to dockerhub once in a while. Keeping AWS images updated would be more work than that.

league commented 2 years ago

I'll try rerun of tutorial Dockerfile on cetus.

league commented 2 years ago

I think I'm going to merge tutorial stuff shortly.

league commented 2 years ago

I think that one out-of-memory report on self-hosted 3.6 could be because I was simultaneously messing with a docker build. I think the Dockerfile works again. I could run jupyter through it and it had Bifrost installed and could run anything non-GPU. When running something using GPU it hanged, but I guess because both GPUs are being used by the runners. (Or maybe I need to do something more than just --runtime=nvidia.)

jaycedowell commented 2 years ago

I've been doing some testing on the north arm pipelines and there are a couple of those running. That uses some GPU memory.

ledatelescope / bifrost

Merge tutorials into this repo? #179

Codecov Report