ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License

Create simple Google Colab demo #158

Closed MilesCranmer closed 2 years ago

MilesCranmer commented 2 years ago

Google Colab is a web-based Jupyter notebook environment which gives free access to P100 GPUs. I think it will make for a great tool for trying out Bifrost without needing to do any configuration whatsoever; even less configuration than with Docker. (@jaycedowell and I discussed this in a call a month ago and I decided to get it working.)

This PR creates a Jupyter notebook that can be opened in Colab and will automatically configure and install Bifrost, with the GPU interface working(!), for users to try out.

The demo itself is pretty short, but could grow into a full tutorial. The new README link references the live copy of the notebook on the master branch, so the Colab copy will mirror the GitHub version.

https://colab.research.google.com/github/ledatelescope/bifrost/blob/master/BifrostDemo.ipynb

This link won't work until this PR is merged, so until then you can use https://colab.research.google.com/drive/129ZH4VAnDPRMH3rR-OPiMr7pzr01ZSqf?usp=sharing.

For the most part the regular installation of Bifrost works (the %%shell Jupyter command can be used to install things in the virtual machine), but the one catch is that you need to update LD_LIBRARY_PATH from within Python. I also switched to the autoconf version from #157, but the old installation seems to work as well.
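
A minimal sketch of that in-notebook tweak, assuming a hypothetical install prefix of /usr/local/lib (the notebook itself is the authoritative reference for the actual path):

```python
import os

# Hypothetical library directory; the notebook's actual install prefix may differ.
BIFROST_LIB_DIR = "/usr/local/lib"

def prepend_ld_library_path(path):
    """Prepend `path` to LD_LIBRARY_PATH inside the running process."""
    current = os.environ.get("LD_LIBRARY_PATH", "")
    os.environ["LD_LIBRARY_PATH"] = path + (":" + current if current else "")
    return os.environ["LD_LIBRARY_PATH"]

# Run this before importing bifrost so the bindings can locate libbifrost.so.
prepend_ld_library_path(BIFROST_LIB_DIR)
```

Whether an in-process update is sufficient depends on how the bindings resolve the shared library; with glibc, dlopen's search path reflects LD_LIBRARY_PATH as captured at process start, so the variable may also need to be exported before Python launches.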

Cheers, Miles

coveralls commented 2 years ago

Coverage remained the same at 61.364% when pulling c186633324d3bb7b244e9bd48be0ff5e4187870d on google_colab into 1681fde6e643fcc03a3cea10e411b8411aeb31cc on master.

codecov-commenter commented 2 years ago

Codecov Report

Merging #158 (c186633) into master (1681fde) will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #158   +/-   ##
=======================================
  Coverage   58.46%   58.46%           
=======================================
  Files          65       65           
  Lines        5549     5549           
=======================================
  Hits         3244     3244           
  Misses       2305     2305           


league commented 2 years ago

Nice, looking forward to trying it later today. This has been on my task list since that call, but I ran into a snag right away that I haven't found time to solve. (I had not used Colab with a GPU before.) I see you built it from the autoconf branch, so that's good. Thanks!

league commented 2 years ago

Okay, I ran into an issue. It could be Colab, but it could also be an issue with ./configure related to CUDA arch detection.

I copied the notebook you linked from your Drive into my account. The blocks installing dependencies seemed to proceed okay. For the script that ran the Bifrost install, the configure summary looked like this:

configure: cuda: yes - 30 37
configure: numa: yes
configure: hwloc: yes
configure: libvma: no
configure: python bindings: yes
configure: memory alignment: 4096
configure: logging directory: /dev/shm/bifrost
configure: options: native

Bifrost is now ready to be compiled.  Please run 'make'

But then as soon as it started to run make, a failure was reported:

make -C src all
make[1]: Entering directory '/root/bifrost_repo/src'
nvcc fatal   : Unsupported gpu architecture 'compute_30'
Makefile:134: recipe for target 'fft_kernels.o' failed

I ran this in the same session, to see the archs that nvcc supports:

! nvcc --list-gpu-arch
compute_35
compute_37
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86

So configure reported that 30 and 37 would work, but this nvcc does not support compute_30. I changed the install script to use

./configure --with-gpu-archs=37

and it seems to be doing better. Does this mean our auto-detection needs work?
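
For what it's worth, the supported arch codes can be pulled out of that listing programmatically; a sketch in Python, using the output copied verbatim from the session above:

```python
# Sample output of `nvcc --list-gpu-arch`, copied from the session above.
NVCC_OUTPUT = """\
compute_35
compute_37
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
"""

def supported_archs(listing):
    """Extract numeric arch codes from `nvcc --list-gpu-arch` output."""
    return [line.split("_", 1)[1]
            for line in listing.split()
            if line.startswith("compute_")]

# '30' is absent from this list, which is why compute_30 made nvcc fail above.
print(supported_archs(NVCC_OUTPUT))
```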

league commented 2 years ago

Follow-up: here is a potentially useful section of config.log from the run where it auto-detected the archs.

configure:19313: checking for nvcc
configure:19337: found /usr/local/cuda/bin/nvcc
configure:19350: result: /usr/local/cuda/bin/nvcc
configure:19360: checking for nvprune
configure:19384: found /usr/local/cuda/bin/nvprune
configure:19397: result: /usr/local/cuda/bin/nvprune
configure:19407: checking for cuobjdump
configure:19431: found /usr/local/cuda/bin/cuobjdump
configure:19444: result: /usr/local/cuda/bin/cuobjdump
configure:19455: checking for a working CUDA installation
configure:19477: /usr/local/cuda/bin/nvcc -c  conftest.cpp >&5
configure:19477: $? = 0
configure:19505: /usr/local/cuda/bin/nvcc -o conftest  -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib  -lnuma -lhwloc -lcuda -lcudart conftest.cpp >&5
configure:19505: $? = 0
configure:19507: result: yes
configure:19560: checking which CUDA architectures to target
configure:19622: /usr/local/cuda/bin/nvcc -o conftest -O3 -Xcompiler "-Wall" -DBF_CUDA_ENABLED=1 -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib -lcuda -lcudart conftest.cpp >&5
configure:19622: $? = 0
configure:19622: ./conftest
configure:19622: $? = 0
configure:19626: result: 30 37
configure:19644: checking for valid CUDA architectures
configure:19651: result: yes
configure:19657: checking for Pascal-style CUDA managed memory
configure:19668: result: no
configure:19730: checking for /dev/shm
configure:19744: result: yes
jaycedowell commented 2 years ago

This was an attempt in autoconf to deal with #117, where it appeared that you needed to compile with GPU arch 50 in addition to 5X to have things work on Maxwell. I generalized this to all archs, but maybe it needs some work to prune out archs that don't exist in the current CUDA install.
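
Such pruning could look like the following sketch (the real check lives in cuda.m4 and is written in m4/shell, so this Python version is purely illustrative):

```python
def prune_archs(requested, supported):
    """Drop requested CUDA archs that the installed nvcc cannot target."""
    supported_set = set(supported)
    kept = [a for a in requested if a in supported_set]
    dropped = [a for a in requested if a not in supported_set]
    return kept, dropped

# The Colab failure above: configure detected 30 and 37, but the installed
# CUDA toolchain only supports compute_35 and up.
kept, dropped = prune_archs(["30", "37"],
                            ["35", "37", "50", "52", "60", "75", "80"])
```

With this behavior, configure would build for 37 and warn about (rather than silently pass through) the unsupported 30.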

telegraphic commented 2 years ago

@MilesCranmer very cool! Nice that there's a place with free GPUs.

jaycedowell commented 2 years ago

@league it looks like the "valid arch" test isn't working as expected in cuda.m4. It would be interesting to see what the values of ar_requested, ar_supported, ar_valid, and ar_found are on Colab.

jaycedowell commented 2 years ago

e45ac5db05c754495801a5f5dcfb9e7dd26be511 at least gets configure to recognize that 30 is a bad arch and fail. I'm not sure what the best thing to do here is, since the behavior I would want is situation-specific.

MilesCranmer commented 2 years ago

Thanks! @league good catch. So while Colab has an identical VM for all instances, the GPU itself can differ: P100, T4, or K40 (depending on availability and whether you are on the free tier or not). The one which showed up in my instance was a P100, and the one which showed up for you is, I think, a K40. So yes, it definitely seems like the arch should be auto-detected at compile time.

Will add the --with-gpu-archs=37 for now. It works for the P100 too.
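
For context, the compute capabilities involved, taken from NVIDIA's published compute-capability tables rather than from this PR (the lookup table is illustrative, not part of Bifrost):

```python
# Compute capabilities per NVIDIA's compute-capability tables; this mapping
# is illustrative only and not part of Bifrost.
COLAB_GPU_ARCHS = {
    "Tesla K80": "37",
    "Tesla P100": "60",
    "Tesla T4": "75",
}

# Building for compute_37 typically also embeds PTX, which newer GPUs such
# as the P100 can JIT-compile at load time. That would explain why
# --with-gpu-archs=37 also runs on the P100, at some cost in performance.
```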

jaycedowell commented 2 years ago

@MilesCranmer c3450e4ebd2746cbc9e701d0f2bfbb06adfb4f0a should fix the automatic arch. detection on colab.

jaycedowell commented 2 years ago

A couple of things I noticed from today:

In file included from /usr/local/cuda/include/thrust/detail/config/config.h:27:0,
                 from /usr/local/cuda/include/thrust/detail/config.h:23,
                 from /usr/local/cuda/include/thrust/random.h:23,
                 from romein_kernels.cuh:6,
                 from romein.cu:37:
/usr/local/cuda/include/thrust/detail/config/cpp_dialect.h:104:13: warning: Thrust
   requires C++14. Please pass -std=c++14 to your compiler. Define 
   THRUST_IGNORE_DEPRECATED_CPP_DIALECT to suppress this message.
   THRUST_COMPILER_DEPRECATION(C++14, pass -std=c++14 to your compiler);

and

Building wheels for collected packages: bifrost
  Building wheel for bifrost (setup.py) ... done
  Created wheel for bifrost: filename=bifrost-..-py3-none-any.whl size=177871 sha256=91afb4db4da01046812a8e76775297012187b3f5570f9b4b8aca3b6e65b79847
  Stored in directory: /tmp/pip-ephem-wheel-cache-xxkw04si/wheels/5b/88/bb/4f07f6235f452a6ce297916eba9ef03b0e138f2a0e4cefb35f
  WARNING: Built wheel for bifrost is invalid: Metadata 1.2 mandates PEP 440 version, but '..' is not
Failed to build bifrost
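
The wheel failure above is a PEP 440 violation: '..' is not a valid version string. A simplified check (the full PEP 440 grammar also allows pre-, post-, dev-, and local segments; this sketch covers only plain release segments):

```python
import re

# Simplified PEP 440 release check: one or more dot-separated integers.
# The full grammar (pre/post/dev/local segments) is much richer than this.
RELEASE_RE = re.compile(r"^\d+(\.\d+)*$")

def is_simple_pep440_release(version):
    return bool(RELEASE_RE.match(version))

is_simple_pep440_release("..")     # False -- the broken version from the log
is_simple_pep440_release("0.9.0")  # True
```
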

jaycedowell commented 2 years ago

bb01d95 takes care of the C++14 stuff. The Python API still has a version of '..'.

jaycedowell commented 2 years ago

d1430c3ca11517b32f6d4f88f9f05985f0ccdfb9 takes care of the Python version problem.

MilesCranmer commented 2 years ago

Works for me! Ready to merge?

After the merge, the README.md link should be updated to https://colab.research.google.com/github/ledatelescope/bifrost/blob/master/BifrostDemo.ipynb

jaycedowell commented 2 years ago

Chris is also going to give this a try tomorrow. If that checks out as well then, yes, let's merge this.

league commented 2 years ago

Hey guys, I was successful with the Colab demo. I built it from the latest commit on the autoconf branch (d1430c3ca11517b32f6d4f88f9f05985f0ccdfb9), without any special arguments to ./configure this time. As far as I'm concerned, this looks ready to merge. Nice work!