ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License

Create simple Google Colab demo #158

Closed MilesCranmer closed 2 years ago

MilesCranmer commented 2 years ago

Google Colab is a web-based Jupyter notebook environment which gives free access to P100 GPUs. I think it will make for a great tool for trying out Bifrost without needing to do any configuration whatsoever; even less configuration than with Docker. (@jaycedowell and I discussed this in a call a month ago and I decided to get it working.)

This PR creates a Jupyter notebook that can be opened in Colab and will automatically configure and install Bifrost, with the GPU interface working(!), for users to try out.

The demo itself is pretty short, but could grow into a full tutorial. The new README link references the live copy of the notebook on the master branch, so the Colab copy will mirror the GitHub version.

https://colab.research.google.com/github/ledatelescope/bifrost/blob/master/BifrostDemo.ipynb

This link won't work until this PR is merged, so until then you can use https://colab.research.google.com/drive/129ZH4VAnDPRMH3rR-OPiMr7pzr01ZSqf?usp=sharing.

For the most part the regular installation of Bifrost works (the %%shell Jupyter command can be used to install things in the virtual machine), but the one catch is that you need to update LD_LIBRARY_PATH from within Python. I also switched to the autoconf version from #157, but the old installation seems to work as well.
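
A minimal sketch of that in-notebook tweak, assuming a hypothetical install prefix of /usr/local/lib (the notebook itself is the authoritative reference for the actual path):

```python
import os

# Hypothetical library directory; the notebook's actual install prefix may differ.
BIFROST_LIB_DIR = "/usr/local/lib"

def prepend_ld_library_path(path):
    """Prepend `path` to LD_LIBRARY_PATH inside the running process."""
    current = os.environ.get("LD_LIBRARY_PATH", "")
    os.environ["LD_LIBRARY_PATH"] = path + (":" + current if current else "")
    return os.environ["LD_LIBRARY_PATH"]

# Run this before importing bifrost so the bindings can locate libbifrost.so.
prepend_ld_library_path(BIFROST_LIB_DIR)
```

Whether an in-process update is sufficient depends on how the bindings resolve the shared library; with glibc, dlopen's search path reflects LD_LIBRARY_PATH as captured at process start, so the variable may also need to be exported before Python launches.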

Cheers, Miles

coveralls commented 2 years ago

Coverage remained the same at 61.364% when pulling c186633324d3bb7b244e9bd48be0ff5e4187870d on google_colab into 1681fde6e643fcc03a3cea10e411b8411aeb31cc on master.

codecov-commenter commented 2 years ago

Codecov Report

Merging #158 (c186633) into master (1681fde) will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #158   +/-   ##
=======================================
  Coverage   58.46%   58.46%           
=======================================
  Files          65       65           
  Lines        5549     5549           
=======================================
  Hits         3244     3244           
  Misses       2305     2305           


league commented 2 years ago

Nice, looking forward to trying it later today. This has been on my task list since that call, but I ran into a snag right away that I haven't found time to solve. (I had not used Colab with a GPU before.) I see you built it from the autoconf branch, so that's good. Thanks!

league commented 2 years ago

Okay, I ran into an issue. It could be Colab, but it could also be an issue with ./configure related to CUDA arch detection.

I copied the notebook you linked from your Drive into my account. The blocks installing dependencies seemed to proceed okay. For the script that ran the Bifrost install, the configure summary looked like this:

configure: cuda: yes - 30 37
configure: numa: yes
configure: hwloc: yes
configure: libvma: no
configure: python bindings: yes
configure: memory alignment: 4096
configure: logging directory: /dev/shm/bifrost
configure: options: native

Bifrost is now ready to be compiled.  Please run 'make'

But then as soon as it started to run make, a failure was reported:

make -C src all
make[1]: Entering directory '/root/bifrost_repo/src'
nvcc fatal   : Unsupported gpu architecture 'compute_30'
Makefile:134: recipe for target 'fft_kernels.o' failed

I ran this in the same session, to see the archs that nvcc supports:

! nvcc --list-gpu-arch
compute_35
compute_37
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86

So configure reported that 30 and 37 would work, but this nvcc does not support compute_30. I changed the install script to use

./configure --with-gpu-archs=37

and it seems to be doing better. Does this mean our auto-detection needs work?
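
For what it's worth, the supported arch codes can be pulled out of that listing programmatically; a sketch in Python, using the output copied verbatim from the session above:

```python
# Sample output of `nvcc --list-gpu-arch`, copied from the session above.
NVCC_OUTPUT = """\
compute_35
compute_37
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
"""

def supported_archs(listing):
    """Extract numeric arch codes from `nvcc --list-gpu-arch` output."""
    return [line.split("_", 1)[1]
            for line in listing.split()
            if line.startswith("compute_")]

# '30' is absent from this list, which is why compute_30 made nvcc fail above.
print(supported_archs(NVCC_OUTPUT))
```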

league commented 2 years ago

Follow-up: here is a potentially useful section of config.log from the run where it auto-detected the archs.

configure:19313: checking for nvcc
configure:19337: found /usr/local/cuda/bin/nvcc
configure:19350: result: /usr/local/cuda/bin/nvcc
configure:19360: checking for nvprune
configure:19384: found /usr/local/cuda/bin/nvprune
configure:19397: result: /usr/local/cuda/bin/nvprune
configure:19407: checking for cuobjdump
configure:19431: found /usr/local/cuda/bin/cuobjdump
configure:19444: result: /usr/local/cuda/bin/cuobjdump
configure:19455: checking for a working CUDA installation
configure:19477: /usr/local/cuda/bin/nvcc -c  conftest.cpp >&5
configure:19477: $? = 0
configure:19505: /usr/local/cuda/bin/nvcc -o conftest  -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib  -lnuma -lhwloc -lcuda -lcudart conftest.cpp >&5
configure:19505: $? = 0
configure:19507: result: yes
configure:19560: checking which CUDA architectures to target
configure:19622: /usr/local/cuda/bin/nvcc -o conftest -O3 -Xcompiler "-Wall" -DBF_CUDA_ENABLED=1 -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib -lcuda -lcudart conftest.cpp >&5
configure:19622: $? = 0
configure:19622: ./conftest
configure:19622: $? = 0
configure:19626: result: 30 37
configure:19644: checking for valid CUDA architectures
configure:19651: result: yes
configure:19657: checking for Pascal-style CUDA managed memory
configure:19668: result: no
configure:19730: checking for /dev/shm
configure:19744: result: yes
jaycedowell commented 2 years ago

This was an attempt in autoconf to deal with #117, where it appeared that you needed to compile with GPU arch 50 in addition to 5X to have things work on Maxwell. I generalized this to all archs, but maybe it needs some work to prune out archs that don't exist in the current CUDA install.
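
Such pruning could look like the following sketch (the real check lives in cuda.m4 and is written in m4/shell, so this Python version is purely illustrative):

```python
def prune_archs(requested, supported):
    """Drop requested CUDA archs that the installed nvcc cannot target."""
    supported_set = set(supported)
    kept = [a for a in requested if a in supported_set]
    dropped = [a for a in requested if a not in supported_set]
    return kept, dropped

# The Colab failure above: configure detected 30 and 37, but the installed
# CUDA toolchain only supports compute_35 and up.
kept, dropped = prune_archs(["30", "37"],
                            ["35", "37", "50", "52", "60", "75", "80"])
```

With this behavior, configure would build for 37 and warn about (rather than silently pass through) the unsupported 30.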

telegraphic commented 2 years ago

@MilesCranmer very cool! Nice that there's a place with free GPUs.

jaycedowell commented 2 years ago

@league it looks like the "valid arch" test isn't working as expected in cuda.m4. It would be interesting to see what the values of ar_requested, ar_supported, ar_valid, and ar_found are on Colab.

jaycedowell commented 2 years ago

e45ac5db05c754495801a5f5dcfb9e7dd26be511 at least gets configure to recognize that 30 is a bad arch and fail. I'm not sure what the best thing to do here is, since the behavior I would want is situation-specific.

MilesCranmer commented 2 years ago

Thanks! @league good catch. So while Colab has an identical VM for all instances, the GPU itself can differ: P100, T4, or K40 (depending on availability and whether you are on the free tier or not). The one which showed up in my instance was a P100, and the one which showed up for you is, I think, a K40. So yes, it definitely seems like the arch should be auto-detected at compile time.

Will add the --with-gpu-archs=37 for now. It works for the P100 too.
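
For context, the compute capabilities involved, taken from NVIDIA's published compute-capability tables rather than from this PR (the lookup table is illustrative, not part of Bifrost):

```python
# Compute capabilities per NVIDIA's compute-capability tables; this mapping
# is illustrative only and not part of Bifrost.
COLAB_GPU_ARCHS = {
    "Tesla K80": "37",
    "Tesla P100": "60",
    "Tesla T4": "75",
}

# Building for compute_37 typically also embeds PTX, which newer GPUs such
# as the P100 can JIT-compile at load time. That would explain why
# --with-gpu-archs=37 also runs on the P100, at some cost in performance.
```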

jaycedowell commented 2 years ago

@MilesCranmer c3450e4ebd2746cbc9e701d0f2bfbb06adfb4f0a should fix the automatic arch. detection on colab.

jaycedowell commented 2 years ago

A couple of things I noticed from today:

In file included from /usr/local/cuda/include/thrust/detail/config/config.h:27:0,
                 from /usr/local/cuda/include/thrust/detail/config.h:23,
                 from /usr/local/cuda/include/thrust/random.h:23,
                 from romein_kernels.cuh:6,
                 from romein.cu:37:
/usr/local/cuda/include/thrust/detail/config/cpp_dialect.h:104:13: warning: Thrust
   requires C++14. Please pass -std=c++14 to your compiler. Define 
   THRUST_IGNORE_DEPRECATED_CPP_DIALECT to suppress this message.
   THRUST_COMPILER_DEPRECATION(C++14, pass -std=c++14 to your compiler);

and

Building wheels for collected packages: bifrost
  Building wheel for bifrost (setup.py) ... done
  Created wheel for bifrost: filename=bifrost-..-py3-none-any.whl size=177871 sha256=91afb4db4da01046812a8e76775297012187b3f5570f9b4b8aca3b6e65b79847
  Stored in directory: /tmp/pip-ephem-wheel-cache-xxkw04si/wheels/5b/88/bb/4f07f6235f452a6ce297916eba9ef03b0e138f2a0e4cefb35f
  WARNING: Built wheel for bifrost is invalid: Metadata 1.2 mandates PEP 440 version, but '..' is not
Failed to build bifrost
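
The wheel failure above is a PEP 440 violation: '..' is not a valid version string. A simplified check (the full PEP 440 grammar also allows pre-, post-, dev-, and local segments; this sketch covers only plain release segments):

```python
import re

# Simplified PEP 440 release check: one or more dot-separated integers.
# The full grammar (pre/post/dev/local segments) is much richer than this.
RELEASE_RE = re.compile(r"^\d+(\.\d+)*$")

def is_simple_pep440_release(version):
    return bool(RELEASE_RE.match(version))

is_simple_pep440_release("..")     # False -- the broken version from the log
is_simple_pep440_release("0.9.0")  # True
```
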

jaycedowell commented 2 years ago

bb01d95 takes care of the C++14 stuff. The Python API still has a version of '..'.

jaycedowell commented 2 years ago

d1430c3ca11517b32f6d4f88f9f05985f0ccdfb9 takes care of the Python version problem.

MilesCranmer commented 2 years ago

Works for me! Ready to merge?

After the merge, the README.md link should be updated to https://colab.research.google.com/github/ledatelescope/bifrost/blob/master/BifrostDemo.ipynb

jaycedowell commented 2 years ago

Chris is also going to give this a try tomorrow. If that checks out as well then, yes, let's merge this.

league commented 2 years ago

Hey guys, I was successful with the Colab demo. I built it from the latest commit on the autoconf branch (d1430c3ca11517b32f6d4f88f9f05985f0ccdfb9), without any special arguments to ./configure this time. As far as I'm concerned, this looks ready to merge. Nice work!