lacker / seticore

A high-performance implementation of some core SETI algorithms that can be included in other programs.
MIT License

Version sync needed on BL compute nodes to match the github main branch #13

Closed texadactyl closed 2 years ago

texadactyl commented 2 years ago

The main branch seems to be a work in progress. True?

I just downloaded the latest seticore main branch to a BL compute node under my login. Then:

apt package dependency versions (excluding cmake):

meson version: 0.62.2

Build issue: meson setup build ... fmt| WARNING: The version of CMake /home/lacker/cmake/bin/cmake is 3.10.2 but version >=3.14 is required

So, I installed new stable versions of cmake and ninja this way: pip install -U --user cmake ninja. Then, meson setup build succeeded. However, some subsequent compiles failed:

cd build; meson compile

[51/126] Compiling Cuda object seticore.p/dedoppler.cu.o
FAILED: seticore.p/dedoppler.cu.o 
nvcc -Iseticore.p -Xcompiler=-Wall,-Winvalid-pch,-Wnon-virtual-dtor -Werror=cross-execution-space-call,deprecated-declarations,reorder -O3 -I/usr/local/cuda/include -I/usr/local/include -I/usr/include -DBOOST_ALL_NO_LIB -I../subprojects/capnproto -Isubprojects/capnproto -I../subprojects/capnproto/__CMake_build -Isubprojects/capnproto/__CMake_build -I../subprojects/capnproto/c++/src -I../subprojects/capnproto/__CMake_build/c++/src/capnp/test_capnp -Isubprojects/capnproto/__CMake_build/c++/src/capnp/test_capnp -I../subprojects/capnproto -Isubprojects/capnproto -I../subprojects/capnproto/__CMake_build -Isubprojects/capnproto/__CMake_build -I../subprojects/capnproto/c++/src -I../subprojects/fmt -Isubprojects/fmt -I../subprojects/fmt/__CMake_build -Isubprojects/fmt/__CMake_build -I../subprojects/fmt/include -I.. -I. -Iseticore.p -o seticore.p/dedoppler.cu.o -c ../dedoppler.cu
In file included from /usr/include/c++/5/type_traits:35:0,
                 from ../subprojects/fmt/include/fmt/core.h:17,
                 from ../dedoppler.cu:4:
/usr/include/c++/5/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
 #error This file requires compiler and library support \
  ^
[55/126] Compiling Cuda object seticore.p/cuda_util.cu.o
FAILED: seticore.p/cuda_util.cu.o 
nvcc -Iseticore.p -Xcompiler=-Wall,-Winvalid-pch,-Wnon-virtual-dtor -Werror=cross-execution-space-call,deprecated-declarations,reorder -O3 -I/usr/local/cuda/include -I/usr/local/include -I/usr/include -DBOOST_ALL_NO_LIB -I../subprojects/capnproto -Isubprojects/capnproto -I../subprojects/capnproto/__CMake_build -Isubprojects/capnproto/__CMake_build -I../subprojects/capnproto/c++/src -I../subprojects/capnproto/__CMake_build/c++/src/capnp/test_capnp -Isubprojects/capnproto/__CMake_build/c++/src/capnp/test_capnp -I../subprojects/capnproto -Isubprojects/capnproto -I../subprojects/capnproto/__CMake_build -Isubprojects/capnproto/__CMake_build -I../subprojects/capnproto/c++/src -I../subprojects/fmt -Isubprojects/fmt -I../subprojects/fmt/__CMake_build -Isubprojects/fmt/__CMake_build -I../subprojects/fmt/include -I.. -I. -Iseticore.p -o seticore.p/cuda_util.cu.o -c ../cuda_util.cu
../cuda_util.h(8): error: this declaration has no storage class or type specifier

../cuda_util.h(8): error: too many initializer values

2 errors detected in the compilation of "../cuda_util.cu".
[56/126] Compiling Cuda object seticore.p/beamformer.cu.o
FAILED: seticore.p/beamformer.cu.o 
nvcc -Iseticore.p -Xcompiler=-Wall,-Winvalid-pch,-Wnon-virtual-dtor -Werror=cross-execution-space-call,deprecated-declarations,reorder -O3 -I/usr/local/cuda/include -I/usr/local/include -I/usr/include -DBOOST_ALL_NO_LIB -I../subprojects/capnproto -Isubprojects/capnproto -I../subprojects/capnproto/__CMake_build -Isubprojects/capnproto/__CMake_build -I../subprojects/capnproto/c++/src -I../subprojects/capnproto/__CMake_build/c++/src/capnp/test_capnp -Isubprojects/capnproto/__CMake_build/c++/src/capnp/test_capnp -I../subprojects/capnproto -Isubprojects/capnproto -I../subprojects/capnproto/__CMake_build -Isubprojects/capnproto/__CMake_build -I../subprojects/capnproto/c++/src -I../subprojects/fmt -Isubprojects/fmt -I../subprojects/fmt/__CMake_build -Isubprojects/fmt/__CMake_build -I../subprojects/fmt/include -I.. -I. -Iseticore.p -o seticore.p/beamformer.cu.o -c ../beamformer.cu
../cuda_util.h(8): error: this declaration has no storage class or type specifier

../cuda_util.h(8): error: too many initializer values

2 errors detected in the compilation of "../beamformer.cu".
[92/126] Compiling C++ object subprojects/capnproto/capnpc_cpp.p/c++_src_capnp_compiler_capnpc-c++.c++.o
ninja: build stopped: subcommand failed.

Still a work in progress? Or do the BL compute nodes have newly discovered issues?

Note that the version (0.0.4) of seticore in /usr/local/bin does not match the main branch but seems to be the latest stable version.

Suggestions?

lacker commented 2 years ago

The main branch is stable, targeting Ubuntu 20. Check the Dockerfile, or run test_docker.sh to run the tests against the supported environment. You do have to set up the Nvidia container toolkit so that the code inside Docker can access the GPU. I added better docs explaining this in https://github.com/lacker/seticore/commit/96af5ac245a577e73953fcf07b05df1de9bbf3bf

So yeah, I don't have an up-to-date binary available at Berkeley. It's such an old environment that it's constantly annoying to make things work there. I was hoping I could just wait until Ubuntu 22 rolls out. I'm not sure what's going wrong in the output you pasted - maybe it is using an old version of nvcc?

I'll keep this issue open til I do manage to deploy to Berkeley....

texadactyl commented 2 years ago

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

This worked fine with seticore 0.0.4 (prior to meson).

texadactyl commented 2 years ago

Is there a schedule for Ubuntu 22.04 rollout at the BL compute nodes?

lacker commented 2 years ago

All right, so the problem here seems to be that nvcc picks up its default C++ standard from g++'s default, which is c++98 at Berkeley and c++14 in the Docker container, and meson is not automatically deducing that it should pass a --std=c++14 flag along to nvcc. I'll poke around and see what the right way is to do this in meson.

lacker commented 2 years ago

I think this is fixed with https://github.com/lacker/seticore/commit/b1ca077306e71ab0291c56d95d008c5755dc572e but I'm not entirely sure, since I don't have a repro of the initial problem...

texadactyl commented 2 years ago

"permission denied"

(base) texadactyl@blpc0:~$ git clone https://github.com/lacker/seticore
Cloning into 'seticore'...
remote: Enumerating objects: 667, done.
remote: Counting objects: 100% (246/246), done.
remote: Compressing objects: 100% (181/181), done.
remote: Total 667 (delta 153), reused 150 (delta 61), pack-reused 421
Receiving objects: 100% (667/667), 211.00 KiB | 0 bytes/s, done.
Resolving deltas: 100% (416/416), done.
Checking connectivity... done.

(base) texadactyl@blpc0:~$ cd seticore

(base) texadactyl@blpc0:~/seticore$ git submodule init
Submodule 'raw' (git@github.com:lacker/raw.git) registered for path 'raw'
Submodule 'capnproto' (https://github.com/capnproto/capnproto.git) registered for path 'subprojects/capnproto'
Submodule 'fmt' (https://github.com/fmtlib/fmt) registered for path 'subprojects/fmt'

(base) texadactyl@blpc0:~/seticore$ git submodule update
Cloning into 'raw'...

Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:lacker/raw.git' into submodule path 'raw' failed

I was able to download the raw repo separately and read/write the files therein.

texadactyl commented 2 years ago

The .git/config file looks odd to me. See [submodule "raw"]:

[core]
    repositoryformatversion = 0
    filemode = true
    bare = false
    logallrefupdates = true
[remote "origin"]
    url = https://github.com/lacker/seticore
    fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
    remote = origin
    merge = refs/heads/master
[submodule "raw"]
    url = git@github.com:lacker/raw.git
[submodule "capnproto"]
    url = https://github.com/capnproto/capnproto.git
[submodule "fmt"]
    url = https://github.com/fmtlib/fmt

Shouldn't it be formatted just like "fmt" and "capnproto"? Maybe redoing the link so that it uses an https URL is all that is needed so that anyone can submodule-update "raw".

Also, shouldn't "raw" reside under "subprojects"?

lacker commented 2 years ago

Yeah, I thought it would work the same way, but I guess it is actually just better to add the submodule using the https form. I switched it over. Hopefully it works for you now.

subprojects is a meson convention: capnproto and fmt are included as cmake subprojects, meaning meson uses the cmake build provided by those projects to build them. raw isn't built as a subproject, though; it's a header-only library, so there's no compilation step beyond #include "raw/raw.h". From meson's point of view, capnproto and fmt are subprojects whereas raw is not.
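For a checkout cloned before the fix, the cached ssh URL can be switched to https locally without recloning. This is a sketch using standard git commands, not something prescribed in the thread:

```shell
# Point the 'raw' submodule at the https remote instead of the ssh one,
# propagate the change from .gitmodules into .git/config, then fetch it.
git config --file .gitmodules submodule.raw.url https://github.com/lacker/raw
git submodule sync raw
git submodule update --init raw
```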

texadactyl commented 2 years ago

Currently, the following is necessary at BL compute nodes prior to running seticore version 0.0.6:

export LD_LIBRARY_PATH=/usr/local/cuda-11.0.3/targets/x86_64-linux/lib

This is not necessary for seticore version 0.0.4 (/usr/local/bin).

I built seticore following the README instructions, using this HDF5 file: http://blpd0.ssl.berkeley.edu/Voyager_data/Voyager1.single_coarse.fine_res.h5

It runs out of GPU memory at the BL compute nodes no matter which device ID I use. E.g.

(base) texadactyl@blpc2:~/seticore/build$ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=3

(base) texadactyl@blpc2:~/seticore/build$ ./seticore *h5
welcome to seticore, version 0.0.6
loading input from Voyager1.single_coarse.fine_res.h5
dedoppler parameters: max_drift=10.00 min_drift=0.0001 snr=25.00
writing output to Voyager1.single_coarse.fine_res.dat
drift rate resolution: -0.0102043
cuda error 2: out of memory
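
One thing worth checking in the session above: in bash, a variable assignment on a line of its own is not exported to later commands, so the CUDA_VISIBLE_DEVICES setting may never reach seticore. A quick demonstration of this generic shell behavior (not specific to seticore):

```shell
# An assignment alone affects only the current shell, not child processes:
FOO=set_but_not_exported
sh -c 'echo "child sees: ${FOO:-unset}"'   # prints "child sees: unset"

# Export it (or prefix it to the command itself) so the child sees it:
export FOO
sh -c 'echo "child sees: ${FOO:-unset}"'   # prints "child sees: set_but_not_exported"
```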

nvidia-smi
Mon Jun 27 05:41:58 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.04    Driver Version: 455.23.04    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:18:00.0 Off |                  N/A |
| 22%   41C    P5    20W / 250W |      0MiB / 12212MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:3B:00.0 Off |                  N/A |
|  0%   35C    P8     7W / 180W |    395MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:86:00.0 Off |                  N/A |
|  0%   32C    P8     9W / 250W |      0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:AF:00.0 Off |                  N/A |
|  0%   32C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A    637993      C   .../envs/pipeline/bin/python      393MiB |
+-----------------------------------------------------------------------------+
lacker commented 2 years ago

Hmm. So, as far as I can tell, setting up LD_LIBRARY_PATH is part of how the cuda toolkit is supposed to be installed, so I think the cuda install at Berkeley is essentially broken or incomplete. It would make sense to make a custom script to deploy binaries there, but I want to just wait until Ubuntu 22, and for now manually modifying LD_LIBRARY_PATH seems okay.
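A more permanent alternative to exporting LD_LIBRARY_PATH in every shell (an assumption about the admin setup, not something decided in the thread) is to register the CUDA library directory with the dynamic linker; the conf file name here is hypothetical:

```shell
# Register the CUDA runtime libraries system-wide (needs root); afterwards
# binaries resolve libcudart without any LD_LIBRARY_PATH export.
echo /usr/local/cuda-11.0.3/targets/x86_64-linux/lib | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig
```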

As for the out-of-memory error, it would be nice if seticore showed better debugging information there, because the cuda malloc is often where trouble happens. Do you also get an out-of-memory error when you run the included test script, i.e. ./test_dedoppler.sh? I just want to narrow down whether it's specific to the Voyager file or a general memory allocation error.

texadactyl commented 2 years ago

test_dedoppler.sh result

./test_dedoppler.sh
downloading h5 data for regression testing...
--2022-06-27 11:34:07--  https://bldata.berkeley.edu/pipeline/AGBT21B_999_31/blc17_blp17/blc17_guppi_59544_62191_HIP99317_0059.rawspec.0000.h5
Resolving bldata.berkeley.edu (bldata.berkeley.edu)... 208.68.240.101
Connecting to bldata.berkeley.edu (bldata.berkeley.edu)|208.68.240.101|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3230416399 (3.0G) [application/octet-stream]
Saving to: ‘data/blc17_guppi_59544_62191_HIP99317_0059.rawspec.0000.h5’

blc17_guppi_59544_62191_HIP99317_0059.rawspec.0000.h5               100%[==================================================================================================================================================================>]   3.01G   435MB/s    in 7.0s    

2022-06-27 11:35:10 (437 MB/s) - ‘data/blc17_guppi_59544_62191_HIP99317_0059.rawspec.0000.h5’ saved [3230416399/3230416399]

welcome to seticore, version 0.0.6
loading input from data/blc17_guppi_59544_62191_HIP99317_0059.rawspec.0000.h5
dedoppler parameters: max_drift=0.40 min_drift=0.0000 snr=10.00
writing output to data/testout.hits
drift rate resolution: -0.0102043
hit: coarse channel = 23, index = 566231, snr = 10.069617, drift rate = -0.000000 (0 bins)
hit: coarse channel = 30, index = 384478, snr = 215.748932, drift rate = -0.000000 (0 bins)
hit: coarse channel = 30, index = 388533, snr = 10.816778, drift rate = -0.102043 (10 bins)
hit: coarse channel = 60, index = 418095, snr = 37.858383, drift rate = 0.081634 (-8 bins)
hit: coarse channel = 60, index = 436060, snr = 16.085958, drift rate = 0.020409 (-2 bins)
hit: coarse channel = 60, index = 438649, snr = 10.715086, drift rate = -0.051021 (5 bins)
hit: coarse channel = 60, index = 440394, snr = 23.783800, drift rate = -0.214289 (21 bins)
dedoppler elapsed time 9s
diffing against expected output.
output matches. regression test looks good

The execution of 0.0.6 using the Voyager 1 file still runs out of GPU memory.

The execution of 0.0.4 using the Voyager 1 file:

welcome to seticore, version 0.0.4
loading input from Voyager1.single_coarse.fine_res.h5
dedoppler parameters: max_drift=10.00 min_drift=0.0001 snr=25.00
writing output to Voyager1.single_coarse.fine_res.dat
drift rate resolution: -0.0102043
hit: coarse channel = 0, index = 739933, snr = 30.626724, drift rate = -0.397966 (39 bins)
hit: coarse channel = 0, index = 747929, snr = 245.825119, drift rate = -0.377557 (37 bins)
hit: coarse channel = 0, index = 756037, snr = 31.235535, drift rate = -0.397966 (39 bins)
dedoppler elapsed time 1s
texadactyl commented 2 years ago

Suggestion: add Voyager 1 h5 to the regression tests.

lacker commented 2 years ago

FWIW I tested out Voyager1.single_coarse.fine_res.h5 and it is working fine on my machine.

$ ./build/seticore data/Voyager1.single_coarse.fine_res.h5 --max_drift=0.4 --snr=10 --min_drift=0 --output=data/testout.hits
welcome to seticore, version 0.0.6
loading input from data/Voyager1.single_coarse.fine_res.h5
dedoppler parameters: max_drift=0.40 min_drift=0.0000 snr=10.00
writing output to data/testout.hits
drift rate resolution: -0.0102043
hit: coarse channel = 10922, index = 32, snr = 236069.875000, drift rate = -0.000000 (0 bins)
hit: coarse channel = 15413, index = 0, snr = 15.369071, drift rate = -0.326536 (32 bins)
hit: coarse channel = 15415, index = 16, snr = 14.831736, drift rate = -0.316332 (31 bins)
hit: coarse channel = 15753, index = 9, snr = 15.295180, drift rate = -0.367353 (36 bins)
dedoppler elapsed time 8s

There are a few different integration tests now, requiring some 40G or so of downloads to run in full, so I hesitate to just keep adding them.

texadactyl commented 2 years ago

turbo_seti gets this from the same file:

# --------------------------
# Top_Hit_#     Drift_Rate  SNR     Uncorrected_Frequency   Corrected_Frequency     Index   freq_start  freq_end    SEFD    SEFD_freq   Coarse_Channel_Number   Full_number_of_hits     
# --------------------------
000001   -0.397966   30.612333     8419.319368     8419.319368  739933     8419.321559     8419.317181  0.0       0.000000  0   29856   
000002   -0.377557  245.709610     8419.297028     8419.297028  747929     8419.299218     8419.294840  0.0       0.000000  0   29856   
000003   -0.397966   31.220858     8419.274374     8419.274374  756037     8419.276565     8419.272187  0.0       0.000000  0   29856   

Has the dedoppler algorithm changed? It used to duplicate turbo_seti fairly well.

lacker commented 2 years ago

Hmmm, the algorithm is supposed to track turbo_seti. I was just testing on a different file and found no differences, so I'm not sure what's going on here. Perhaps something to do with the coarse channel inference logic? It looks like seticore is concluding there are tens of thousands of coarse channels, which doesn't seem like the right way to interpret this data.

texadactyl commented 2 years ago

According to Danny, the Voyager 1 file was a special production: there is only 1 coarse channel.

texadactyl commented 2 years ago

Looking better on Voyager 1 h5 with version 0.1.0:

seticore -M 4.0 -s 20 V*h5
welcome to seticore, version 0.1.0
running in dedoppler mode.
loading input from Voyager1.single_coarse.fine_res.h5
dedoppler parameters: max_drift=4.00 min_drift=0.0001 snr=20.00
writing output to Voyager1.single_coarse.fine_res.dat
drift rate resolution: -0.0102043
hit: coarse channel = 0, index = 739933, snr = 30.626724, drift rate = -0.397966 (39 bins)
hit: coarse channel = 0, index = 747929, snr = 245.825119, drift rate = -0.377557 (37 bins)
hit: coarse channel = 0, index = 756037, snr = 31.235535, drift rate = -0.397966 (39 bins)
dedoppler elapsed time 3s

turboSETI:

find_doppler.0  INFO     Top hit found! SNR 30.612128, Drift Rate -0.392226, index 739933
find_doppler.0  INFO     Top hit found! SNR 245.707984, Drift Rate -0.373093, index 747929
find_doppler.0  INFO     Top hit found! SNR 31.220652, Drift Rate -0.392226, index 756037
lacker commented 2 years ago

I found a bug in the coarse channel detection logic where it was using uninitialized memory - that could explain some of this not-quite-deterministic failure. It's fixed in 0.1.4.

lacker commented 2 years ago

All right - installing ninja with pip as per your suggestion in the other issue actually works great; after that, the only thing I needed an updated version of was cmake, and the rest of the instructions worked at Berkeley. I pushed an updated version - 0.1.6 - to /usr/local/bin/seticore and I'll try to keep that vaguely up to date.