Closed: texadactyl closed this issue 2 years ago.
The main branch is stable, targeting Ubuntu 20. Check the Dockerfile, or use test_docker.sh
to run the tests against the supported environment. You do have to set up the NVIDIA Container Toolkit so that the code inside Docker can access the GPU. I added better docs explaining this in https://github.com/lacker/seticore/commit/96af5ac245a577e73953fcf07b05df1de9bbf3bf
So yeah, I don't have an up-to-date binary available at Berkeley. The environment there is so old that it's constantly annoying to make things work. I was hoping I could just wait until Ubuntu 22 rolls out. I'm not sure what's going wrong with the output you pasted - maybe it is using an old version of nvcc?
I'll keep this issue open until I do manage to deploy to Berkeley.
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
This worked fine with seticore 0.0.4 (prior to meson).
Is there a schedule for Ubuntu 22.04 rollout at the BL compute nodes?
All right, so the problem here seems to be that nvcc picks up its default C++ version from the default of g++, which is c++98 at Berkeley and c++14 in the Docker container, and meson is not automatically deducing that it should be passing a --std=c++14 flag along to nvcc. I'll poke around and see what the right way is to do this in meson.
I think this is fixed with https://github.com/lacker/seticore/commit/b1ca077306e71ab0291c56d95d008c5755dc572e but I'm not entirely sure, since I don't have a repro of the initial problem.
"permission denied"
(base) texadactyl@blpc0:~$ git clone https://github.com/lacker/seticore
Cloning into 'seticore'...
remote: Enumerating objects: 667, done.
remote: Counting objects: 100% (246/246), done.
remote: Compressing objects: 100% (181/181), done.
remote: Total 667 (delta 153), reused 150 (delta 61), pack-reused 421
Receiving objects: 100% (667/667), 211.00 KiB | 0 bytes/s, done.
Resolving deltas: 100% (416/416), done.
Checking connectivity... done.
(base) texadactyl@blpc0:~$ cd seticore
(base) texadactyl@blpc0:~/seticore$ git submodule init
Submodule 'raw' (git@github.com:lacker/raw.git) registered for path 'raw'
Submodule 'capnproto' (https://github.com/capnproto/capnproto.git) registered for path 'subprojects/capnproto'
Submodule 'fmt' (https://github.com/fmtlib/fmt) registered for path 'subprojects/fmt'
(base) texadactyl@blpc0:~/seticore$ git submodule update
Cloning into 'raw'...
Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:lacker/raw.git' into submodule path 'raw' failed
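One possible workaround on the user side (an assumption on my part: that no GitHub SSH key is configured on the node) is to have git rewrite ssh-style GitHub URLs to https before fetching submodules:

```shell
# Rewrite ssh-style GitHub URLs to https for this user; afterwards, re-run
# `git submodule sync && git submodule update --init` inside the clone.
git config --global url."https://github.com/".insteadOf "git@github.com:"
```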
I was able to download the raw repo separately and read/write the files therein.
The .git/config file looks odd to me. See [submodule "raw"]:
[core]
repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true
[remote "origin"]
url = https://github.com/lacker/seticore
fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
remote = origin
merge = refs/heads/master
[submodule "raw"]
url = git@github.com:lacker/raw.git
[submodule "capnproto"]
url = https://github.com/capnproto/capnproto.git
[submodule "fmt"]
url = https://github.com/fmtlib/fmt
Shouldn't it be formatted just like "fmt" and "capnproto"? Maybe the link just needs to be redone so that anyone can submodule-update "raw".
Also, shouldn't it reside under "subprojects"?
Yeah, I thought it would work the same way, but I guess it is actually just better to submodule using the https form. I switched it up; hopefully it works for you now.
subprojects is a meson thing. capnproto and fmt are included as cmake subprojects, meaning meson will use the cmake build provided by those projects to build them. raw isn't built as a subproject, though; it's a header-only library, so you don't need any compilation step beyond just #include "raw/raw.h". From meson's point of view, capnproto and fmt are subprojects whereas raw is not.
Currently, the following is necessary at the BL compute nodes prior to running seticore version 0.0.6:
export LD_LIBRARY_PATH=/usr/local/cuda-11.0.3/targets/x86_64-linux/lib
This is not necessary for seticore version 0.0.4 (/usr/local/bin).
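If the export is needed routinely, a slightly more defensive form (same path as above) appends to any existing LD_LIBRARY_PATH rather than clobbering it, and could go in ~/.bashrc:

```shell
# Prepend the CUDA lib directory without discarding an existing LD_LIBRARY_PATH.
export LD_LIBRARY_PATH="/usr/local/cuda-11.0.3/targets/x86_64-linux/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```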
I built seticore following the README instructions. HDF5 file: http://blpd0.ssl.berkeley.edu/Voyager_data/Voyager1.single_coarse.fine_res.h5
Running out of GPU memory at the BL compute nodes no matter which device ID I use. E.g.
(base) texadactyl@blpc2:~/seticore/build$ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=3
(base) texadactyl@blpc2:~/seticore/build$ ./seticore *h5
welcome to seticore, version 0.0.6
loading input from Voyager1.single_coarse.fine_res.h5
dedoppler parameters: max_drift=10.00 min_drift=0.0001 snr=25.00
writing output to Voyager1.single_coarse.fine_res.dat
drift rate resolution: -0.0102043
cuda error 2: out of memory
nvidia-smi
Mon Jun 27 05:41:58 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.04 Driver Version: 455.23.04 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 00000000:18:00.0 Off | N/A |
| 22% 41C P5 20W / 250W | 0MiB / 12212MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 Off | 00000000:3B:00.0 Off | N/A |
| 0% 35C P8 7W / 180W | 395MiB / 8119MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:86:00.0 Off | N/A |
| 0% 32C P8 9W / 250W | 0MiB / 11178MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:AF:00.0 Off | N/A |
| 0% 32C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 1 N/A N/A 637993 C .../envs/pipeline/bin/python 393MiB |
+-----------------------------------------------------------------------------+
Hmm. So, as far as I can tell, setting up LD_LIBRARY_PATH is part of how the cuda toolkit is supposed to be installed. I think the cuda install at Berkeley is essentially broken or incomplete. It would make sense to make a custom script to deploy binaries there, but I want to just wait until Ubuntu 22, and for now manually modifying LD_LIBRARY_PATH seems okay.
The out-of-memory error: it would be nice if it showed better debugging information there, because the cuda malloc is often where trouble happens. Do you also get an out-of-memory error when you run the included test script, i.e. ./test_dedoppler.sh? I just want to narrow down whether it's specific to the Voyager file or a general memory allocation error.
test_dedoppler.sh result
./test_dedoppler.sh
downloading h5 data for regression testing...
--2022-06-27 11:34:07-- https://bldata.berkeley.edu/pipeline/AGBT21B_999_31/blc17_blp17/blc17_guppi_59544_62191_HIP99317_0059.rawspec.0000.h5
Resolving bldata.berkeley.edu (bldata.berkeley.edu)... 208.68.240.101
Connecting to bldata.berkeley.edu (bldata.berkeley.edu)|208.68.240.101|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3230416399 (3.0G) [application/octet-stream]
Saving to: ‘data/blc17_guppi_59544_62191_HIP99317_0059.rawspec.0000.h5’
blc17_guppi_59544_62191_HIP99317_0059.rawspec.0000.h5 100%[==================================================================================================================================================================>] 3.01G 435MB/s in 7.0s
2022-06-27 11:35:10 (437 MB/s) - ‘data/blc17_guppi_59544_62191_HIP99317_0059.rawspec.0000.h5’ saved [3230416399/3230416399]
welcome to seticore, version 0.0.6
loading input from data/blc17_guppi_59544_62191_HIP99317_0059.rawspec.0000.h5
dedoppler parameters: max_drift=0.40 min_drift=0.0000 snr=10.00
writing output to data/testout.hits
drift rate resolution: -0.0102043
hit: coarse channel = 23, index = 566231, snr = 10.069617, drift rate = -0.000000 (0 bins)
hit: coarse channel = 30, index = 384478, snr = 215.748932, drift rate = -0.000000 (0 bins)
hit: coarse channel = 30, index = 388533, snr = 10.816778, drift rate = -0.102043 (10 bins)
hit: coarse channel = 60, index = 418095, snr = 37.858383, drift rate = 0.081634 (-8 bins)
hit: coarse channel = 60, index = 436060, snr = 16.085958, drift rate = 0.020409 (-2 bins)
hit: coarse channel = 60, index = 438649, snr = 10.715086, drift rate = -0.051021 (5 bins)
hit: coarse channel = 60, index = 440394, snr = 23.783800, drift rate = -0.214289 (21 bins)
dedoppler elapsed time 9s
diffing against expected output.
output matches. regression test looks good
The execution of 0.0.6 using the Voyager 1 file still runs out of GPU memory.
The execution of 0.0.4 using the Voyager 1 file:
welcome to seticore, version 0.0.4
loading input from Voyager1.single_coarse.fine_res.h5
dedoppler parameters: max_drift=10.00 min_drift=0.0001 snr=25.00
writing output to Voyager1.single_coarse.fine_res.dat
drift rate resolution: -0.0102043
hit: coarse channel = 0, index = 739933, snr = 30.626724, drift rate = -0.397966 (39 bins)
hit: coarse channel = 0, index = 747929, snr = 245.825119, drift rate = -0.377557 (37 bins)
hit: coarse channel = 0, index = 756037, snr = 31.235535, drift rate = -0.397966 (39 bins)
dedoppler elapsed time 1s
Suggestion: add Voyager 1 h5 to the regression tests.
FWIW I tested out Voyager1.single_coarse.fine_res.h5 and it is working fine on my machine.
$ ./build/seticore data/Voyager1.single_coarse.fine_res.h5 --max_drift=0.4 --snr=10 --min_drift=0 --output=data/testout.hits
welcome to seticore, version 0.0.6
loading input from data/Voyager1.single_coarse.fine_res.h5
dedoppler parameters: max_drift=0.40 min_drift=0.0000 snr=10.00
writing output to data/testout.hits
drift rate resolution: -0.0102043
hit: coarse channel = 10922, index = 32, snr = 236069.875000, drift rate = -0.000000 (0 bins)
hit: coarse channel = 15413, index = 0, snr = 15.369071, drift rate = -0.326536 (32 bins)
hit: coarse channel = 15415, index = 16, snr = 14.831736, drift rate = -0.316332 (31 bins)
hit: coarse channel = 15753, index = 9, snr = 15.295180, drift rate = -0.367353 (36 bins)
dedoppler elapsed time 8s
There are a few different integration tests now, requiring some 40G or so of downloads to run the full suite, so I hesitate to just keep adding them.
turbo_seti gets this from the same file:
# --------------------------
# Top_Hit_# Drift_Rate SNR Uncorrected_Frequency Corrected_Frequency Index freq_start freq_end SEFD SEFD_freq Coarse_Channel_Number Full_number_of_hits
# --------------------------
000001 -0.397966 30.612333 8419.319368 8419.319368 739933 8419.321559 8419.317181 0.0 0.000000 0 29856
000002 -0.377557 245.709610 8419.297028 8419.297028 747929 8419.299218 8419.294840 0.0 0.000000 0 29856
000003 -0.397966 31.220858 8419.274374 8419.274374 756037 8419.276565 8419.272187 0.0 0.000000 0 29856
Has the dedoppler algorithm changed? It used to duplicate turbo_seti fairly well.
Hmmm, the algorithm is supposed to track. I was just testing on a different file and found no differences. Not sure what's going on here. Perhaps something to do with the coarse channel inference logic? It looks like seticore is concluding there are tens of thousands of coarse channels, which doesn't seem like the right way to interpret this data.
According to Danny, the Voyager 1 file was a special production: there is only 1 coarse channel.
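A purely hypothetical sketch of how that misinterpretation could arise (the numbers are illustrative assumptions, not seticore's actual inference logic): if the fine-channels-per-coarse-channel count is guessed rather than taken from metadata, a single-coarse-channel file splinters into thousands of apparent coarse channels.

```python
# Hypothetical numbers, not seticore's actual logic.
fine_channels_in_file = 1048576   # one coarse channel, finely channelized
guessed_fine_per_coarse = 64      # a wrong guess at "standard" channelization

apparent_coarse_channels = fine_channels_in_file // guessed_fine_per_coarse
print(apparent_coarse_channels)   # 16384 apparent coarse channels instead of 1
```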
Looking better on Voyager 1 h5 with version 0.1.0:
seticore -M 4.0 -s 20 V*h5
welcome to seticore, version 0.1.0
running in dedoppler mode.
loading input from Voyager1.single_coarse.fine_res.h5
dedoppler parameters: max_drift=4.00 min_drift=0.0001 snr=20.00
writing output to Voyager1.single_coarse.fine_res.dat
drift rate resolution: -0.0102043
hit: coarse channel = 0, index = 739933, snr = 30.626724, drift rate = -0.397966 (39 bins)
hit: coarse channel = 0, index = 747929, snr = 245.825119, drift rate = -0.377557 (37 bins)
hit: coarse channel = 0, index = 756037, snr = 31.235535, drift rate = -0.397966 (39 bins)
dedoppler elapsed time 3s
turboSETI:
find_doppler.0 INFO Top hit found! SNR 30.612128, Drift Rate -0.392226, index 739933
find_doppler.0 INFO Top hit found! SNR 245.707984, Drift Rate -0.373093, index 747929
find_doppler.0 INFO Top hit found! SNR 31.220652, Drift Rate -0.392226, index 756037
I found a bug in the coarse channel detection logic where it was using uninitialized memory - that could explain some of this not-quite-deterministic behavior. It's fixed in 0.1.4.
All right - installing ninja with python as per your suggestion in the other issue is actually great, then the only thing I needed an updated version of was cmake, and the rest of the instructions worked at berkeley. I pushed an updated version - 0.1.6 - to /usr/local/bin/seticore and I'll try to keep that vaguely up to date.
The main branch seems to be a work in progress. True?
Just downloaded latest seticore main branch to a BL compute node under my login. Then,
apt package dependency versions (excluding cmake):
meson version: 0.62.2
Build issue:
meson setup build
...fmt| WARNING: The version of CMake /home/lacker/cmake/bin/cmake is 3.10.2 but version >=3.14 is required
So, I installed new stable versions of cmake and ninja this way:
pip install -U --user cmake ninja
Then, meson setup build succeeded. However, some subsequent compiles failed. Still a work in progress? Do the BL compute nodes have newly-discovered issues?
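One detail worth noting (an assumption about pip's --user scheme, not something stated in the thread): pip install --user typically drops the new cmake and ninja into ~/.local/bin, which has to precede the stale system versions in PATH for meson to pick them up:

```shell
# Make sure user-installed tools shadow the old system cmake/ninja.
export PATH="$HOME/.local/bin:$PATH"
```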
Note that the version (0.0.4) of seticore in /usr/local/bin does not match the main branch but seems to be the latest stable version.
Suggestion: keep the latest stable version in /usr/local/bin on the BL compute nodes.