ratt-ru / pfb-imaging

Preconditioned forward/backward clean algorithm

distributed gridding on the CHPC #48

Closed. landmanbester closed this issue 2 years ago.

landmanbester commented 3 years ago

I'm attempting to set up pfb-clean in a virtualenv on the CHPC and thought I would document it here for future reference. I have done the following (on scp.chpc.ac.za):

$ module load chpc/astro/casacore/3.2.1_dev
$ conda create -n pfbenv
$ conda activate pfbenv

That complained with

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.

So I ran

$ conda init zsh
$ source .zshrc
$ conda activate pfbenv

which seems to put me in a Python 3.7.6 virtualenv. I then did (after cloning)

$ pip install --user -I -e pfb-clean/

which seems to work

Successfully installed Click-7.1.2 Jinja2-2.11.3 MarkupSafe-1.1.1 PyWavelets-1.1.1 absl-py-0.12.0 appdirs-1.4.4 argparse-1.4.0 asciitree-0.3.3 astropy-4.2.1 attrs-20.3.0 bokeh-2.3.1 certifi-2020.12.5 chardet-4.0.0 cloudpickle-1.6.0 codex-africanus-0.2.10 cycler-0.10.0 dask-2021.4.1 dask-ms-0.2.6 dataclasses-0.6 decorator-5.0.7 ducc0-0.9.0 fasteners-0.16 flake8-3.9.1 flaky-3.7.0 fsspec-2021.4.0 future-0.18.2 idna-2.10 imageio-2.9.0 importlib-metadata-4.0.1 iniconfig-1.1.1 jax-0.1.68 jaxlib-0.1.47 jsonschema-3.2.0 katbeam-0.1 kiwisolver-1.3.1 llvmlite-0.36.0 locket-0.2.1 matplotlib-3.4.1 mccabe-0.6.1 networkx-2.5.1 numba-0.53.1 numcodecs-0.7.3 numexpr-2.7.3 numpy-1.20.2 omegaconf-2.0.6 opt-einsum-3.3.0 packaging-20.9 packratt-0.1.3 pandas-1.2.4 partd-1.2.0 pfb-clean pillow-8.2.0 pluggy-0.13.1 psutil-5.8.0 py-1.10.0 pycodestyle-2.7.0 pyerfa-1.7.3 pyflakes-2.3.1 pyparsing-2.4.7 pyrsistent-0.17.3 pyscilog-0.1.2 pytest-6.2.4 pytest-flake8-1.0.7 python-casacore-3.4.0 python-dateutil-2.8.1 pytz-2021.1 pyyaml-5.4.1 requests-2.25.1 scikit-image-0.18.1 scipy-1.6.3 setuptools-56.1.0 six-1.15.0 tifffile-2021.4.8 toml-0.10.2 toolz-0.11.1 tornado-6.1 typing-extensions-3.10.0.0 urllib3-1.26.4 xarray-0.17.0 zarr-2.8.1 zipp-3.4.1
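
Note that pip install --user puts console scripts under ~/.local/bin, so that directory needs to be on PATH for the pfbworkers entry point to resolve; a one-liner sketch:

$ export PATH="$HOME/.local/bin:$PATH"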

Then I removed the local python-casacore and checked which version it is using:

$ pip uninstall python-casacore
$ pip freeze | grep python-casacore
-e git+https://github.com/casacore/python-casacore.git@e5c1954115f08cf4c4468bd5514b44616e76e95d#egg=python_casacore

I assume this is the correct version @JSKenyon @sjperkins?
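
One way to double-check which libcasacore the bindings actually link against (a rough sketch; the _tables extension path is an assumption about the python-casacore layout):

$ python -c "import casacore; print(casacore.__version__, casacore.__file__)"
$ ldd $(python -c "import casacore.tables._tables as t; print(t.__file__)") | grep casa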

sjperkins commented 3 years ago

Do you have this kind of setup in your .bashrc (qsub aliases included because they may be useful)?

alias qsubint='qsub -I -P ASTR0123 -q serial -l select=1:ncpus=1:mpiprocs=1:nodetype=haswell_reg'
alias qsubscheduler='qsub -I -P ASTR0123 -q serial -l select=1:ncpus=4:mpiprocs=4:nodetype=haswell_reg'

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
#__conda_setup="$('/apps/chpc/astro/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
#if [ $? -eq 0 ]; then
#    eval "$__conda_setup"
#else
#    if [ -f "/apps/chpc/astro/anaconda3/etc/profile.d/conda.sh" ]; then
#        . "/apps/chpc/astro/anaconda3/etc/profile.d/conda.sh"
#    else
#        export PATH="/apps/chpc/astro/anaconda3/bin:$PATH"
#    fi
#fi
#unset __conda_setup
# <<< conda initialize <<<
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/apps/chpc/astro/anaconda3_dev/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/apps/chpc/astro/anaconda3_dev/etc/profile.d/conda.sh" ]; then
        . "/apps/chpc/astro/anaconda3_dev/etc/profile.d/conda.sh"
    else
        export PATH="/apps/chpc/astro/anaconda3_dev/bin:$PATH"
    fi
fi
unset __conda_setup
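
Beyond the aliases, a rough sketch of how the scheduler and workers might be brought up inside such a job before handing the address to pfbworkers via -ha (the port, thread count and node IP are placeholders, not CHPC-specific advice):

$ dask-scheduler --port 8786 &                            # prints tcp://<node-ip>:8786
$ dask-worker tcp://<node-ip>:8786 --nthreads 24 --memory-limit 0 &
$ pfbworkers dirty ... -ha 'tcp://<node-ip>:8786' ...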
landmanbester commented 3 years ago

The conda setup, yes; the aliases, no, but I'll add them. I assume I should replace ASTR0123 with the project I am registered for, viz. ASTR0967?

sjperkins commented 3 years ago

Hmmm, that works for me:

(base) [sperkins@login2 ~]$ module load chpc/astro/casacore/3.2.1_dev 
(base) [sperkins@login2 ~]$ conda create -n pfbenv
WARNING: A conda environment already exists at '/home/sperkins/.conda/envs/pfbenv'
Remove existing environment (y/[n])? y

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/sperkins/.conda/envs/pfbenv

Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate pfbenv
#
# To deactivate an active environment, use
#
#     $ conda deactivate

(base) [sperkins@login2 ~]$ conda activate pfbenv
(pfbenv) [sperkins@login2 ~]$ 

I wonder if it's because you're using zsh instead of bash?
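
If that is the culprit, evaluating the zsh hook directly might sidestep conda init (the anaconda3_dev path is copied from the block above):

$ eval "$(/apps/chpc/astro/anaconda3_dev/bin/conda shell.zsh hook)"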

sjperkins commented 3 years ago

Hmmmm, but following on from that I get python 2.7.5 in my conda env:

(base) [sperkins@login2 ~]$ conda activate pfbenv
(pfbenv) [sperkins@login2 ~]$ python --version
Python 2.7.5
(pfbenv) [sperkins@login2 ~]$ 
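One way to guard against the env falling back to the system interpreter is to pin Python when creating it (the 3.7 pin is an assumption matching the env in the first comment):

$ conda create -n pfbenv python=3.7
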
landmanbester commented 3 years ago

I think it switches to using python3 after

$ module load chpc/astro/casacore/3.2.1_dev

I got the cluster up and managed to connect a client to it, but something goes wrong when I try to run the application (note I also had to upgrade networkx to resolve a version conflict). The command and output are:

(pfbenv) [lbester@cnode1314 lustre]$ pfbworkers dirty -dc DATA -wc WEIGHT -o /mnt/lustre/users/lbester/pfbout/test -nb 4 -fov 1.5 -srf 2.0 -nthreads 24 -ha 'tcp://172.19.5.233:41127' msdir/point_gauss_nb.MS_p0
INFO      17:53:28 - DIRTY              | Input Options:
INFO      17:53:28 - DIRTY              |                    data_column = DATA
INFO      17:53:28 - DIRTY              |                  weight_column = WEIGHT
INFO      17:53:28 - DIRTY              |                output_filename = /mnt/lustre/users/lbester/pfbout/test
INFO      17:53:28 - DIRTY              |                          nband = 4
INFO      17:53:28 - DIRTY              |                  field_of_view = 1.5
INFO      17:53:28 - DIRTY              |        super_resolution_factor = 2.0
INFO      17:53:28 - DIRTY              |                       nthreads = 24
INFO      17:53:28 - DIRTY              |                   host_address = tcp://172.19.5.233:41127
INFO      17:53:28 - DIRTY              |          imaging_weight_column = None
INFO      17:53:28 - DIRTY              |                        epsilon = 1e-05
INFO      17:53:28 - DIRTY              |                         wstack = True
INFO      17:53:28 - DIRTY              |                   double_accum = True
INFO      17:53:28 - DIRTY              |                    flag_column = FLAG
INFO      17:53:28 - DIRTY              |                 mueller_column = None
INFO      17:53:28 - DIRTY              |                      cell_size = None
INFO      17:53:28 - DIRTY              |                             nx = None
INFO      17:53:28 - DIRTY              |                             ny = None
INFO      17:53:28 - DIRTY              |                      mem_limit = 0
INFO      17:53:28 - DIRTY              |                    output_type = f4
Traceback (most recent call last):
  File "/home/lbester/.local/bin/pfbworkers", line 33, in <module>
    sys.exit(load_entry_point('pfb-clean', 'console_scripts', 'pfbworkers')())
  File "/home/lbester/.local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/lbester/.local/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/lbester/.local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/lbester/.local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/lbester/.local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/lbester/pfb-clean/pfb/workers/grid/dirty.py", line 96, in dirty
    ms, nband=args.nband)
  File "/home/lbester/pfb-clean/pfb/utils/misc.py", line 229, in chan_to_band_mapping
    group_cols="__row__")
  File "/home/lbester/.local/lib/python3.7/site-packages/daskms/dask_ms.py", line 326, in xds_from_table
    **kwargs).datasets()
  File "/home/lbester/.local/lib/python3.7/site-packages/daskms/reads.py", line 413, in datasets
    np_sorted_row = sorted_rows.compute()
  File "/home/lbester/.local/lib/python3.7/site-packages/dask/base.py", line 285, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/lbester/.local/lib/python3.7/site-packages/dask/base.py", line 567, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/lbester/.local/lib/python3.7/site-packages/distributed/client.py", line 2666, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/home/lbester/.local/lib/python3.7/site-packages/distributed/client.py", line 1981, in gather
    asynchronous=asynchronous,
  File "/home/lbester/.local/lib/python3.7/site-packages/distributed/client.py", line 844, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/home/lbester/.local/lib/python3.7/site-packages/distributed/utils.py", line 353, in sync
    raise exc.with_traceback(tb)
  File "/home/lbester/.local/lib/python3.7/site-packages/distributed/utils.py", line 336, in f
    result[0] = yield future
  File "/home/lbester/.local/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/lbester/.local/lib/python3.7/site-packages/distributed/client.py", line 1840, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: ("('rows-214a61fd6efb6411f808fd1d69ce6c89', 0)", <Worker 'tcp://172.19.2.225:37631', name: 0, memory: 0, processing: 1>)
(pfbenv) [lbester@cnode1314 lustre]$ pfbworkers dirty -dc DATA -wc WEIGHT -o /mnt/lustre/users/lbester/pfbout/test -nb 4 -fov 1.5 -srf 2.0 -nthreads 24 -ha 'tcp://172.19.5.233:41127' msdir/point_gauss_nb.MS_p0

Note that the address it's complaining about, 'tcp://172.19.2.225:37631', is not the same as the address passed into the application. Any ideas what could be causing this?

sjperkins commented 3 years ago

You'll probably need to check the individual stderr logfiles to see why the worker died
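
The tcp://172.19.2.225:37631 in the KilledWorker message is the address of the worker that died, not the scheduler, so the mismatch with the -ha address is expected. If the stderr files are hard to track down, a quick sketch for pulling worker logs through the scheduler instead (address as passed via -ha):

$ python -c "from distributed import Client; c = Client('tcp://172.19.5.233:41127'); print(c.get_worker_logs())"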

landmanbester commented 3 years ago

I forgot to give the full path to the MS on the Lustre file system. I can now produce a dirty image and it looks identical to one made locally, so I guess it works. The dashboard looks like this:

[Dask dashboard screenshot]

although this is just a small test MS. I'm trying to move a bigger one over, but this is turning out to be more difficult than I thought.
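
For the record, a resumable-transfer sketch for getting the bigger MS onto Lustre (the source path is a placeholder; the destination follows the paths used above):

$ rsync -avP /path/to/big.ms lbester@scp.chpc.ac.za:/mnt/lustre/users/lbester/msdir/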