Closed: landmanbester closed this issue 2 years ago
Do you have this kind of setup in your .bashrc
(qsub aliases included because they may be useful)?
alias qsubint='qsub -I -P ASTR0123 -q serial -l select=1:ncpus=1:mpiprocs=1:nodetype=haswell_reg'
alias qsubscheduler='qsub -I -P ASTR0123 -q serial -l select=1:ncpus=4:mpiprocs=4:nodetype=haswell_reg'
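For context, a hedged sketch of how those aliases fit into a dask session; the port, worker counts, and launch order are illustrative, not prescriptive, and assume dask's distributed package is installed in the env:

```shell
# Hypothetical session on the interactive node the alias lands you on:
qsubscheduler                           # 4-core interactive job via the alias above
dask-scheduler --port 41127 &           # note the tcp:// address it prints
dask-worker tcp://$(hostname -i):41127 \
    --nprocs 1 --nthreads 24 &          # one worker process per compute node
# the scheduler's tcp:// address is what pfbworkers receives via -ha
```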
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
#__conda_setup="$('/apps/chpc/astro/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
#if [ $? -eq 0 ]; then
#    eval "$__conda_setup"
#else
#    if [ -f "/apps/chpc/astro/anaconda3/etc/profile.d/conda.sh" ]; then
#        . "/apps/chpc/astro/anaconda3/etc/profile.d/conda.sh"
#    else
#        export PATH="/apps/chpc/astro/anaconda3/bin:$PATH"
#    fi
#fi
#unset __conda_setup
# <<< conda initialize <<<

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/apps/chpc/astro/anaconda3_dev/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/apps/chpc/astro/anaconda3_dev/etc/profile.d/conda.sh" ]; then
        . "/apps/chpc/astro/anaconda3_dev/etc/profile.d/conda.sh"
    else
        export PATH="/apps/chpc/astro/anaconda3_dev/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
The conda setup yes, the aliases no, but I'll add them. I assume I should replace ASTR0123 with the project I am registered for, viz. ASTR0967?
Hmmm, that works for me:
(base) [sperkins@login2 ~]$ module load chpc/astro/casacore/3.2.1_dev
(base) [sperkins@login2 ~]$ conda create -n pfbenv
WARNING: A conda environment already exists at '/home/sperkins/.conda/envs/pfbenv'
Remove existing environment (y/[n])? y
Collecting package metadata (current_repodata.json): done
Solving environment: done
## Package Plan ##
environment location: /home/sperkins/.conda/envs/pfbenv
Proceed ([y]/n)? y
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
# $ conda activate pfbenv
#
# To deactivate an active environment, use
#
# $ conda deactivate
(base) [sperkins@login2 ~]$ conda activate pfbenv
(pfbenv) [sperkins@login2 ~]$
I wonder if it's because you're using zsh instead of bash?
Hmmm, but following on from that, I get Python 2.7.5 in my conda env:
(base) [sperkins@login2 ~]$ conda activate pfbenv
(pfbenv) [sperkins@login2 ~]$ python --version
Python 2.7.5
(pfbenv) [sperkins@login2 ~]$
I think it switches to using python3 after
$ module load chpc/astro/casacore/3.2.1_dev
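A likely explanation, assuming the env was created bare: `conda create -n pfbenv` with no package spec installs no Python at all, so `python` inside the activated env falls through to the system interpreter (2.7.5 here) until the casacore module prepends its own. Pinning an interpreter at creation time sidesteps this; the exact version below is a guess, match it to what pfb-clean supports:

```shell
# Pin the interpreter when creating the env (version is an assumption):
#   conda create -n pfbenv python=3.7
# Quick check of what the shell actually resolves to:
pybin=$(command -v python3 || command -v python)
pyver=$("$pybin" --version 2>&1)
echo "$pyver"
```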
I got the cluster up and managed to connect to the client, but something goes wrong when I try to run the application (note I also had to upgrade networkx to resolve a version conflict). The command and output are:
(pfbenv) [lbester@cnode1314 lustre]$ pfbworkers dirty -dc DATA -wc WEIGHT -o /mnt/lustre/users/lbester/pfbout/test -nb 4 -fov 1.5 -srf 2.0 -nthreads 24 -ha 'tcp://172.19.5.233:41127' msdir/point_gauss_nb.MS_p0
INFO 17:53:28 - DIRTY | Input Options:
INFO 17:53:28 - DIRTY | data_column = DATA
INFO 17:53:28 - DIRTY | weight_column = WEIGHT
INFO 17:53:28 - DIRTY | output_filename = /mnt/lustre/users/lbester/pfbout/test
INFO 17:53:28 - DIRTY | nband = 4
INFO 17:53:28 - DIRTY | field_of_view = 1.5
INFO 17:53:28 - DIRTY | super_resolution_factor = 2.0
INFO 17:53:28 - DIRTY | nthreads = 24
INFO 17:53:28 - DIRTY | host_address = tcp://172.19.5.233:41127
INFO 17:53:28 - DIRTY | imaging_weight_column = None
INFO 17:53:28 - DIRTY | epsilon = 1e-05
INFO 17:53:28 - DIRTY | wstack = True
INFO 17:53:28 - DIRTY | double_accum = True
INFO 17:53:28 - DIRTY | flag_column = FLAG
INFO 17:53:28 - DIRTY | mueller_column = None
INFO 17:53:28 - DIRTY | cell_size = None
INFO 17:53:28 - DIRTY | nx = None
INFO 17:53:28 - DIRTY | ny = None
INFO 17:53:28 - DIRTY | mem_limit = 0
INFO 17:53:28 - DIRTY | output_type = f4
Traceback (most recent call last):
  File "/home/lbester/.local/bin/pfbworkers", line 33, in <module>
    sys.exit(load_entry_point('pfb-clean', 'console_scripts', 'pfbworkers')())
  File "/home/lbester/.local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/lbester/.local/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/lbester/.local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/lbester/.local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/lbester/.local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/lbester/pfb-clean/pfb/workers/grid/dirty.py", line 96, in dirty
    ms, nband=args.nband)
  File "/home/lbester/pfb-clean/pfb/utils/misc.py", line 229, in chan_to_band_mapping
    group_cols="__row__")
  File "/home/lbester/.local/lib/python3.7/site-packages/daskms/dask_ms.py", line 326, in xds_from_table
    **kwargs).datasets()
  File "/home/lbester/.local/lib/python3.7/site-packages/daskms/reads.py", line 413, in datasets
    np_sorted_row = sorted_rows.compute()
  File "/home/lbester/.local/lib/python3.7/site-packages/dask/base.py", line 285, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/lbester/.local/lib/python3.7/site-packages/dask/base.py", line 567, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/lbester/.local/lib/python3.7/site-packages/distributed/client.py", line 2666, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/home/lbester/.local/lib/python3.7/site-packages/distributed/client.py", line 1981, in gather
    asynchronous=asynchronous,
  File "/home/lbester/.local/lib/python3.7/site-packages/distributed/client.py", line 844, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/home/lbester/.local/lib/python3.7/site-packages/distributed/utils.py", line 353, in sync
    raise exc.with_traceback(tb)
  File "/home/lbester/.local/lib/python3.7/site-packages/distributed/utils.py", line 336, in f
    result[0] = yield future
  File "/home/lbester/.local/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/lbester/.local/lib/python3.7/site-packages/distributed/client.py", line 1840, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: ("('rows-214a61fd6efb6411f808fd1d69ce6c89', 0)", <Worker 'tcp://172.19.2.225:37631', name: 0, memory: 0, processing: 1>)
Note that the address it's complaining about, 'tcp://172.19.2.225:37631', is not the same as the address passed into the application. Any ideas what could be causing this?
You'll probably need to check the individual worker stderr logfiles to see why the worker died. The address in the KilledWorker exception is that of the worker processing the task, not the scheduler address you passed via -ha, so the mismatch is expected.
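A sketch of where to look, assuming the workers run under PBS, which drops each job's stderr into a `<jobname>.e<jobid>` file in the submission directory (the file naming and the grep patterns are assumptions; adjust for how your workers were actually started):

```shell
# Scan worker stderr files for the usual causes of a KilledWorker:
# OOM kills, import errors, plain tracebacks.
scan_logs() {
    grep -n -i -E 'killed|memory|traceback|error' "$@" 2>/dev/null
}
# e.g. from the directory the worker jobs were submitted in:
#   scan_logs *.e*
```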
I forgot to give the full path to the MS on the Lustre filesystem. I can now produce a dirty image and it looks identical to one made locally, so I guess it works. The dashboard looks like this:
[dashboard screenshot]
although it is just a small test MS. Trying to move a bigger one over, but this is turning out to be more difficult than I thought.
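For the transfer, something along these lines usually works; rsync resumes interrupted copies, which matters for a large MS. The host and destination path are taken from earlier in the thread, and the MS name is a placeholder:

```shell
# Resumable transfer of a large MS to the Lustre scratch area
# ('bigger.MS' is a placeholder name):
rsync -avP --partial bigger.MS \
    lbester@scp.chpc.ac.za:/mnt/lustre/users/lbester/msdir/
```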
Attempting to set up pfb-clean in a virtualenv on the CHPC; I thought I would document it here for future reference. I have done (on scp.chpc.ac.za):
That complained with:
So I ran:
which seems to put me in a Python 3.7.6 virtualenv. I then did (after cloning):
which seems to work
Then I removed the local python-casacore and checked which version it is using.
I assume this is the correct version, @JSKenyon @sjperkins?
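For the record, a sketch of how to confirm which python-casacore the env actually resolves; the uninstall mirrors the removal step above, and the import check simply prints the file path so you can see whether it comes from the module tree or from ~/.local:

```shell
# Remove any locally pip-installed copy, then check what is left:
pip uninstall -y python-casacore
python -c "import casacore; print(casacore.__file__)"   # which tree it imports from
pip show python-casacore                                # which version pip reports
```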