legacysurvey / legacypipe

Image reduction pipeline for the DESI Legacy Imaging Surveys, using the Tractor framework
BSD 3-Clause "New" or "Revised" License

problems with GALEX forced photometry in large galaxies #699

Closed moustakas closed 1 year ago

moustakas commented 2 years ago

I am building GALEX-through-WISE mosaics and running Tractor on a sample of very large (NGC-scale) galaxies and galaxy groups, and the last stage, forced GALEX photometry, is failing. Here's one example command and its traceback.

python /global/homes/i/ioannis/code/git/legacypipe/py/legacypipe/runbrick.py --radec 190.70299219786068 2.7037470387408766 \
  --width 4811 --height 4811 --pixscale 0.262 --threads 32 --outdir /global/cscratch1/sd/ioannis/virgofilaments-data/190/NGC4636_GROUP \
  --survey-dir /global/cfs/cdirs/cosmo/work/legacysurvey/dr9 --run south --skip-calibs \
  --checkpoint /global/cscratch1/sd/ioannis/virgofilaments-data/190/NGC4636_GROUP/NGC4636_GROUP-custom-checkpoint.p \
  --pickle /global/cscratch1/sd/ioannis/virgofilaments-data/190/NGC4636_GROUP/NGC4636_GROUP-custom-%%(stage)s.p \
  --galex --fit-on-coadds --no-ivar-reweighting
[snip]
Starting process 62069 Wall: -0.00 s, CPU: -0.00 s, VmPeak: 668 MB, VmSize: 668 MB, VmRSS: 71 MB, VmData: 136 MB, maxrss: 0.072256 MB
Running stage galex_forced
Running stage galex_forced at 2022-02-08T03:02:20.690856
Cut to 13 GALEX tiles
Reading GALEX tile MISDR1_13700_0522 band n
Reading GALEX tile MISDR1_13700_0522 band f
Reading GALEX tile MISGCSN1_13701_0229 band n
Reading GALEX tile AIS_228_sg79 band f
Reading GALEX tile MISGCSN1_13642_0229o band n
Reading GALEX tile AIS_228_sg88 band f
Reading GALEX tile GI6_012074_HRS241 band n
Reading GALEX tile AIS_228_sg89 band f
Reading GALEX tile MISGCSN1_13702_0229 band n
Reading GALEX tile AIS_228_sg90 band f
Reading GALEX tile AIS_228_sg96 band f
Reading GALEX tile MISGCSN1_13762_0229 band n
Reading GALEX tile MISGCSN3_13641_0229 band n
Reading GALEX tile AIS_228_sg79 band n
Reading GALEX tile AIS_228_sg88 band n
Reading GALEX tile AIS_228_sg89 band n
Reading GALEX tile AIS_228_sg90 band n
Reading GALEX tile AIS_228_sg96 band n
Reading GALEX tile MISGCSN1_13701_0229_css42303 band n
F0208 04:34:12.418515 62067 block_sparse_matrix.cc:79] Check failed: num_nonzeros_ >= 0 (-857378227 vs. 0)
*** Check failure stack trace: ***
    @     0x2aaadea6213d  google::LogMessage::Fail()
    @     0x2aaadea63fa3  google::LogMessage::SendToLog()
    @     0x2aaadea61ccb  google::LogMessage::Flush()
    @     0x2aaadea6498e  google::LogMessageFatal::~LogMessageFatal()
    @     0x2aaade616d47  ceres::internal::BlockSparseMatrix::BlockSparseMatrix()
    @     0x2aaade60c307  ceres::internal::BlockJacobianWriter::CreateJacobian()
    @     0x2aaade6db0bc  ceres::internal::TrustRegionPreprocessor::Preprocess()
    @     0x2aaade6d2f0f  ceres::Solver::Solve()
    @     0x2aaade6d4489  ceres::Solve()
    @     0x2aaade37d0d3  _ZN17_INTERNALdb2d2e0222real_ceres_forced_photIfEEP7_objectS2_S2_iiii.A
    @     0x2aaade377d70  _wrap_ceres_forced_phot.A
    @     0x5555556678be  _PyUnicode_EncodeCharmap.cold
    @     0x2aaadbace386  __Pyx_PyObject_Call
    @     0x2aaadbac7d60  __pyx_pf_7tractor_15ceres_optimizer_14CeresOptimizer_14_ceres_forced_photom
    @     0x2aaadbab9bdd  __pyx_pw_7tractor_15ceres_optimizer_14CeresOptimizer_15_ceres_forced_photom
    @     0x5555556bd8bb  binary_op
    @     0x2aaadbaaaade  __pyx_pw_7tractor_15ceres_optimizer_14CeresOptimizer_5_optimize_forcedphot_core
    @     0x5555556bd8bb  binary_op
    @     0x2aaadb88ffe6  __Pyx_PyObject_Call
    @     0x2aaadb85de38  __pyx_pf_7tractor_8optimize_9Optimizer_4forced_photometry
    @     0x2aaadb85b533  __pyx_pw_7tractor_8optimize_9Optimizer_5forced_photometry
    @     0x5555556bd8bb  binary_op
    @     0x2aaac0b2ff0f  __pyx_pf_7tractor_6engine_7Tractor_34optimize_forced_photometry
    @     0x2aaac0b2f82f  __pyx_pw_7tractor_6engine_7Tractor_35optimize_forced_photometry
    @     0x5555556fe7b3  unicode_upper
    @     0x5555556fee39  bytearray_append
    @     0x555555725dc9  list_ass_slice
    @     0x55555565be12  odict_repr.cold
    @     0x5555556b8dee  PyEval_EvalCodeEx
    @     0x5555557266cf  object_new
    @     0x5555556b8c0e  _PyEval_EvalCodeWithName
    @     0x5555556b9820  _PyObject_FastCallDict
Starting process 29501 Wall: -0.00 s, CPU: -0.00 s, VmPeak: 2098 MB, VmSize: 2098 MB, VmRSS: 1330 MB, VmData: 1404 MB, maxrss: 1.362128 MB

@dstndstn I'd be grateful for any thoughts or suggestions you may have.

moustakas commented 2 years ago

My initial thought was that the problem was related to the Ceres solver, so I added an option in #700 to not use Ceres. However, re-running the command above (with 32 cores; maybe that's the issue?) runs for hours and then hangs without making any progress.
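
One possible reading of the "Check failed: num_nonzeros_ >= 0 (-857378227 vs. 0)" line in the traceback above is that the count of non-zero Jacobian entries overflowed a signed 32-bit integer. The sketch below only checks that arithmetic. It assumes (this is not confirmed anywhere in this thread) that Ceres stores the count in a 32-bit `int` and that it wrapped around exactly once. The helper names `unwrap_int32` and `wrap_int32` are made up for illustration.

```python
INT32_MODULUS = 2**32   # number of distinct 32-bit values
INT32_MAX = 2**31 - 1   # largest value a signed 32-bit int can hold


def unwrap_int32(reported: int, wraps: int = 1) -> int:
    """Return the count that would read as `reported` after `wraps` wraparounds.

    ASSUMPTION: the number of wraparounds is not known from the log;
    one wrap gives the smallest count consistent with the message.
    """
    return reported + wraps * INT32_MODULUS


def wrap_int32(true_count: int) -> int:
    """Interpret the low 32 bits of `true_count` as a signed 32-bit integer."""
    low = true_count % INT32_MODULUS
    return low - INT32_MODULUS if low > INT32_MAX else low


reported = -857378227  # value printed in the "Check failed" line
candidate = unwrap_int32(reported)

print(candidate)                          # smallest count consistent with the log
print(candidate > INT32_MAX)              # True if it cannot fit in an int32
print(wrap_int32(candidate) == reported)  # round-trip consistency check
```

If this reading is right, the forced-photometry problem for this field has at least a few billion non-zero Jacobian entries. That would be consistent with the failure appearing only on very large mosaics, but it remains a hypothesis until checked against the Ceres source.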

moustakas commented 1 year ago

@dstndstn

Here's an MWE (for Perlmutter) in case you have time to look into this. Note that the version of legacypipe in the shifter container is fairly old, so you'll need to point to a more recent checkout of the repo: set the $LEGACYPIPE_CODE_DIR variable to one of your checkouts before running the script.

SHIFTER=docker:legacysurvey/legacyhalos:v1.2
shifterimg pull $SHIFTER
shifter --image $SHIFTER bash

export LEGACYPIPE_CODE_DIR=/global/homes/i/ioannis/code/git/legacypipe
export PYTHONPATH=$LEGACYPIPE_CODE_DIR/py:$PYTHONPATH

source /pscratch/sd/i/ioannis/NGC4631_GROUP/virgofilaments-env

Then:

python $LEGACYPIPE_CODE_DIR/py/legacypipe/runbrick.py --radec 190.5282741091857 32.54475159391664 \
  --width 6353 --height 6353 --pixscale 0.262 --threads 1 --outdir /pscratch/sd/i/ioannis/NGC4631_GROUP \
  --bands g,r,z --survey-dir /global/cfs/cdirs/cosmo/work/legacysurvey/dr9 --run north --skip-calibs \
  --checkpoint /pscratch/sd/i/ioannis/NGC4631_GROUP/NGC4631_GROUP-custom-checkpoint.p \
  --pickle="/pscratch/sd/i/ioannis/NGC4631_GROUP/NGC4631_GROUP-custom-%%(stage)s.p" --galex \
  --fit-on-coadds --no-ivar-reweighting

The output I get (on either an interactive/compute node or a login node) is:

runbrick.py starting at 2023-04-13T13:42:17.071301
legacypipe git version: DR10.1.0-3-g75b6c288
Command-line args: ['/global/homes/i/ioannis/code/git/legacypipe/py/legacypipe/runbrick.py', '--radec', '190.5282741091857', '32.54475159391664', '--width', '6353', '--height', '6353', '--pixscale', '0.262', '--threads', '1', '--outdir', '/pscratch/sd/i/ioannis/NGC4631_GROUP', '--bands', 'g,r,z', '--survey-dir', '/global/cfs/cdirs/cosmo/work/legacysurvey/dr9', '--run', 'north', '--skip-calibs', '--checkpoint', '/pscratch/sd/i/ioannis/NGC4631_GROUP/NGC4631_GROUP-custom-checkpoint.p', '--pickle=/pscratch/sd/i/ioannis/NGC4631_GROUP/NGC4631_GROUP-custom-%%(stage)s.p', '--galex', '--fit-on-coadds', '--no-ivar-reweighting']
python /global/homes/i/ioannis/code/git/legacypipe/py/legacypipe/runbrick.py --radec 190.5282741091857 32.54475159391664 --width 6353 --height 6353 --pixscale 0.262 --threads 1 --outdir /pscratch/sd/i/ioannis/NGC4631_GROUP --bands g,r,z --survey-dir /global/cfs/cdirs/cosmo/work/legacysurvey/dr9 --run north --skip-calibs --checkpoint /pscratch/sd/i/ioannis/NGC4631_GROUP/NGC4631_GROUP-custom-checkpoint.p --pickle=/pscratch/sd/i/ioannis/NGC4631_GROUP/NGC4631_GROUP-custom-%%(stage)s.p --galex --fit-on-coadds --no-ivar-reweighting

Parsed RA,Dec 190.5282741091857 32.54475159391664
Starting process 253034 Wall: -0.00 s, CPU: -0.00 s, VmPeak: 556 MB, VmSize: 556 MB, VmRSS: 121 MB, VmData: 141 MB, maxrss: 0.124464 MB
Runstage writecat
Runstage galex_forced
Runstage wise_forced
Reading pickle /pscratch/sd/i/ioannis/NGC4631_GROUP/NGC4631_GROUP-custom-wise_forced.p
Running stage galex_forced
Running stage galex_forced at 2023-04-13T13:42:19.621945
Cut to 8 GALEX tiles
Reading GALEX tile NGA_NGC4631 band n
Reading GALEX tile GI1_047080_NGC4656 band n
Reading GALEX tile GI4_016001_CVNIDWA band n
Reading GALEX tile AIS_113_sg36 band n
Reading GALEX tile AIS_113_sg37 band n
Reading GALEX tile AIS_113_sg46 band n
Reading GALEX tile AIS_113_sg47 band n
Reading GALEX tile AIS_113_sg48 band n
F0413 13:56:59.081884 253034 block_sparse_matrix.cc:79] Check failed: num_nonzeros_ >= 0 (-1756951292 vs. 0)
*** Check failure stack trace: ***
    @     0x7f1f912e813d  google::LogMessage::Fail()
    @     0x7f1f912e9fa3  google::LogMessage::SendToLog()
    @     0x7f1f912e7ccb  google::LogMessage::Flush()
    @     0x7f1f912ea98e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f1f91589d47  ceres::internal::BlockSparseMatrix::BlockSparseMatrix()
    @     0x7f1f9157f307  ceres::internal::BlockJacobianWriter::CreateJacobian()
    @     0x7f1f9164e0bc  ceres::internal::TrustRegionPreprocessor::Preprocess()
    @     0x7f1f91645f0f  ceres::Solver::Solve()
    @     0x7f1f91647489  ceres::Solve()
    @     0x7f1f919e0407  _ZN17_INTERNALdb2d2e0222real_ceres_forced_photIfEEP7_objectS2_S2_iiii.A
    @     0x7f1f919dbe85  _wrap_ceres_forced_phot.A
    @     0x55b4292a49a8  ast_for_expr
    @     0x7f1f940687ff  __Pyx_PyObject_Call
    @     0x7f1f9406202a  __pyx_pf_7tractor_15ceres_optimizer_14CeresOptimizer_14_ceres_forced_photom
    @     0x7f1f94053a91  __pyx_pw_7tractor_15ceres_optimizer_14CeresOptimizer_15_ceres_forced_photom
    @     0x7f20185e01ee  __Pyx_CyFunction_CallAsMethod
    @     0x55b429288361  os_DirEntry_is_symlink
    @     0x55b4292e7763  _PyObject_GenericGetAttrWithDict
    @     0x55b4292a4897  ast_for_expr
    @     0x7f1f94045892  __pyx_pw_7tractor_15ceres_optimizer_14CeresOptimizer_5_optimize_forcedphot_core
    @     0x7f20185e01ee  __Pyx_CyFunction_CallAsMethod
    @     0x55b429288361  os_DirEntry_is_symlink
    @     0x55b4292e7763  _PyObject_GenericGetAttrWithDict
    @     0x55b4292a4897  ast_for_expr
    @     0x7f1f942ba31f  __Pyx_PyObject_Call
    @     0x7f1f9428c1de  __pyx_pf_7tractor_8optimize_9Optimizer_4forced_photometry
    @     0x7f1f94289987  __pyx_pw_7tractor_8optimize_9Optimizer_5forced_photometry
    @     0x7f20185e01ee  __Pyx_CyFunction_CallAsMethod
    @     0x55b429288361  os_DirEntry_is_symlink
    @     0x55b4292e7763  _PyObject_GenericGetAttrWithDict
    @     0x55b4292a4897  ast_for_expr
    @     0x7f20185c6d86  __pyx_pw_7tractor_6engine_7Tractor_35optimize_forced_photometry
Aborted

moustakas commented 1 year ago

Some relevant info: