saopicc / DDFacet

DDFacet Imaging Project
GNU General Public License v2.0

Problem inverting Matrix, saving as errSVDArray_2453 #38

Open adrabent opened 2 years ago

adrabent commented 2 years ago

Dear all,

I am running the DDF-pipeline on a three-epoch observation. Judging by the images produced so far, the data quality looks good. During the bootstrap step, however, the pipeline crashes with the following error message:

 - 10:42:55 - ClassMultiScaleMachine       [16.2/18.9 30.6/33.4 27.7Gb] 48 scales and 7 scale functions in list
/usr/local/lib/python3.9/dist-packages/DDFacet/Imager/ClassPSFServer.py:336: RuntimeWarning: invalid value encountered in double_scalars
  FreqBandsFluxRatio[iAlpha,iChannel]=np.sqrt(np.sum(BeamFactor*((ThisFreqs/RefFreq)**ThisAlpha)**2))/np.sqrt(np.sum(BeamFactorWeightSq))
 - 10:43:04 - DDFacet                      [16.8/18.9 31.2/33.4 28.2Gb] Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/DDFacet/Array/ModLinAlg.py", line 229, in invSVD
    u,s,v=np.linalg.svd(Ar)
  File "<__array_function__ internals>", line 5, in svd
  File "/usr/lib/python3/dist-packages/numpy/linalg/linalg.py", line 1661, in svd
    u, s, vh = gufunc(a, signature=signature, extobj=extobj)
ValueError: On entry to DLASCL parameter number 4 had an illegal value

The extended error message looks like this:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/DDF.py", line 461, in <module>
    main(OP, messages)
  File "/usr/local/bin/DDF.py", line 295, in main
    Imager.main()
  File "/usr/local/lib/python3.9/dist-packages/DDFacet/Imager/ClassDeconvMachine.py", line 1222, in main
    repMinor, continue_deconv, update_model = self.DeconvMachine.Deconvolve()
  File "/usr/local/lib/python3.9/dist-packages/DDFacet/Imager/SSD/ClassImageDeconvMachineSSD.py", line 410, in Deconvolve
    self.InitIslands()
  File "/usr/local/lib/python3.9/dist-packages/DDFacet/Imager/SSD/ClassImageDeconvMachineSSD.py", line 323, in InitIslands
    self._init_InitMachine()
  File "/usr/local/lib/python3.9/dist-packages/DDFacet/Imager/SSD/ClassImageDeconvMachineSSD.py", line 151, in _init_InitMachine
    self.InitMachine.Init(self.DicoVariablePSF, self.GridFreqs, self.DegridFreqs)
  File "/usr/local/lib/python3.9/dist-packages/DDFacet/Imager/SSD/ClassInitSSDModelHMP.py", line 38, in Init
    self.InitMachine.Init(DicoVariablePSF, GridFreqs, DegridFreqs)
  File "/usr/local/lib/python3.9/dist-packages/DDFacet/Imager/SSD/ClassInitSSDModelHMP.py", line 243, in Init
    self.DeconvMachine.Init(PSFVar=self.DicoVariablePSF,PSFAve=self.DicoVariablePSF["PSFSideLobes"],
  File "/usr/local/lib/python3.9/dist-packages/DDFacet/Imager/MSMF/ClassImageDeconvMachineMSMF.py", line 199, in Init
    self.InitMSMF(approx=approx, cache=cache, facetcache=facetcache)
  File "/usr/local/lib/python3.9/dist-packages/DDFacet/Imager/MSMF/ClassImageDeconvMachineMSMF.py", line 285, in InitMSMF
    self._initMSM_facet(centralFacet,
  File "/usr/local/lib/python3.9/dist-packages/DDFacet/Imager/MSMF/ClassImageDeconvMachineMSMF.py", line 243, in _initMSM_facet
    MSMachine.MakeBasisMatrix()
  File "/usr/local/lib/python3.9/dist-packages/DDFacet/Imager/MSMF/ClassMultiScaleMachine.py", line 727, in MakeBasisMatrix
    self.DicoBasisMatrix = self.GiveBasisMatrix()
  File "/usr/local/lib/python3.9/dist-packages/DDFacet/Imager/MSMF/ClassMultiScaleMachine.py", line 770, in GiveBasisMatrix
    DicoBasisMatrix["BMT_BM_inv"] = np.float32(ModLinAlg.invSVD(BMT_BM))
  File "/usr/local/lib/python3.9/dist-packages/DDFacet/Array/ModLinAlg.py", line 238, in invSVD
    u,s,v=np.linalg.svd(np.complex64(Ar))#+np.random.randn(*Ar.shape)*(1e-10*np.abs(Ar).max()))
  File "<__array_function__ internals>", line 5, in svd
  File "/usr/lib/python3/dist-packages/numpy/linalg/linalg.py", line 1661, in svd
    u, s, vh = gufunc(a, signature=signature, extobj=extobj)
ValueError: On entry to DLASCL parameter number 4 had an illegal value

and it finishes with:

 - 10:43:04 - DDFacet                      [14.6/18.9 29.1/33.4 28.2Gb] There was a problem after 3m7.5s; if you think this is a bug please open an issue, 
 - 10:43:04 - DDFacet                      [14.6/18.9 29.1/33.4 28.2Gb]   quote your version of DDFacet and attach your logfile.
 - 10:43:04 - DDFacet                      [14.6/18.9 29.1/33.4 28.2Gb] You are using DDFacet revision: 0.6.0.0
 - 10:43:04 - DDFacet                      [14.6/18.9 29.1/33.4 28.2Gb] Your logfile is available here: /data/LOFAR/HBA/A2319_merged/image_bootstrap_L823390.log
Problem inverting Matrix, saving as errSVDArray_2453
  will make it svd-able
 - 10:43:08 - ClearSHM                     | Clear shared memory
 - 10:43:08 - ClearSHM                     | Clear Semaphores
 - 10:43:08 - ClearSHM                     | Clear shared dictionaries

followed by tracebacks that only mention the occurrence of the runtime error. I have attached the logfile to this ticket: image_bootstrap_L823390.log

What can I do to figure out why DDF fails? Has such a problem been seen before?

With kind regards, Alex
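A note on the error itself: as far as I know, the LAPACK message about DLASCL parameter number 4 is commonly reported when the matrix handed to the SVD contains NaN or Inf entries. The `RuntimeWarning` in `ClassPSFServer.py` just before the crash would fit that picture, since a ratio of two zero sums evaluates to NaN. The sketch below uses plain NumPy and is not DDFacet code; the helper names `report_nonfinite` and `safe_svd` are made up for illustration.

```python
import numpy as np


def report_nonfinite(matrix):
    """Count NaN and Inf entries in a matrix."""
    arr = np.asarray(matrix)
    return {"nan": int(np.isnan(arr).sum()), "inf": int(np.isinf(arr).sum())}


def safe_svd(matrix):
    """Run np.linalg.svd only if every entry is finite (hypothetical helper)."""
    arr = np.asarray(matrix, dtype=np.float64)
    if not np.all(np.isfinite(arr)):
        raise ValueError(f"matrix has non-finite entries: {report_nonfinite(arr)}")
    return np.linalg.svd(arr)


# A zero numerator over a zero denominator (for example, a band that
# carries no weight at all) yields NaN rather than an exception.
with np.errstate(invalid="ignore"):
    ratio = np.sqrt(np.float64(0.0)) / np.sqrt(np.float64(0.0))

# One NaN is enough to make the whole matrix unusable for the SVD.
bad = np.eye(3)
bad[1, 1] = ratio
print(report_nonfinite(bad))
```

If the array that DDFacet saved as errSVDArray_2453 can be loaded into NumPy, running a check like `report_nonfinite` on it should show whether non-finite values are the cause.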

cyriltasse commented 2 years ago

Can you look at the images that have already been produced? You might see something obvious. Maybe look at the dirty image, an intermediate residual, or the PSF?

adrabent commented 2 years ago

Dear @cyriltasse:

I do not see anything obviously wrong. Here are the images (from the bootstrap step):

restored: restored

PSF: psf

mask: mask

dirty: dirty

Those images look fine to me, but maybe I am overlooking something.

Cheers, Alex

cyriltasse commented 2 years ago

Yeah weird... can you save/check the cubes as well? Adding --Output-Cubes dp to your last DDF.py call will save the dirty and PSF cubes as well. There could be an empty slice or something weirder... If not, I'll take a look...
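One possible way to look for an empty slice without opening a viewer is to scan each frequency channel of the cube for all-zero or non-finite data. This is only a sketch: `find_bad_channels` is a hypothetical helper, and it assumes the cube has already been loaded as a NumPy array (for instance with `astropy.io.fits`) with frequency on axis 0.

```python
import numpy as np


def find_bad_channels(cube):
    """Return indices of channels that are all-zero or contain NaN/Inf.

    Assumes axis 0 is the frequency axis; the remaining axes are flattened.
    """
    cube = np.asarray(cube)
    flat = cube.reshape(cube.shape[0], -1)
    has_nonfinite = ~np.isfinite(flat).all(axis=1)
    all_zero = (flat == 0).all(axis=1)
    return np.flatnonzero(has_nonfinite | all_zero).tolist()


# Toy cube with 4 channels: channel 1 is empty, channel 3 holds a NaN.
demo = np.ones((4, 1, 2, 2))
demo[1] = 0.0
demo[3, 0, 0, 0] = np.nan
print(find_bad_channels(demo))
```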

adrabent commented 2 years ago

Okay, I tried:

 /usr/local/bin/DDF.py --Output-Name=image_bootstrap_L823390 --Data-MS=temp_mslist.txt --Deconv-PeakFactor 0.100000 --Data-ColName DATA_DI_CORRECTED --Parallel-NCPU=80 --Beam-CenterNorm=1 --Deconv-CycleFactor=0 --Deconv-MaxMinorIter=1000000 --Deconv-MaxMajorIter=5 --Deconv-Mode SSD --Beam-Model=LOFAR --Beam-LOFARBeamMode=A --Weight-Robust -0.250000 --Image-NPix=6000 --CF-wmax 50000 --CF-Nw 100 --Output-Also onNeds --Image-Cell 4.500000 --Facets-NFacets=11 --SSDClean-NEnlargeData 0 --Freq-NDegridBand 1 --Beam-NBand 1 --Facets-DiamMax 1.5 --Facets-DiamMin 0.1 --Deconv-RMSFactor=3.000000 --SSDClean-ConvFFTSwitch 10000 --Data-Sort 1 --Cache-Dir=. --Cache-DirWisdomFFTW=. --Debug-Pdb=never --Log-Memory 1 --GAClean-RMSFactorInitHMP 1.000000 --GAClean-MaxMinorIterInitHMP 10000.000000 --DDESolutions-SolsDir=SOLSDIR --Cache-Weight=reset --Misc-IgnoreDeprecationMarking=1 --Beam-At=facet --Output-Mode=Clean --Output-RestoringBeam 20.000000 --Weight-ColName=IMAGING_WEIGHT --Output-Cubes I --Freq-NBand=15 --RIME-DecorrMode=FT --SSDClean-SSDSolvePars [S,Alpha] --SSDClean-BICFactor 0 --Mask-Auto=1 --Mask-SigTh=15.00 --Mask-External=bootstrap_external_mask.fits --DDESolutions-DDModeGrid=AP --DDESolutions-DDModeDeGrid=AP --DDESolutions-DDSols=DDS0 --Selection-UVRangeKm=[0.100000,25.750000] --Cache-Dirty forcedirty --Cache-PSF force --GAClean-MinSizeInit=10 --Beam-Smooth=1 --Output-Cubes dp

But I only got those cubes:

image_bootstrap_L823362.cube.int.restored.fits             image_bootstrap_L823362.cube.int.restored.pybdsm.srl       image_bootstrap_L823376.cube.int.restored.pybdsm_rms.fits
image_bootstrap_L823362.cube.int.restored.fits.pybdsf.log  image_bootstrap_L823376.cube.int.restored.fits             image_bootstrap_L823376.cube.int.restored.pybdsm.srl
image_bootstrap_L823362.cube.int.restored.pybdsm_rms.fits  image_bootstrap_L823376.cube.int.restored.fits.pybdsf.log

But I see that in the case of L823362 the last 6 "cube channels" show invalid data. How can I deal with that? Should I remove these bands?

Alex

cyriltasse commented 2 years ago

Ok cool! getting closer :)

Is all the data flagged in these bands?
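To put a number on that, the flagged fraction per channel can be computed from a boolean flag array. The sketch below assumes the usual Measurement Set layout for a FLAG column, (rows, channels, correlations), already read into NumPy; `flagged_fraction_per_channel` is a hypothetical helper, not part of DDFacet.

```python
import numpy as np


def flagged_fraction_per_channel(flags):
    """Fraction of flagged samples per channel.

    Assumes flags has shape (n_rows, n_channels, n_correlations).
    """
    flags = np.asarray(flags, dtype=bool)
    # Average over rows (axis 0) and correlations (axis 2).
    return flags.mean(axis=(0, 2))


# Toy example: channel 2 fully flagged, channel 0 flagged in one row of four.
demo_flags = np.zeros((4, 3, 2), dtype=bool)
demo_flags[:, 2, :] = True
demo_flags[0, 0, :] = True

fractions = flagged_fraction_per_channel(demo_flags)
fully_flagged = np.flatnonzero(fractions == 1.0).tolist()
print(fractions, fully_flagged)
```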

adrabent commented 2 years ago

No.. not really, but the fraction of flagged data is pretty high (~30 .. 45%). I checked again, and the "invalid" data images in casaviewer only appear if I open both cubes (L823362 and L823376) at the same time, i.e., the two fields do not have identical frequency coverage (mostly some bands in the middle are missing).

Could this cause an issue?
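A quick way to see exactly which bands differ between the two observations is to compare their band centre frequencies. The sketch below is an illustration only: `missing_bands`, the tolerance, and the example frequencies are all invented, and the frequency lists are assumed to have been extracted from the data beforehand.

```python
import numpy as np


def missing_bands(freqs_a, freqs_b, tol_hz=1e3):
    """Return frequencies in freqs_a with no counterpart in freqs_b within tol_hz."""
    a = np.asarray(freqs_a, dtype=float)
    b = np.asarray(freqs_b, dtype=float)
    if b.size == 0:
        return a.tolist()
    # Distance from every band in a to its nearest band in b.
    nearest = np.abs(a[:, None] - b[None, :]).min(axis=1)
    return a[nearest > tol_hz].tolist()


# Made-up band centres in Hz: the second list lacks two middle bands.
obs_1 = [120e6, 122e6, 124e6, 126e6]
obs_2 = [120e6, 126e6]
print(missing_bands(obs_1, obs_2))
```

Running it in both directions shows whether the mismatch is one-sided or whether each observation lacks bands that the other has.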