mhardcastle / ddf-pipeline

LOFAR pipeline using killms/ddfacet
GNU General Public License v2.0

Broken pipe error #215

Closed. duyhoang-astro closed this issue 4 years ago.

duyhoang-astro commented 4 years ago

Hi,

I am running ddf-pipeline and hitting a broken pipe error. Does anyone have an idea what causes it and how to fix it? I am running the pipeline from a Singularity image on a small machine (6 cores, 32 GB of RAM, 60 GB of /tmp). I know the machine does not meet the suggested hardware requirements for the pipeline, but I am not sure whether that is related to the error.


```
Traceback (most recent call last):
  File "/usr/lib64/python2.7/multiprocessing/queues.py", line 266, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe

FAILED to run DDF.py --Output-Name=image_full_wide --Data-MS=mslist.txt --Deconv-PeakFactor 0.001000 --Data-ColName CORRECTED_DATA --Parallel-NCPU=6 --Beam-CenterNorm=1 --Deconv-CycleFactor=0 --Deconv-MaxMinorIter=1000000 --Deconv-MaxMajorIter=2 --Deconv-Mode SSD --Beam-Model=LOFAR --Beam-LOFARBeamMode=A --Weight-Robust -0.200000 --Image-NPix=10000 --CF-wmax 50000 --CF-Nw 100 --Output-Also onNeds --Image-Cell 10.000000 --Facets-NFacets=11 --SSDClean-NEnlargeData 0 --Freq-NDegridBand 1 --Beam-NBand 1 --Facets-DiamMax 1.5 --Facets-DiamMin 0.1 --Deconv-RMSFactor=3.000000 --SSDClean-ConvFFTSwitch 10000 --Data-Sort 1 --Cache-Dir=. --Log-Memory 1 --GAClean-RMSFactorInitHMP 1.000000 --GAClean-MaxMinorIterInitHMP 10000.000000 --GAClean-AllowNegativeInitHMP True --DDESolutions-SolsDir=SOLSDIR --Cache-Weight=reset --Output-Mode=Clean --Output-RestoringBeam 45.000000 --Weight-ColName="None" --Freq-NBand=2 --RIME-DecorrMode=FT --SSDClean-SSDSolvePars [S,Alpha] --SSDClean-BICFactor 0 --Mask-Auto=1 --Mask-SigTh=15.00 --Selection-UVRangeKm=[0.100000,11.444444] --GAClean-MinSizeInit=10 --Beam-Smooth=1: return value is 1

Traceback (most recent call last):
  File "/opt/lofar/ddf-pipeline/scripts/pipeline.py", line 1819, in <module>
    main(o)
  File "/opt/lofar/ddf-pipeline/scripts/pipeline.py", line 1001, in main
    subtractOuterSquare(o)
  File "/opt/lofar/ddf-pipeline/scripts/pipeline.py", line 828, in subtractOuterSquare
    catcher=catcher)
  File "/opt/lofar/ddf-pipeline/scripts/pipeline.py", line 286, in ddf_image
    run(runcommand,dryrun=options['dryrun'],log=logfilename('DDF-'+imagename+'.log',options=options),quiet=options['quiet'])
  File "/opt/lofar/ddf-pipeline/utils/auxcodes.py", line 54, in run
    die('FAILED to run '+s+': return value is '+str(retval))
  File "/opt/lofar/ddf-pipeline/utils/auxcodes.py", line 36, in die
    raise Exception(s)
Exception: FAILED to run DDF.py --Output-Name=image_full_wide --Data-MS=mslist.txt --Deconv-PeakFactor 0.001000 --Data-ColName CORRECTED_DATA --Parallel-NCPU=6 --Beam-CenterNorm=1 --Deconv-CycleFactor=0 --Deconv-MaxMinorIter=1000000 --Deconv-MaxMajorIter=2 --Deconv-Mode SSD --Beam-Model=LOFAR --Beam-LOFARBeamMode=A --Weight-Robust -0.200000 --Image-NPix=10000 --CF-wmax 50000 --CF-Nw 100 --Output-Also onNeds --Image-Cell 10.000000 --Facets-NFacets=11 --SSDClean-NEnlargeData 0 --Freq-NDegridBand 1 --Beam-NBand 1 --Facets-DiamMax 1.5 --Facets-DiamMin 0.1 --Deconv-RMSFactor=3.000000 --SSDClean-ConvFFTSwitch 10000 --Data-Sort 1 --Cache-Dir=. --Log-Memory 1 --GAClean-RMSFactorInitHMP 1.000000 --GAClean-MaxMinorIterInitHMP 10000.000000 --GAClean-AllowNegativeInitHMP True --DDESolutions-SolsDir=SOLSDIR --Cache-Weight=reset --Output-Mode=Clean --Output-RestoringBeam 45.000000 --Weight-ColName="None" --Freq-NBand=2 --RIME-DecorrMode=FT --SSDClean-SSDSolvePars [S,Alpha] --SSDClean-BICFactor 0 --Mask-Auto=1 --Mask-SigTh=15.00 --Selection-UVRangeKm=[0.100000,11.444444] --GAClean-MinSizeInit=10 --Beam-Smooth=1: return value is 1

INFO: Cleaning up image...
```
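
For context (this is not pipeline code, just a minimal sketch of what errno 32 means at the OS level): a broken pipe is raised when a process writes into a pipe whose reading end has gone away, for example because the process on the other end died abruptly.

```python
# Minimal sketch of "[Errno 32] Broken pipe": writing into a pipe
# whose reading end has vanished, e.g. because the reader was killed.
import os

r, w = os.pipe()
os.close(r)                  # the reader disappears
try:
    os.write(w, b'data')     # Python ignores SIGPIPE by default,
                             # so this raises the errno 32 error
except OSError as e:
    print(e)                 # [Errno 32] Broken pipe
```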

twshimwell commented 4 years ago

Hey Duy, I'd guess it is probably related to the small machine. Do you have a much larger one, with say 200 GB of RAM?

mhardcastle commented 4 years ago

Indeed. The pipeline can't run (with normal settings on normal data) on a machine anything like this small. The broken pipe is probably from worker processes being killed by the kernel's OOM killer.
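
If you want to confirm the OOM hypothesis, the kernel logs each kill. A quick check (an illustrative snippet, assuming a Linux host where the kernel ring buffer is readable; run it on the host rather than inside the container) is:

```python
# Scan the kernel ring buffer for OOM-killer activity (may need root).
import subprocess

log = subprocess.check_output(['dmesg'], universal_newlines=True)
for line in log.splitlines():
    if 'oom-killer' in line or 'Out of memory' in line:
        print(line)
```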

duyhoang-astro commented 4 years ago

Hey Tim, Martin. Thanks for the replies. Yes, we have more powerful nodes (128 GB and 256 GB of RAM). I was just testing whether the pipeline works on a small machine so that we can expand our computing capacity. Can the pipeline distribute its tasks across multiple nodes? We have a number of small nodes here, and it would be nice if it could run on those.

mhardcastle commented 4 years ago

No, it can't distribute the tasks; DDFacet and kMS are the limiting factors here. There was some interest in a distributed DDFacet a while back, but it didn't come to anything. We have tested pretty extensively ourselves, and 192 GB is the minimum RAM for a full pipeline run with the standard configuration.
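
For anyone hitting this later: a pre-flight check before launching a run would fail fast instead of dying mid-image. This is a hypothetical sketch, not something ddf-pipeline itself provides, using the 192 GB figure quoted above:

```python
# Hypothetical pre-flight RAM check (not part of ddf-pipeline).
import os

def total_ram_gb():
    # POSIX: system page size times number of physical pages
    return os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES') / 2.0**30

if total_ram_gb() < 192:
    raise SystemExit('A full ddf-pipeline run needs ~192 GB of RAM; '
                     'this machine has %.0f GB.' % total_ram_gb())
```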

duyhoang-astro commented 4 years ago

Thanks Martin for the info. Cheers.