mhardcastle / ddf-pipeline

LOFAR pipeline using killms/ddfacet
GNU General Public License v2.0

Comparison between new DDF/kMS and old DDF/kMS #337

Open twshimwell opened 7 months ago

twshimwell commented 7 months ago

This issue is for the final checks of the new DDF/kMS before switching LoTSS processing to use the new rather than old versions.

6 LoTSS fields (P030+51, P054+36, P216+15, P269+63, P316+81 and P347+08) are being processed using the same parset and facet layout with the new and old versions of DDF/kMS.

I shall update this ticket as the fields finish processing but the new maps are stored in:

/beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/P???+??

Whereas the old ones are in:

/beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/comparison-maps/P???+??

twshimwell commented 7 months ago

The first run to produce final maps is P054+36. Attached is a comparison image_full_ampphase_di_m.NS.app.restored.fits. Left is with new kMS/DDF (/home/tasse/DDFSingularity/ddf_dev.sif and using the pipeline version in this pull request - https://github.com/mhardcastle/ddf-pipeline/pull/336) and right is the old version (from our LoTSS archive) that was made a few months ago using the old kMS/DDF (/data/lofar/mjh/ddf_py3_d11_ctmaster.sif).

When blinking between the images it is pretty tough to see any difference at all.
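
For a quick numerical cross-check alongside the visual blink, something like the sketch below subtracts the two restored maps and compares robust rms estimates. This is only an illustrative sketch, not part of the pipeline; the file paths are placeholders and astropy/numpy are assumed to be available.

```python
import numpy as np
from astropy.io import fits

# Placeholder paths for the new- and old-software restored maps.
new = np.squeeze(fits.getdata("new/image_full_ampphase_di_m.NS.app.restored.fits"))
old = np.squeeze(fits.getdata("old/image_full_ampphase_di_m.NS.app.restored.fits"))

def robust_rms(im):
    """Noise estimate from the median absolute deviation (NaN-safe)."""
    return 1.4826 * np.nanmedian(np.abs(im - np.nanmedian(im)))

diff = new - old
print(f"rms(new) = {robust_rms(new):.2e}  rms(old) = {robust_rms(old):.2e}")
print(f"max |new - old| = {np.nanmax(np.abs(diff)):.2e} (map units, Jy/beam)")
```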

Screenshot 2023-11-13 at 15 25 43

twshimwell commented 7 months ago

Comparison between final P316+81 maps. Left new and right old.

Here, arguably, the new is a little worse, with some negative holes around some of the brighter sources. The rms of the maps is a few % higher too (probably due to artifacts from the negative holes).

Screenshot 2023-11-14 at 11 21 14

cyriltasse commented 7 months ago

That's bad... It needs investigation... Do you see the holes in all facets? At what step do the differences appear? It could be the smoothing; in that case we could try smoothing the new solutions with the old SmoothSols.py and see what happens.
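
To illustrate the idea (a rough sketch only, not SmoothSols.py itself), a running median over the time axis of the gain amplitudes could be prototyped as below. It assumes, and this may well differ between killMS versions, that the solution file is a NumPy .npz whose 'Sols' record array holds a complex gain cube 'G' with time as its first axis; the file name and smoothing window are placeholders.

```python
import numpy as np
from scipy.ndimage import median_filter

# Placeholder file name; the actual solutions live under SOLSDIR in the run.
sols = np.load("killMS.DDS3_full_merged.sols.npz", allow_pickle=True)
G = sols["Sols"]["G"]  # complex gains; time assumed to be axis 0

window = 5  # number of solution intervals to smooth over (placeholder choice)
# Median-filter the amplitudes along time only, keeping the phases untouched.
amp_smooth = median_filter(np.abs(G), size=(window,) + (1,) * (G.ndim - 1))
G_smooth = amp_smooth * np.exp(1j * np.angle(G))
```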

twshimwell commented 7 months ago

Hmmm, it might be something to do with the DDS3_full_slow_merged solutions. The holes seem to appear for the first time in the final image (image_full_ampphase_di_m.NS), although it is a bit hard to tell for sure because of differences in artefacts, and they are not really apparent in the previous image_full_ampphase_di_m map.

Left is the image_full_ampphase_di_m.NS and right is the image_full_ampphase_di_m map. Colourscale is the same.

Screenshot 2023-11-14 at 11 48 15

twshimwell commented 7 months ago

P030+51 is showing a stranger issue: the bootstrap image comes out very oddly (see below).

The two low-declination fields both hang on the first kMS call for some reason. I've tried clearing the cache and using different nodes, but they hang at 99% of the kMS run on the first msfile. Odd.

Screenshot 2023-11-16 at 14 57 00

mhardcastle commented 7 months ago

Well that's a killms fail at the first dd stage... but presumably works OK in the old pipeline? or not?

twshimwell commented 7 months ago

Yeah, I don't have those images from the old pipeline, but it worked OK there because the bootstrap mask images are propagated, and for my new run those are pretty bad.

Indeed, I think something is wrong with the amplitude solutions from kMS, but I will do a test to check.

twshimwell commented 7 months ago

OK, so re-imaging the bootstrap bit with just P applied rather than AP more or less solves the issue. So the amplitude solutions are indeed really poor.

Looking at other data products, I also see that for these fields the image_dirin_SSD is better than the image_dirin_SSD_m_c_di, so there are also still issues at the DI solving step. I'll do a direct comparison with the old pipeline so we can compare these early pipeline products.

Hmm.

twshimwell commented 7 months ago

So the bootstrap issue seemed to be partly due to the fast dt_di that we use in the parset. Letting the dt_di be chosen automatically gives a slower dt_di and seemed to produce better results, with fewer diverging facets (I suspect in some cases it would be better still with somewhat longer intervals). I will continue the runs with an automatically chosen dt_di.

Note that for some reason the old kMS did not suffer so significantly from the dt_di being very fast (still not good though, with the image_dirin_SSD being generally better than the image_dirin_SSD_m_c_di).

twshimwell commented 7 months ago

P269+63 - no major differences for this one between the new and old kMS/DDF versions (a different dt_di was used in the older reduction due to ticket #315).

twshimwell commented 7 months ago

So to-date:

P269+63 & P054+36 approximately comparable between new and old versions.

P216+15 -- having issues running so no results yet.

All the other fields get worse unfortunately.

P030+51 - big differences at the bootstrap step with new kMS apparently producing much worse solutions from the DI maps.

P316+81 - negative holes in new kMS/DDF final maps but not in old one.

P347+08 - a bright source away from the centre causes much bigger issues in the new DDF/kMS than in the old one. The resulting image is far noisier.

The simplest field to test things on seems to be P030+51, because it goes wrong so early in the pipeline. I can try e.g. using the old kMS on the run with the new DDF and seeing the results, so we have a really clean comparison of just a single kMS step (no smooth sols or DDF pipeline). Then I think it might require @cyriltasse to take a look at kMS.

If anyone has suggestions for things to try let me know.

twshimwell commented 7 months ago

For P030+51 I've put data in /beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/P030+51-SAMEDICO-NEWSOFT and /beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/P030+51-SAMEDICO-OLDSOFT to investigate why the image_bootstrap maps are so different for this field when changing from the old to the new software.

These runs were stopped soon after the bootstrap mapping, so they contain all the appropriate weights. The attached shows images made using the command below; the left is the new-software dirty image and the right the old-software dirty image:

DDF.py --Misc-ConserveMemory=1 --Output-Name=image_test --Data-MS=mslist.txt --Deconv-PeakFactor 0.001000 --Data-ColName DATA_DI_CORRECTED --Parallel-NCPU=32 --Beam-CenterNorm=1 --Deconv-CycleFactor=0 --Deconv-MaxMinorIter=1000000 --Deconv-MaxMajorIter=2 --Deconv-Mode SSD --Beam-Model=LOFAR --Weight-Robust -0.150000 --Image-NPix=20000 --CF-wmax 50000 --CF-Nw 100 --Output-Also onNeds --Image-Cell 1.500000 --Facets-NFacets=11 --SSDClean-NEnlargeData 0 --Freq-NDegridBand 1 --Beam-NBand 1 --Facets-DiamMax 1.5 --Facets-DiamMin 0.1 --Deconv-RMSFactor=3.000000 --SSDClean-ConvFFTSwitch 10000 --Data-Sort 1 --Cache-Dir=. --Cache-DirWisdomFFTW=. --Debug-Pdb=never --Log-Memory 1 --GAClean-RMSFactorInitHMP 1.000000 --GAClean-MaxMinorIterInitHMP 10000.000000 --DDESolutions-SolsDir=SOLSDIR --Cache-Weight=reset --Beam-PhasedArrayMode=A --Misc-IgnoreDeprecationMarking=1 --Beam-At=facet --Output-Mode=Clean --Predict-ColName DD_PREDICT --Output-RestoringBeam 12.000000 --Weight-ColName="IMAGING_WEIGHT" --Freq-NBand=2 --RIME-DecorrMode=FT --SSDClean-SSDSolvePars [S,Alpha] --SSDClean-BICFactor 0 --Mask-Auto=1 --Mask-SigTh=5.00 --Mask-External=image_dirin_SSD_m_c_di_m.app.restored.fits.mask.fits --DDESolutions-GlobalNorm=None --DDESolutions-DDModeGrid=AP --DDESolutions-DDModeDeGrid=AP --DDESolutions-DDSols=DDS0 --Predict-InitDicoModel=image_dirin_SSD_m_c_di_m.DicoModel --Selection-UVRangeKm=[0.100000,1000.000000] --GAClean-MinSizeInit=10

Screenshot 2023-12-05 at 12 24 52

cyriltasse commented 6 months ago

Hi -

So after 10 days of intense debugging I managed to understand what was going on, and now the new and old versions of killMS give virtually the same results (left/right are old/new).

image

There were two bugs:

The new kMS has many more functionalities and is faster than the older one by ~30-40%

I'll design a kMS test function to do this more automatically in the future

cyriltasse commented 6 months ago

OK, I've built the new singularity image: /home/tasse/DDFSingularity/ddf_dev.sif

@twshimwell ready for testing

mhardcastle commented 6 months ago

Let me know if you need some nodes, @twshimwell.

twshimwell commented 6 months ago

Great stuff, Cyril. @mhardcastle I have 6 test fields and I already have node066. If possible, could I get another 5 nodes so I can do all 6 fields in parallel? Alternatively, with 2 more nodes I'll do two batches of 3 runs. Many thanks.

mhardcastle commented 6 months ago

I've reserved nodes001-005 for you, so go crazy (but remind me to remove the reservation again afterwards!).

twshimwell commented 6 months ago

Thanks. I'm just restarting all the runs and should have some results by Monday, hopefully. I had to choose a few different fields due to LINC reprocessing near A-team sources; anyway, that shouldn't really matter.

cyriltasse commented 5 months ago

Well - not sure why but I don't see it faster now, it's just the same speed... I'll dig a bit

cyriltasse commented 5 months ago

Well, I guess one starts to see differences in computing time when the datasets are bigger...

mhardcastle commented 5 months ago

We can look out for this in Tim's tests: he's using the same type of machine for both, and the timing info is all in the logs and can be extracted by the plotting code. That said, timing differences of tens of per cent can easily be caused just by some process hitting the storage. I have 26 ddf-pipeline instances running at Herts right now, so things may be a bit slow.
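
As a rough cross-check on the timing question, independent of the plotting code, one could compare the wall-clock span covered by matching logs from the two runs. A minimal sketch, assuming each log line starts with a "YYYY-MM-DD HH:MM:SS:" stamp (as in the kMS logs quoted later in this thread) and using placeholder file names:

```python
import re
from datetime import datetime

STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}):")

def wallclock_seconds(logfile):
    """Span between the first and last timestamped line of a log file."""
    stamps = []
    with open(logfile) as f:
        for line in f:
            m = STAMP.match(line)
            if m:
                stamps.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    return (stamps[-1] - stamps[0]).total_seconds() if stamps else 0.0

# Placeholder log names for the old- and new-software runs of the same step.
for run in ("old-run", "new-run"):
    print(run, wallclock_seconds(f"{run}/image_full_ampphase_di_m.NS.log"))
```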

twshimwell commented 5 months ago

Hmmm, I'm getting DDFacet issues on the image_dirin_SSD_m_c image (so the first one with the clustering). The command run is e.g.:

DDF.py --Misc-ConserveMemory=1 --Output-Name=image_dirin_SSD_m_c --Data-MS=mslist.txt --Deconv-PeakFactor 0.001000 --Data-ColName DATA --Parallel-NCPU=32 --Beam-CenterNorm=1 --Deconv-CycleFactor=0 --Deconv-MaxMinorIter=1000000 --Deconv-MaxMajorIter=1 --Deconv-Mode SSD --Beam-Model=LOFAR --Weight-Robust -0.150000 --Image-NPix=20000 --CF-wmax 50000 --CF-Nw 100 --Output-Also onNeds --Image-Cell 1.500000 --Facets-NFacets=11 --SSDClean-NEnlargeData 0 --Freq-NDegridBand 1 --Beam-NBand 1 --Facets-DiamMax 1.5 --Facets-DiamMin 0.1 --Deconv-RMSFactor=0.000000 --SSDClean-ConvFFTSwitch 10000 --Data-Sort 1 --Cache-Dir=. --Cache-DirWisdomFFTW=. --Debug-Pdb=never --Log-Memory 1 --GAClean-RMSFactorInitHMP 1.000000 --GAClean-MaxMinorIterInitHMP 10000.000000 --DDESolutions-SolsDir=SOLSDIR --Cache-Weight=reset --Beam-PhasedArrayMode=A --Misc-IgnoreDeprecationMarking=1 --Beam-At=facet --Output-Mode=Clean --Predict-ColName DD_PREDICT --Output-RestoringBeam 12.000000 --Weight-ColName="None" --Freq-NBand=2 --RIME-DecorrMode=FT --SSDClean-SSDSolvePars [S,Alpha] --SSDClean-BICFactor 0 --Facets-CatNodes=/beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/new-runs/uv-misc-files/P090+77/image_dirin_SSD_m.npy.ClusterCat.npy --Mask-Auto=1 --Mask-SigTh=15.00 --Mask-External=image_dirin_SSD.app.restored.fits.mask.fits --Predict-InitDicoModel=image_dirin_SSD_m.DicoModel --Selection-UVRangeKm=[0.100000,1000.000000] --GAClean-MinSizeInit=10

The error looks like:

twshimwell commented 5 months ago

(The previous singularity image I was using with the newer software was /beegfs/car/mjh/DDFSingularity/ddf.sif, which I believe is built from /beegfs/car/mjh/DDFSingularity/ddf-py3.singularity. With that one, DDFacet did not seem to have this issue. Also, the latest version of the ddf-pipeline master branch works fine with the new singularity image.)

mhardcastle commented 5 months ago

That's just a copy of Cyril's directory, so shouldn't be any different in itself.

Is this a clean run from the start?

twshimwell commented 5 months ago

Oh, strange. Yes, these are all clean runs from the start. The only thing different from normal runs is that I specify the clusterfile so that we have the same directions as the old comparison runs.

cyriltasse commented 5 months ago

@twshimwell can you point me to where the data is?

twshimwell commented 5 months ago

It's in /beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/new-runs. There is a folder for each field, and the pipeline parset files are in the /beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/new-runs/config-files folder.

twshimwell commented 5 months ago

An example logfile with the error is /beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/new-runs/P030+51/image_dirin_SSD_m_c.log

cyriltasse commented 5 months ago

It's weird because I was working on P030+51 too and I did not have this error... For the clusterfile I just copied the one from the old processing into the new directory and let the pipeline skip the computation, but I hardly see how this could happen.

twshimwell commented 5 months ago

Ah, OK. I wonder why I see the issue now then; I see it on all the fields I'm trying. So you simply copied the clusterfile and not any of the image_dirin_SSD_m_c products?

cyriltasse commented 5 months ago

OK, I've found a merging issue that went wrong on the branch I was using (my local branch was not affected by this bug). I've updated, fixed and pushed, and recompiled the image - you can probably resume your tests!

twshimwell commented 5 months ago

Great, thanks. All runs are resumed now and have made it past this step.

twshimwell commented 5 months ago

So far so good. I have the bootstrap images for all the fields now and they look good. Where I have corresponding images from the old software, they look very similar :)

cyriltasse commented 5 months ago

Very good! Keep us updated...

twshimwell commented 5 months ago

4 fields have already produced the image_ampphase_di images. Results are:

P090+77 - comparable but very slightly improved upon the previous old-software version :)
P030+51 - quite improved upon the previous old-software version (first image below shows what I mean by quite improved)
P050+26 - almost identical quality to the old software.
P316+81 - really quite improved upon the previous old-software version (second image below shows what I mean by really quite improved)

So that's really good news. I should have the first full-bandwidth images tomorrow, I hope, and fingers crossed things still look fine/good (previously we sometimes saw better intermediate products but comparable final products, so maybe the ones that are better now will become equal again at the next step or something).

In the meantime, thanks Cyril for putting so much effort into getting this working again. Maybe I'm too optimistic, but perhaps we should prepare to switch to the new software. We can then also turn dynspecms back on, as that's been off for a number of months now. The pipeline should be all ready for the new software (due to the changes to the weights in https://github.com/mhardcastle/ddf-pipeline/pull/336, which went into master already).

The images have the new-software runs on the left and the old-software runs on the right. The colour scales are the same and the processing used the same facet layout.

Screenshot 2024-01-10 at 11 01 22

Screenshot 2024-01-10 at 11 11 42

twshimwell commented 5 months ago

There may be a small issue with kMS. One thing that is impacting the run for P050+26 is that kMS hangs forever after writing the following to the logs:

2024-01-10 09:55:12: - 09:55:12 - ClassVisServer | Estimating PreApply directions at the center of the individual facets areas
2024-01-10 09:55:12: - 09:55:12 - ClassVisServer | PreApply Jones matrix has 94 directions
2024-01-10 09:55:12: - 09:55:12 - ClassVisServer | Update LOFAR beam in 94 directions [Dt = 5.0 min] ...
2024-01-10 09:56:18: - 09:56:18 - ClassVisServer | .... done Update LOFAR beam
2024-01-10 09:56:18: - 09:56:18 - ClassJonesDomains | Building VisToJones time mapping...
2024-01-10 09:56:19: - 09:56:19 - ClassJonesDomains | Building VisToJones freq mapping...
2024-01-10 09:56:19: - 09:56:19 - ClassVisServer | VisToSolsChanMapping [0, 0, 0, 0, 0, 0, 0, 0, ...., 0, 0, 0, 0, 0, 0, 0, 0], (len = 20)
2024-01-10 09:56:19: - 09:56:19 - ClassVisServer | Load vis chunk #0: done
2024-01-10 09:56:19: - 09:56:19 - ClassWirtingerSolver | DT=28801.000000, dt=60.000000, nt=480.000000
2024-01-10 09:56:19: Solving [Chunk 1/1] ............ 0/480 [ ] 00% - 0'00"
2024-01-10 09:56:19: - 09:56:19 - ClassWirtingerSolver | AllFlaggedThisTime
2024-01-10 09:56:19: - 09:56:19 - ClassWirtingerSolver | AllFlaggedThisTime
2024-01-10 09:56:19: - 09:56:19 - ClassWirtingerSolver | AllFlaggedThisTime
2024-01-10 09:56:19: - 09:56:19 - ClassWirtingerSolver | AllFlaggedThisTime
2024-01-10 09:56:19: - 09:56:19 - ClassWirtingerSolver | AllFlaggedThisTime
2024-01-10 09:56:19: - 09:56:19 - ClassWirtingerSolver | AllFlaggedThisTime
2024-01-10 09:56:19: - 09:56:19 - ClassWirtingerSolver | AllFlaggedThisTime
2024-01-10 09:56:19: - 09:56:19 - ClassWirtingerSolver | AllFlaggedThisTime
2024-01-10 09:56:19: - 09:56:19 - ClassWirtingerSolver | AllFlaggedThisTime
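
If it helps to rule out the obvious, the repeated AllFlaggedThisTime messages could be sanity-checked by measuring how much of the data in the MS is actually flagged. A minimal sketch using python-casacore (assumed to be available in the container; the MS name is a placeholder for the affected measurement set):

```python
import numpy as np
from casacore.tables import table

# Placeholder name; substitute the MS on which kMS reports AllFlaggedThisTime.
t = table("problem_field_uv_pre-cal.ms", readonly=True, ack=False)
# Read a manageable subset of rows; FLAG has shape (nrow, nchan, ncorr).
flags = t.getcol("FLAG", startrow=0, nrow=200000)
t.close()

print("flagged fraction in the first 200000 rows: %.3f" % np.mean(flags))
```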

cyriltasse commented 5 months ago

OK - probably a problem with the async load... I'll try to fix it ASAP.

cyriltasse commented 5 months ago

Can you tell me exactly which MS this is happening on, and with which command line?

twshimwell commented 5 months ago

Thanks!

Command is:

kMS.py --MSName L874610_121MHz_uv_pre-cal.ms --SolverType KAFCA --PolMode Scalar --BaseImageName image_ampphase1_di --NIterKF 6 --CovQ 0.100000 --LambdaKF=0.500000 --NCPU 32 --OutSolsName DDS2_full --InCol SCALED_DATA --DebugPdb=0 --Weighting Natural --UVMinMax=0.100000,1000.000000 --SolsDir=SOLSDIR --NChanSols 1 --BeamMode LOFAR --PhasedArrayMode=A --DDFCacheDir=. --BeamAt=facet --NodesFile /beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/new-runs/uv-misc-files/P050+26/image_dirin_SSD_m.npy.ClusterCat.npy --DicoModel image_ampphase1_di_masked.DicoModel --dt 1.000000

Folder for the run is:

/beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/new-runs/P050+26

twshimwell commented 5 months ago

Error occurs after running for about 5 mins.

twshimwell commented 5 months ago

Not sure if it is related, but I also have one other issue, on field P347+08 (directory /beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/new-runs/P347+08), when running the command:

kMS.py --MSName L762181_125MHz_uv_pre-cal.ms --SolverType KAFCA --PolMode Scalar --BaseImageName PredictDI_0 --NIterKF 6 --CovQ 0.100000 --LambdaKF=0.500000 --NCPU 32 --OutSolsName DIS0 --InCol DATA --DebugPdb=0 --Weighting Natural --UVMinMax=0.100000,1000.000000 --SolsDir=SOLSDIR --SolverType CohJones --PolMode IFull --SkyModelCol DD_PREDICT --OutCol DATA_DI_CORRECTED --ApplyToDir 0 --dt 0.200100 --NChanSols 5

kMS goes happily along until it hits 99% and then hangs forever, just outputting e.g.

2024-01-10 13:56:02: Solving [Chunk 1/1] ...........1197/1200 [=========================================== ] 99% - 16'24"

twshimwell commented 5 months ago

First full-bandwidth images are in for the P090+77 field, and the image_full_ampphase_di looks slightly improved upon the old-software version :) Just one more calibration step to go, but so far so good apart from the two hanging issues in the previous posts.

cyriltasse commented 5 months ago

Very good!

I think I've just finished fixing both issues above (dealing - again - with async stuff). I'll finish testing, recompile the image and let you know

twshimwell commented 5 months ago

Super!

cyriltasse commented 5 months ago

Ok it's going through now, and I've recompiled the image, you can proceed with the tests!

twshimwell commented 5 months ago

Thanks. Seems like both are now working.

twshimwell commented 5 months ago

More full-bandwidth images (image_full_ampphase_di).

The status is now:

P030+51 - almost identical to the old software.
P316+81 - a little bit improved.

So still all good. Earlier in the pipeline processing, P030+51 was looking much better with the new software (see a few posts ago) but is now approximately equal when all the bandwidth is included. I'm not sure why this happens, but we also saw it in the previous runs that worked (albeit those runs still contained the 2 bugs Cyril fixed in kMS).

cyriltasse commented 5 months ago

Really cool! How many more fields to go?

twshimwell commented 5 months ago

The 2 fields impacted by the bug you fixed are a few days behind. Hopefully we will have final images for at least 4 fields by Monday, and the following 2 fields soon afterwards.

Once they are done we can vote on switching to the new software straight away, if you, @cyriltasse, can't think of any other things we should test.

twshimwell commented 5 months ago

And one more new issue has popped up. When doing the QU imaging we get an error:

We run with --RIME-PolMode=QU but apparently this is no longer possible.