Closed: mhardcastle closed this issue 10 months ago
Nice ! :)
Can you show me the command line and things I need to reproduce?
Oops sorry -- it's this
kMS.py --MSName L841876_123MHz_uv_pre-cal.ms --SolverType KAFCA --PolMode Scalar --BaseImageName image_full_ampphase_di_m --dt 43.630000 --NIterKF 6 --CovQ 0.100000 --LambdaKF=0.500000 --NCPU 32 --OutSolsName DDS3_full_slow --InCol DATA_DI_CORRECTED --DebugPdb=0 --Weighting Natural --UVMinMax=0.500000,1000.000000 --SolsDir=SOLSDIR --PreApplySols=[DDS3_full_smoothed] --NChanSols 1 --BeamMode LOFAR --PhasedArrayMode=A --DDFCacheDir=. --BeamAt=facet --NodesFile image_dirin_SSD_m.npy.ClusterCat.npy --DicoModel image_full_ampphase_di_m_masked.DicoModel
and data are in /beegfs/car/mjh/P182+85-new
on Herts cluster.
Ok, so I saw that DATA_DI_CORRECTED has super-high values on some baselines, leading to crazy-high Jones solutions, crazy smoothing, and the error we are seeing. These are the DI solutions before (left) and now (right).
So something is obviously going wrong. I'll have to dig more to understand why the estimate goes to super-low values on some stations (probably the predict is screwed up)...
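To spot which baselines carry the bad data, one quick sketch is to compare each baseline's median visibility amplitude against the global median. This is a hypothetical helper using plain numpy on already-loaded amplitudes; in practice you would read the DATA_DI_CORRECTED column (e.g. with python-casacore) rather than use the synthetic arrays below.

```python
import numpy as np

def flag_high_baselines(amps, ant1, ant2, threshold=10.0):
    """Return the set of (ant1, ant2) baselines whose median visibility
    amplitude exceeds `threshold` times the global median amplitude."""
    global_med = np.median(amps)
    flagged = set()
    # tolist() gives plain Python ints so the baseline keys hash cleanly
    for bl in set(zip(ant1.tolist(), ant2.tolist())):
        sel = (ant1 == bl[0]) & (ant2 == bl[1])
        if np.median(amps[sel]) > threshold * global_med:
            flagged.add(bl)
    return flagged

# Synthetic example: baseline (0, 2) has amplitudes ~1000x the others
ant1 = np.array([0, 0, 1, 0, 0, 1])
ant2 = np.array([1, 2, 2, 1, 2, 2])
amps = np.array([1.0, 1e3, 1.1, 0.9, 2e3, 1.0])
print(flag_high_baselines(amps, ant1, ant2))
```

The threshold of 10x the median is an arbitrary illustration; "super-high" baselines like the ones described here would stand out at almost any cutoff.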
In this commit I have commented out the part of the code that sets the unsolved Jones matrices to zero (that was connected to another bug). But the pipeline was partly run before this change, so the DDFacet predicts are affected by it. So it could (or not :)) be the cause of the problem, and I'm rerunning the pipeline from scratch with consistent handling of the unsolved times. It will take a few days...
Ok - I ran from scratch and the changes I made don't fix the problem unfortunately. I'll have to dig more...
From some tests I was doing with SSD2, for some reason I was finding the DI cal worked far better with SSD2 than with SSD. The clean component models looked incredibly similar though. I had also wondered whether the SSD predict might be having some issues.
For the field where I was getting this bug I can try switching to SSD2 and seeing if that helps.
Well, switching to SSD2 didn't make any difference sadly. I guess it would have been surprising if it had.
Are there any tests I can do to help debug this one @cyriltasse? Looking at Martin's images for P182+85 (/beegfs/car/mjh/P182+85-new versus /beegfs/car/mjh/P182+85), you can see that the quality of the images made with the new software is already quite a bit worse at the image_phase_di step.
Yes, I also see things happening early on in the calibration... A colleague is visiting this week, and I won't have time to dig before next week though.
@mhardcastle I'm back to this issue - is there a way I can run the older version of the pipeline?
I think you can use this - /data/lofar/mjh/ddf_py3_d11.simg
Thanks! I've started a job up to DIS0 with the older pipeline to try to understand the differences...
Martin had done a run with the old (/beegfs/car/mjh/P182+85) and new versions (/beegfs/car/mjh/P182+85-new). If you wanted to copy files or anything from them to save making some of the images again.
/data/lofar/mjh/ddf_py3_d11_ctmaster.sif is the version we've been using in production most recently and is what's used for the P182+85 image.
Yeah, but I want/have to regenerate the predicted columns as well, with the same clustering etc., because everything is tied together...
Trying to investigate where there may be differences between the new and old software. One test I thought was quite interesting was
So here I am simply changing the kms software version used for the calibration of the exact same dataset and exact same model.
I'm running the two versions of the software back to back; so far, at the DIS0 step, all seems identical. Resuming from that step; I'll keep you posted. @twshimwell, when do you start seeing differences?
I was seeing them at the image_phase step (for my tests I had bootstrap turned off, but I think it'd already be apparent in those maps as they use the DDS0 solutions).
Ok thanks! I have disabled bootstrap as well...
Ok - this took me a long while, but I now have something tested up to ampphase1 that gives qualitatively similar results! There were differences in the SmoothSols.py
and in the way the non solved times were treated. These differences were propagating in the pipeline. I'm regenerating the ddf_dev.sif
so you can test it too. It should be finished in 10 minutes, but I'll let you know when it's done
Ok - you can have a try with /home/tasse/DDFSingularity/ddf_dev.sif
Thanks Cyril. I'll give it a go today.
It looks like the version of the actual pipeline in the singularity image has a few things changed, like bootstrap commented out (we do have options for this, you know...), so I'll remake a version of the singularity image with the current working pipeline.
I don't get it; I had disabled it for my tests, but re-enabled it for building the image... And looking at the code in the image, I don't see what you mention; can you point me to what you see?
My bad. I was looking at the commits for your branch but missed the final one where it's uncommented again.
I still probably want to use the bugfixes branch as there are other logic changes there, but let me check whether that's going to cause any issue.
I was trying this on a quick test where I run the pipeline with just 10 SBs, once with the old software (the singularity image Martin uses for regular processing) and once with Cyril's new singularity image. Results are pretty different from last time...
Now at the image_dirin_SSD_di step I see that the new software is producing better images... Not sure why but they are substantially better. Presently up to the image_ampphase step and the new software images continue to look a fair bit better
It's not strikingly better on my tests but it is definitely different...
So much has changed. The differences could come from many things, and I could dig if necessary. But as long as it's qualitatively similar, that's good enough for me. What do you think?
So far so good as far as I'm concerned -- we'll know more in a couple of days.
Are you doing full bandwidth runs with the same clustering? Is that your P182+85-new folder and the P182+85 one?
Full bandwidth but not the same clustering, because I always forget to do that! But yes that's the comparison.
Hey - how are things going?
On smp4 I keep having this error
- 17:58:13 - ClassJones [8.0/13.1 9.6/15.7 68.6Gb] using cached Jones matrices from ./L841876_166MHz_uv_pre-cal.ms.F0.D0.ddfcache/R0:6151045/JonesNorm_killMS
- 17:58:13 - ClassJones [8.0/13.1 9.6/15.7 68.6Gb] ./L841876_166MHz_uv_pre-cal.ms.F0.D0.ddfcache/R0:6151045/JonesNorm_killMS.npz loaded
- 17:58:17 - ClassJones [8.9/13.1 10.5/15.7 68.6Gb] ./L841876_166MHz_uv_pre-cal.ms.F0.D0.ddfcache/R0:6151045/JonesNorm_killMS.npy loaded
- 17:58:18 - ClassJones [9.8/13.1 11.5/15.7 69.5Gb] using cached Jones matrices from ./L841876_166MHz_uv_pre-cal.ms.F0.D0.ddfcache/R0:6151045/JonesNorm_Beam.npz
- 17:58:18 - ClassJones [9.8/13.1 11.5/15.7 69.5Gb] ./L841876_166MHz_uv_pre-cal.ms.F0.D0.ddfcache/R0:6151045/JonesNorm_Beam.npz.npz loaded
- 17:58:18 - ClassJones [9.8/13.1 11.5/15.7 69.5Gb] ./L841876_166MHz_uv_pre-cal.ms.F0.D0.ddfcache/R0:6151045/JonesNorm_Beam.npz.npy loaded
- 17:59:15 - AsyncProcessPool [12.7/15.6 74.6/77.6 69.6Gb] Grid PSF 23.1: 94 jobs complete, average single-core time 36.17s per job
- 17:59:15 - ClassVisServer [12.7/15.6 74.6/77.6 69.6Gb] Delete shared dict /dev/shm/ddf.29742/DATA:22:0
- 17:59:16 - AsyncProcessPool [12.7/15.6 74.2/77.6 62.6Gb] Reading 24.1: 1 jobs complete, average single-core time 538.73s per job
- 17:59:16 - ClassImagerDeconv [12.7/15.6 74.6/77.6 62.6Gb] sparsify 0.000000
- 17:59:34 - ClassCasaImage [14.2/15.6 76.1/77.6 62.6Gb] ----> Save image data as FITS file image_full_ampphase_di.model.fits
- 17:59:36 - ClassImagerDeconv [12.7/15.6 74.6/77.6 62.6Gb] model image @[166.11175537] MHz (min,max) = (-0.072791, 0.875212)
- 18:01:33 - AsyncProcessPool [12.7/15.6 74.6/77.6 62.6Gb] Degrid 24.1: 94 jobs complete, average single-core time 36.38s per job
- 18:07:28 - AsyncProcessPool [12.7/15.6 74.6/77.6 62.7Gb] Grid 24.1: 94 jobs complete, average single-core time 41.88s per job
- 18:07:28 - AsyncProcessPool [12.7/15.6 74.6/77.6 62.7Gb] Stack Beam 24.1: 121 jobs complete, average single-core time 59.49s per job
- 18:09:17 - AsyncProcessPool [12.7/15.6 74.6/77.6 62.8Gb] Grid PSF 24.1: 94 jobs complete, average single-core time 37.77s per job
- 18:09:17 - ClassVisServer [12.7/15.6 74.6/77.6 62.8Gb] Delete shared dict /dev/shm/ddf.29742/DATA:23:0
- 18:09:18 - ClassImagerDeconv [12.7/15.6 74.2/77.6 55.8Gb] no more data: EndOfObservation
- 10:19:05 - AsyncProcessPool [1.5/13.1 3.1/15.7 55.8Gb] Ctrl+C caught, exiting
- 10:19:05 - AsyncProcessPool [1.5/7.7 4.0/13.6 55.8Gb] Ctrl+C caught, exiting
- 10:19:05 - AsyncProcessPool [1.5/7.7 4.0/13.6 55.8Gb] Ctrl+C caught, exiting
- 10:19:05 - AsyncProcessPool [1.5/7.7 4.0/13.2 55.8Gb] Ctrl+C caught, exiting
- 10:19:05 - AsyncProcessPool [1.5/7.7 4.0/13.3 55.8Gb] Ctrl+C caught, exiting
- 10:19:05 - AsyncProcessPool [1.4/7.8 4.0/13.6 55.8Gb] Ctrl+C caught, exiting
- 10:19:05 - AsyncProcessPool [1.4/7.8 4.0/13.6 55.8Gb] Ctrl+C caught, exiting
- 10:19:05 - AsyncProcessPool [1.4/7.8 4.0/13.6 55.8Gb] Ctrl+C caught, exiting
- 10:19:05 - AsyncProcessPool [1.4/7.8 4.0/13.6 55.8Gb] Ctrl+C caught, exiting
- 10:19:05 - AsyncProcessPool [1.4/7.7 4.0/13.5 55.8Gb] Ctrl+C caught, exiting
- 10:19:05 - AsyncProcessPool [1.5/7.7 4.0/13.4 55.8Gb] Ctrl+C caught, exiting
It's like it hangs, and 4 days later (at 10:19) the Ctrl+C happens... On the shell there is also a bus error... Have you seen such a thing too?
Ok - I just understood the Ctrl+C was me :) preceded by a bus error. I'll try to understand why that happens, but I'd be interested to know if you've seen that too.
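A bus error during shared-memory-heavy processing is often (though not always) a sign that the shared-memory mount filled up mid-run; that is an assumption about the cause here, not a diagnosis, but it is cheap to check. The log above shows DDFacet keeping its shared dicts under /dev/shm/ddf.&lt;pid&gt;/, so a minimal sketch of the check:

```python
import shutil

def shm_usage(path="/dev/shm"):
    """Return (used_fraction, free_bytes) for the given mount.
    DDFacet keeps its shared dicts under /dev/shm/ddf.<pid>/ (see the
    log above), so if that mount fills up, writes to memory-mapped
    files there can fail with SIGBUS (a bus error)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total, usage.free

# Example on the root filesystem (always present); on the cluster
# you would point this at /dev/shm instead.
frac, free = shm_usage("/")
print(f"{frac:.1%} used, {free / 1e9:.1f} GB free")
```

Watching this value while the image_full_ampphase_di step runs would confirm or rule out the shared-memory-exhaustion hypothesis.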
I should have said: it happens at the image_full_ampphase_di step.
For the run I'm doing, I'm seeing the image get a bit worse than with the previous software at the image_phase step (and the image_bootstrap step). So something with the DDS0_smoothed solutions. I'm trying a run now to compare just the DDS0 solutions before the smoothing.
The image_dirin_SSD_c_m_di images were pretty much identical for me though.
Hi for me it's the opposite (left : before, right: now)
And it's normal that it's better: the clustering used to be regular at this stage (and it should be using the cluster file). I commented on that elsewhere.
Did you take the same cluster file in both cases? And did you regenerate everything from the beginning with the latest singularity image, or from a given step?
Yeah, I have exactly the same cluster file and the runs are completely independent.
Where can I take a look at your run to see if I can spot any differences?
Hi both,
For what it's worth my test of P182+85 has finished and the quality is more or less the same between the two runs (note that they don't use the same cluster file).
In /beegfs/car/mjh/P182+85 and P182+85-new .
Cheers
Martin
Ah, I think I sneakily found your runs @cyriltasse. Hmm, your old run (assuming it's TestMartinRerun_old) does look pretty bad for some reason. It looks much worse than the run with the old software that Martin did (i.e. /beegfs/car/mjh/P182+85). Not sure if you made any changes from the regular processing for the image_phase1 map in TestMartinRerun_old?
Yeah - it's in TestMartinRerun_old
and TestMartinRerun
Weird - the only thing I didn't do is the bootstrapping, which I skipped in both cases.
Are you free today for a chat? We can perhaps go through and take a look at things.
I'm around except 1-2 CEST and from 5 CEST.
I'm available after 14:00 and before 17:30 CEST
So perhaps 14:30 CEST?
Perfect.
For the record - this is the residual at the image_phase1 step (A+P applied, no weights, no smoothing), old vs new, on a single MS, same clustering.
The new one has less extreme errors, but more systematics (this is what we both see on the 6-MS, smoothed, weighted pipeline versions). Now running with the same DicoModel. It's probably something with the Kalman filter. I'll dig.
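For comparisons like this, a simple numeric summary of the two residual images can complement the eyeball check: the RMS captures overall noise while the peak-to-RMS ratio gives a rough feel for "less extreme errors but more systematics". This is a hypothetical helper, assuming the residuals have been loaded as 2-D numpy arrays (e.g. via astropy.io.fits); the synthetic arrays below just stand in for the old/new maps.

```python
import numpy as np

def residual_stats(img):
    """Summarise a residual image: RMS for the overall noise level,
    peak |value| for the most extreme error, and their ratio as a
    crude indicator of structure sticking out above the noise."""
    rms = float(np.sqrt(np.mean(img ** 2)))
    peak = float(np.max(np.abs(img)))
    return {"rms": rms, "peak": peak, "peak_over_rms": peak / rms}

# Stand-ins for the two maps: "old" is noisier overall, "new" is
# quieter but has a small systematic patch added on top.
rng = np.random.default_rng(0)
old = rng.normal(0.0, 2e-4, (256, 256))
new = rng.normal(0.0, 1e-4, (256, 256))
new[100:110, 100:110] += 5e-4
print(residual_stats(old))
print(residual_stats(new))
```

On the synthetic example, the "new" map has the lower RMS but the higher peak-to-RMS ratio, mirroring the qualitative description above.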
Hey,
I think my results are consistent with yours which is nice :)
Here, left is the new software and right is the old software. Both images are the image_phase dirty (well, residual, since we give it the DicoModel) with the weights set to None and the solutions unsmoothed.
My runs are in: /beegfs/car/shimwell/Cyril-tests. You can see a folder for the OLD and NEW software runs.
Both were run through the pipeline up to image_phase.
There are also the:
image_phase1_merged - not using smoothed solutions
image_phase1_merged_noweight - not using smoothed solutions and Weight-ColName=None
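Aside: the reason the "dirty" image becomes a residual when a DicoModel is supplied is the linearity of the imaging step. Schematically (a toy 1-D numpy sketch using an inverse FFT as the imaging operator, not the actual DDFacet gridder), imaging the data-minus-model visibilities gives the same map as subtracting the two dirty images:

```python
import numpy as np

# Toy 1-D example: imaging is a linear map (here, an inverse FFT),
# so dirty(data - model_vis) == dirty(data) - dirty(model_vis).
rng = np.random.default_rng(1)
vis_data = rng.normal(size=64) + 1j * rng.normal(size=64)
vis_model = rng.normal(size=64) + 1j * rng.normal(size=64)

def dirty(vis):
    """Stand-in imaging operator: inverse FFT, keeping the real part."""
    return np.fft.ifft(vis).real

residual = dirty(vis_data - vis_model)
difference = dirty(vis_data) - dirty(vis_model)
print(np.allclose(residual, difference))
```

So subtracting the model in visibility space before imaging yields the residual map directly, which is what the runs above are looking at.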
I found a difference between the old and the new pipeline; I'll rerun to see if that explains it...
Ok I think this was the issue! The image_phase1
images look very similar now (with a slight advantage to the new version). I'm recompiling the image, I'll let you know when it's done
Wow, that's great!
Ok, it's done compiling the dev image; you can use ddf_dev.sif
You have to start from scratch though (removing all columns, and using your reference cluster file for a fair comparison).
Still on the pipeline tests: