phase-only solver regression in 1.5.0

o-smirnov commented 4 years ago

Reported here: https://github.com/caracal-pipeline/caracal/issues/1109

I already have a solution, so filing this just to keep the paperwork complete.

o-smirnov commented 4 years ago

So the regression was mainly due to a bug in the phase-only solver that I introduced in 1.5.0. It affects both 2-corr and 4-corr data (contrary to what I suggested earlier), but does not arise when a Jones chain is used (which is why @bennahugo did not see it: he was doing G+dE all along).

(@JSKenyon, this is the same bug you found with _gh_update. It seems we fixed it everywhere except the phase-only solver...)

The bug is now fixed on this branch. I will open a PR and make a release shortly; I'm just doing a bit more testing first.

In the process, I turned up other curious stuff with @paoloserra's dataset, which I will document below.

o-smirnov commented 4 years ago

Other interesting points and interactions of bugs and features:

There is some kind of RFI in Paolo's test MS, and I get the best results when I madmax it immediately (i.e. thresholds of 10 and 12 work better than 0,10 and 0,12!)
There is a bug in CubiCal 1.4.x which caused --madmax-threshold 0,10 --madmax-threshold 0,12 (as specified in Paolo's old command) to act like --madmax-threshold 10 --madmax-threshold 12 from the second tile onwards. This probably improved his "old" results a little bit, for the reason above.
If madmax is not done initially on the residuals (i.e. with 0,10 0,12 as opposed to 10 12), the very first iteration of some chunks diverges, and therefore stops. Paolo can see it in his old log for the first ~3 chunks (look for lines with "end solve", 1 iters).
If madmax is done initially on the residuals (i.e. 10 12), these "bad" chunks iterate to some kind of solution, I assume because the RFI is eliminated right away. Due to the above-mentioned bug, this was happening in Paolo's old run anyway, from the second tile onwards.
CubiCal 1.4.x stops iterating on divergent chunks, but does not flag them. CubiCal 1.5.0 has an option to flag them as well (@bennahugo I remember we put this code in for you, or with you...). The option is called --sol-flag-divergence. It is on by default, and in the case of Paolo's data it produces horrible results. If madmax is allowed to zap the RFI first, nothing diverges, so the results are fine.
Paolo's old command cut off the solver at 5 iterations. I've let it go on, and it wants to do at least 10, but this produces no discernible improvement to the images.
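
To make the 0,10 versus 10 distinction above concrete, here is a rough Python sketch. It is not CubiCal's actual code: the function names, the convention that the last threshold is reused for later iterations, and the 1.4826 scaling of the median absolute residual are all my assumptions for illustration. The only thing taken from the discussion above is that a leading 0 means "no madmax on the initial residuals".

```python
import numpy as np

def threshold_for_iteration(spec, iteration):
    """Pick the threshold for a given iteration from a list like [0, 10].

    Assumption (illustrative): entry i applies to iteration i, and the
    last entry is reused for all later iterations. A value of 0 disables
    flagging for that iteration.
    """
    return spec[min(iteration, len(spec) - 1)]

def mad_flags(residuals, threshold):
    """Flag residuals whose amplitude exceeds threshold * sigma_MAD.

    Assumption (illustrative): residuals are centred on zero, so the
    median of |r| stands in for the median absolute deviation, and
    1.4826 converts it to a Gaussian-equivalent sigma.
    """
    amp = np.abs(residuals)
    if threshold <= 0:
        # Threshold 0: flagging is skipped entirely for this iteration.
        return np.zeros(amp.shape, dtype=bool)
    sigma = 1.4826 * np.median(amp)
    return amp > threshold * sigma

# Toy residuals with one RFI-like outlier.
residuals = np.array([1.0, -1.2, 0.8, -0.9, 1.1, 50.0])

# "0,10": nothing is flagged on the initial residuals, so the outlier
# is still present when the first solver iteration runs.
initial_flags_0_10 = mad_flags(residuals, threshold_for_iteration([0, 10], 0))

# "10": the outlier is flagged before the first iteration.
initial_flags_10 = mad_flags(residuals, threshold_for_iteration([10], 0))
```

With the 0,10 setting the outlier survives into the first iteration, which is the situation in which some of Paolo's chunks diverged; with 10 it is removed up front.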

So, take-home messages:

(Once again), madmax proves he's your friend. Leave him on and let him do his thing. I.e. use 10 and 12, not 0,10 and 0,12.
Workaround until the CubiCal PR and bugfix release: try complex-2x2 with update-type phase-diag, and thresholds of 10 and 12. Or use 1.4.7.
--sol-flag-divergence should be used with caution. I think we should default it to off, @bennahugo. In this case it was making things worse (without an initial madmax).

On a side note, in the dual-corr case, the chi-squares reported by 1.5.x are exactly twice those reported by 1.4.x. @PeterKamphuis already noted this. I found the factor of 2 in the code, and I'm pretty convinced the 1.5.x scaling is the correct one.

o-smirnov commented 4 years ago

P.S. By horrible results, I mean 80% flagged data. The remaining 20% looks fine...

ratt-ru / CubiCal

phase-only solver regression in 1.5.0 #379