Poor solutions with new Solver for MG_cps_GDJones.py

Original comment thread migrated from bugzilla

at 2006-02-03 17:20:30 Oleg Smirnov replied:

Old Solver (cvs up -D "Jan 27") shows good convergence with a current version of GDJones (see the "allcorrs" bookmrk). With the new Solver, solutions are much worse.

at 2006-02-04 22:32:38 Oleg Smirnov replied:

OK, only took me a whole Saturday to find it, due to some interesting LSQFit behaviour (more on this below). The actual bug was a typo in the Vells math unary minus operation, which I had to re-implement when moving to the C99 _Complex type last week. This broke the Negate node (obscure and unused), and the MatrixInvert22 node (deservedly popular). So the solutions were good all along, it was the corrections that were wrong!

On a side note, I have to note that:

Jan, you used to have an "all math" tree that verified the operation of all function nodes. Where is it now and can you please make it into an assay script? At the risk of sounding like a broken record: here's another bug that would never have made it into CVS if we had a proper assay procedure in place.
"Circle plots" proved very useful for finding this bug. Without them, all we would have had to look at would have been slightly-more-noisy-than-expected maps, and it would have been very difficult to narrow down the cause...

Now here's the LSQFit wierdness. While tracking down this bug, I attempted to simplify things by building a tree for only 3 stations, and setting tile size to

This resulted in an under-constrained solution, dropping the rank from 25 to
However, with my new tiled solver code (which should have been functionally identical in this particular scenario, since there's no subtiling), the rank only dropped to 23, and convergence was worse. Interestingly enough, incremental solutions after iteration 1 were absolutely identical (which I verified by comparing 25 numbers down to the 10th digit...), after iteration 2 and 3 looked very close (I couldn't look at digits anymore, but the plots looked the same), but from iteration 4 the rank started dropping and that's when my new Solver node and the old Solver node diverged wildly.

One difference in the new Solver is the order in which "unknowns" are assigned (in the old Solver, it was more or less in "tree traversal order", in the new Solver it's in order of node index). Essentially this is just a reshuffling of the rows/columns in the equations matrix. So as an experiment I changed the new solver node to assign unknowns in the "old" order, and voila! the behaviour became once again identical.

This fits in with my earlier observation (see comment #4 in bug 346) that LSQFit is sensitive to the order in which equations are supplied when the solution is poorly constrained. It now seems that it is also sensitive to the order of the unknowns. Any comments?

at 2006-02-06 14:18:29 Wim Brouw replied:

LSQFit is not dependent on the order of the unknowns (just try some of the glish examples on-line). It could be that the normal equations are different (for one reason or another). Just output the 'known' and 'error' vectors and the norm equ, and compare them to see if all the same in the two cases. if so, maybe other things are different (like LMfactor use, see #346)

The rank deficiency should have disappeared if you followed my suggestions about how to solve (zero collinearity and no constraing stuff).

If indeed the band matrices contained completely independent blocks; the solutions of the full and the separate ones should be the same (just imagine the simplest blocks: only a diagonal). Again check the error, known and normequ with debug()

Can you also give the form of the condition equations? You talk about phase solutions; Jan talks about doing onl complex non-linear solutions (i.e. no ln (vis) taken). What are the equations?

at 2006-02-06 14:42:14 Oleg Smirnov replied:

LSQFit is not dependent on the order of the unknowns (just try some of the glish examples on-line). It could be that the normal equations are different (for one reason or another). Just output the 'known' and 'error' vectors and the norm equ, and compare them to see if all the same in the two cases. if so, maybe other things are different (like LMfactor use, see #346)

OK, I'll try logging this info in debugging mode. (It's hard to compare the equations though when the unknowns are reshuffled.) Everything else is absolutely identical, I've just changed the housekeeping code as follows:

there's a first loop which goes over all solvable parameters and determines the number of coefficients and subtiling for each (in this case there's no subtiling) and thus determines the number of unknowns needed to represent that parm.
A second loop goes over the collected information and allocates N unknowns per parm.

The only change was in the order of the second loop.

The rank deficiency should have disappeared if you followed my suggestions about how to solve (zero collinearity and no constraing stuff).

No, this is bug 376 and it's not implemented yet. Regardless of that, I think it's important we get to the bottom of this discrepancy.

If indeed the band matrices contained completely independent blocks; the solutions of the full and the separate ones should be the same (just imagine

No block-diagonal matrices here, I've reduced the tile size to 1 so there's only one block.

Can you also give the form of the condition equations? You talk about phase solutions; Jan talks about doing onl complex non-linear solutions (i.e. no ln (vis) taken). What are the equations?

Better check with Jan for the exact form, but I believe in this particular case they're complex non-linear. I.e. the measured visibilities are directly compared with Ji_v_Jj, where Ji is the Jones matrix for antenna i, represented as a matrix product of Gi_Di, which are built up from amplitude/phase parameters, i.e. gi(11)=a_exp(b), and we solve for a and b which are polynomials in freq. A similar scheme is used for 3C343.

at 2006-02-06 16:03:32 Oleg Smirnov replied:

For the record, I'm still seeing the occasional odd behaviour with MG_JEN_cps_GDJones. I'm now running it with Jan's default settings -- 4 stations, tilesize 10. This yields 177 unknowns per tile:

most of the time the rank drops to 175 after 4~5 iterations, and things converge in an orderly fashion after 7 iterations.
sometimes, over the exact same tile, the rank refuses to drop and stays at
Convergence takes longer, 12-13 iterations, and the incremental solutions show a characteristic "checkerboard" pattern indicating a few unknowns are bouncing back and forth around some central value. (The corrected data looks good in the end.)

How do we go about tracking this down? I can add extra debug information for the "before solve" stage (i.e. between makeNorm() and solve()) as Wim suggests.

at 2006-02-06 16:14:36 Wim Brouw replied:

The condition equations are complex. How are they fed into the normal equations?

at 2006-02-06 16:54:40 Oleg Smirnov replied:

Two makeNorm() calls per condeq, one for the real part, one for the imag part.

at 2006-03-14 14:28:03 Oleg Smirnov replied:

Closing this bug for now. When some new solver (mis)behaviour is observed, we can reopen it.

ratt-ru / meqtrees