calico-stefcal.py: dimensions of tensor child do not match when using LSM with all-dE sources

IanHeywood commented 8 years ago

Any idea what might cause this error?

node 'MT': dimensions of tensor child do not match

I'm using a VLA MS which has thus far behaved as expected. Trying to read in a pre-computed model from the MODEL_DATA column, and solve for dEs on a two-component LSM.

o-smirnov commented 8 years ago

Yeah, I've seen this happen with degenerate LSMs (ones with no time/freq axis): the internal optimizer collapses these axes and gets into trouble.

Simplest workaround is to include an LSM containing a single source of 1e-999 flux, with a spectral index.

IanHeywood commented 8 years ago

Thanks chief, I'll give it a go.

IanHeywood commented 8 years ago

Problem persists having added this to the LSM:

I thought it might be ignoring the sources since 1e-999 gets rounded down to zero, but even 1e-9 makes no difference.
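The rounding suspicion is easy to check in plain Python, independently of MeqTrees: 1e-999 is far below the smallest representable double, so it parses to exactly zero, whereas 1e-99 and 1e-9 are ordinary nonzero doubles. A minimal sketch, assuming (this is not confirmed anywhere in the thread) that the LSM reader stores fluxes as double-precision floats:

```python
import sys

# Candidate fluxes for the dummy source, as they would be typed into the LSM.
candidates = ["1e-999", "1e-99", "1e-9"]

# float() parses to IEEE 754 double precision; magnitudes below the smallest
# subnormal (roughly 4.9e-324) underflow to 0.0.
parsed = {text: float(text) for text in candidates}

for text, value in parsed.items():
    status = "underflows to zero" if value == 0.0 else "nonzero"
    print(f"{text}: {status}")

# Smallest positive normal double, for reference.
print("smallest normal double:", sys.float_info.min)
```

So a 1e-999 source really would be a zero-flux source under that assumption, but since 1e-9 behaves the same way in the run, the rounding cannot be the whole story.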

o-smirnov commented 8 years ago

Weird. Could you please publish the results of the DT and MT nodes, and open them up to look at dimensions of the vellsets within?

IanHeywood commented 8 years ago

Trying to publish things seems to be killing the meqserver, so I can't see any results propagating into the cache section. The dimensions of MT and DT in the request all appear to be the same.

IanHeywood commented 8 years ago

calico-wsrt-tens.py is as usual happy to chew this problem slowly.

o-smirnov commented 8 years ago

Hmmm, anywhere I can look at it live?

The MODEL_DATA column, it's got 4 correlations as usual?

Last thing to try: give Q=1e-99 to the dummy source. In fact, I should have suggested that from the beginning...

IanHeywood commented 8 years ago

MODEL_DATA is the same shape as DATA and CORRECTED. It's just the output written by wsclean.

Added Q value but problem persists. Problem goes away if I only solve for G and disable dE.

I'll try to recreate it on a Rhodes machine. I won't be able to use the browser to test it but you might be able to from there.

Cheers.

IanHeywood commented 7 years ago

On Elwood:

$ cd /home/ianh/Data/13B-308/dE_tests
$ python one_pass_cal.py

should reproduce the error.

I scp'd the MS from my local machine, and weirdly enough I had to apply the fix in meqtrees-cattery issue 34, as the script threw the following error:

###   000: node 'VisDataMux': execute() failed: TableMeasRefDesc error: old refcode Undefined does not exist anymore (return code 0x810021)

which it wasn't doing on my local run. I've used ms.copycol to overwrite the DATA column with the CORRECTED_DATA column prior to trying the stefcal run, but I wouldn't have thought that would be causing the trouble, particularly since calico-wsrt-tens.py is fine with it all.

Cheers.

o-smirnov commented 7 years ago

OK I see the issue. Workaround is to set "Apply diffgain to selected sources" and "Sources: =dE". Or in the conf file:

de_subset.subset_enabled = 1
de_subset.source_subset = =dE

You had the subset set to "all", which caused it to treat the DUMMY source as one with a dE on it too, thus making for an empty set of sources without a dE. Due to a bug in calico-stefcal.py, it does not fail gracefully in this situation.
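To make the failure mode concrete: with the subset set to "all", every source, DUMMY included, ends up in the dE group, and the group without a dE comes out empty. The sketch below is not the calico-stefcal.py code. The function `split_sources`, the tag convention, and the error text are all hypothetical, and it only illustrates the kind of explicit check that would replace the tensor-dimension error with a readable message.

```python
def split_sources(sources, subset="all"):
    """Partition source names into (with_de, without_de).

    `sources` maps a source name to its set of tags. `subset` is either
    "all" (apply dE to every source) or "=<tag>" (only sources carrying
    that tag). Every name here is hypothetical, for illustration only.
    """
    if subset == "all":
        with_de = list(sources)
    elif subset.startswith("="):
        wanted = subset[1:]
        with_de = [name for name, tags in sources.items() if wanted in tags]
    else:
        raise ValueError(f"unsupported subset spec: {subset!r}")

    without_de = [name for name in sources if name not in with_de]
    if not without_de:
        # Fail early with a message the user can act on.
        raise ValueError(
            "no sources left without a dE; "
            "restrict the dE subset (e.g. '=dE')"
        )
    return with_de, without_de


# Two dE-tagged sources plus the untagged dummy, as in this thread.
lsm = {"A": {"dE"}, "B": {"dE"}, "DUMMY": set()}
print(split_sources(lsm, "=dE"))
```

Calling `split_sources(lsm, "all")` on the same model raises the `ValueError` instead of passing an empty group downstream.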

IanHeywood commented 7 years ago

It's running now, thanks!

Nice of you to fabricate a bug at the end there to make it look like it wasn't entirely user-error.

o-smirnov commented 7 years ago

But it is a bug. User errors should result in at least mildly comprehensible error messages, which this one patently isn't... not sure I'll ever get around to fixing it since a workaround exists, but at least I'm keeping it filed as a bug.

IanHeywood commented 7 years ago

OK, just to make life interesting... looping over SPWs using the same Tigger LSM + contents of MODEL_DATA to calibrate against:

SPW0: Fine
SPW1: Fine
SPW2: Fine
SPW3: Dimensions of tensor child do not match...

WTF...?

o-smirnov commented 7 years ago

Deja vu... can has upload please?

IanHeywood commented 7 years ago

On Elwood:

$ cd /home/ianh/Data/13B-308/dE_tests2/
$ python per_scan_per_band_calibration.py

This will try to calibrate the SPWs for the first block of scans sequentially, and will fail when it gets to SPW 3 (DATA_DESC_ID==3). If you edit the script so that dryrun = True and then run it, the terminal output should show you the steps it takes without actually running anything.

For the SPWs that run successfully it grumbles about diverging chi^2 values, but I haven't optimised anything yet; I'm just trying to get it to swallow the problem. You can see what my approach is by looking at the last few lines of setup_dE_model.py. Basically, I'm trying to have wsclean take care of modelling everything except the problem sources, which are excluded from MODEL_DATA and replaced by a component model to which dEs are applied.

As an aside: when I'm looping over mqt.run invocations like this, is there a way to note a failure and move on, rather than having it just die and kill all subsequent runs? All I can think of is having one script spawning another. Doesn't seem very pretty, but as you know I'm not above that sort of shoddy behaviour.

Thanks again.
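On the aside about surviving a failed run: the one-script-spawning-another idea can be kept fairly tidy with the standard `subprocess` module, which puts each run in its own process so that a crash is recorded instead of ending the loop. A minimal sketch, assuming a hypothetical worker script `calibrate_one_spw.py` that takes the SPW number as its only argument and exits nonzero on failure. Neither that script nor its interface comes from the scripts on Elwood.

```python
import subprocess
import sys


def worker_command(spw):
    """Build the command line for one SPW (hypothetical worker script)."""
    return [sys.executable, "calibrate_one_spw.py", str(spw)]


def run_all(commands):
    """Run each command in its own process; return {label: return code}."""
    results = {}
    for label, cmd in commands.items():
        # check=False: a nonzero exit code is recorded rather than raised,
        # so one failed run does not stop the remaining ones.
        proc = subprocess.run(cmd, check=False)
        results[label] = proc.returncode
    return results


def failed_labels(results):
    """Return the labels whose process exited with a nonzero code."""
    return [label for label, code in results.items() if code != 0]
```

Usage might look like `results = run_all({spw: worker_command(spw) for spw in spws})` followed by `print(failed_labels(results))`, where `spws` is whatever list of SPW numbers the outer script loops over.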

IanHeywood commented 7 years ago

Having told it to skip SPW3 I note that it also fails on SPW14.
