Prohibit auto-init of BITFLAG and changing of FLAG in the same run? (I.e. only allow it to change FLAGs when it knows a proper BITFLAG column is in place.) That may be the safest, but it does mean running cubical twice (at least the first time 'round) before any solver/madmax flags can kick in.
Actually, how about a compromise? Clearly, the dangerous part is initializing BITFLAG and fiddling with FLAG in the same run, as this can leave the MS in a funky state if interrupted.
So let's forget this on-the-fly bitflag auto-init shizzle. Worst idea since Dave & Georgie got pissed and decided to throw a referendum.
Instead: if the MS is missing a BITFLAG column, the first thing it does is sit down and initialize it from FLAG. Only if this is successful does it proceed to solve and write flags in the normal way.
What this means is a bit of delay during the first run of a virgin MS, because it has to cycle through the MS twice. But then e.g. backing up the original FLAG column would incur the same kind of delay anyway.
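Roughly, the idea boils down to something like this (a minimal sketch using python-casacore, not the actual CubiCal code; `add_bitflag_column` is a hypothetical helper standing in for the column-creation step):

```python
from casacore.tables import table

def ensure_bitflag(ms_path, chunk=100000):
    """Sketch: make sure BITFLAG exists and is fully seeded from FLAG
    before any solving or flag-writing is allowed to start."""
    tab = table(ms_path, readonly=False)
    try:
        if "BITFLAG" not in tab.colnames():
            add_bitflag_column(tab)   # hypothetical helper that adds the integer column
            # First pass over the MS: seed the "legacy" bit from FLAG, chunk by chunk.
            for row0 in range(0, tab.nrows(), chunk):
                nrow = min(chunk, tab.nrows() - row0)
                legacy = tab.getcol("FLAG", row0, nrow).astype("int32")
                tab.putcol("BITFLAG", legacy, row0, nrow)
    finally:
        tab.close()
    # Only once this has completed do we proceed to solve and write flags as usual.
```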
BTW @IanHeywood, if you run in single-CPU mode, can you do us a favour, and routinely run it as
python -m cProfile -o <YOUROUTPUTNAME>.prof `which gocubical` <parameters...>
It shouldn't affect anything, apart from making it dump a profile file at the end. These are very useful to look at, to see where the CPU time is going in your particular setup, and if anything can be speeded up.
(Less useful in multi-CPU mode, I'm afraid...)
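(To read one of those dumps back afterwards, something along these lines with the standard-library pstats module does the trick; the filename is a placeholder:)

```python
import pstats

# Load the dump written by cProfile and show the 20 biggest cumulative-time entries.
p = pstats.Stats("YOUROUTPUTNAME.prof")
p.sort_stats("cumulative").print_stats(20)
```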
OK, just as a note to myself for now. This:
[tab.iscelldefined("BITFLAG",i) for i in xrange(nrow)]
is very quick (2s on a MeerKAT-sized MS), so it really costs nothing to check if BITFLAG is properly filled upfront. So the proposal in https://github.com/ratt-ru/CubiCal/issues/221#issuecomment-418191622 will incur relatively little overhead.
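(For reference, the timing above came from something along these lines with python-casacore; the MS path is a placeholder:)

```python
import time
from casacore.tables import table

tab = table("my.ms")                        # placeholder path
t0 = time.time()
# True only if every row of BITFLAG actually has a cell stored in it.
filled = all(tab.iscelldefined("BITFLAG", i) for i in range(tab.nrows()))
print("BITFLAG fully defined: %s (checked in %.1fs)" % (filled, time.time() - t0))
tab.close()
```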
Note that the issue-222 branch seemed to work just fine when I tried it on a different MS to the one I've been using for this thread. I'll try it with the original test case as soon as the IDIA systems get resurrected. Cheers.
Seems to be working fine on my original problem case, although the max post and prior error parameters are still set high (see https://github.com/ratt-ru/CubiCal/issues/222).
I tried to move on to the dE calibration and was met with this:
- 08:58:35 - data_handler [0.2/0.2 5.4/5.4 0.0Gb] Using 1 model(s) for 5 directions(s)
- 08:58:35 - data_handler [0.2/0.2 5.4/5.4 0.0Gb] model 0 (weight WEIGHT_SPECTRUM):
- 08:58:35 - data_handler [0.2/0.2 5.4/5.4 0.0Gb] direction 0: MODEL_DATA
- 08:58:35 - data_handler [0.2/0.2 5.4/5.4 0.0Gb] direction 1: DIR1
- 08:58:35 - data_handler [0.2/0.2 5.4/5.4 0.0Gb] direction 2: DIR2
- 08:58:35 - data_handler [0.2/0.2 5.4/5.4 0.0Gb] direction 3: DIR3
- 08:58:35 - data_handler [0.2/0.2 5.4/5.4 0.0Gb] direction 4: DIR4
- 08:58:35 - main [0.2/0.2 5.4/5.4 0.0Gb] subtraction directions set to slice(None, None, None)
- 08:58:35 - param_db [0.2/0.2 5.4/5.4 0.0Gb] creating 1529209968_sdp_l0_1284.full_pol_fg_wtspec_47Tuc_pcal.ms/G-field_0-ddid_None.parmdb in fragmented mode
- 08:58:35 - gain_machine [0.2/0.2 5.4/5.4 0.0Gb] G solutions will be saved to 1529209968_sdp_l0_1284.full_pol_fg_wtspec_47Tuc_pcal.ms/G-field_0-ddid_None.parmdb
- 08:58:35 - param_db [0.2/0.2 5.4/5.4 0.0Gb] creating 1529209968_sdp_l0_1284.full_pol_fg_wtspec_47Tuc_pcal.ms/de-field_0-ddid_None.parmdb in fragmented mode
- 08:58:35 - gain_machine [0.2/0.2 5.4/5.4 0.0Gb] de solutions will be saved to 1529209968_sdp_l0_1284.full_pol_fg_wtspec_47Tuc_pcal.ms/de-field_0-ddid_None.parmdb
- 08:58:35 - param_db [0.2/0.2 5.4/5.4 0.0Gb] creating 1529209968_sdp_l0_1284.full_pol_fg_wtspec_47Tuc_pcal.ms/BBC-field_0-ddid_None.parmdb in fragmented mode
- 08:58:35 - gain_machine [0.2/0.2 5.4/5.4 0.0Gb] will compute & save suggested baseline-based corrections (BBCs) to 1529209968_sdp_l0_1284.full_pol_fg_wtspec_47Tuc_pcal.ms/BBC-field_0-ddid_None.parmdb
- 08:58:35 - gain_machine [0.2/0.2 5.4/5.4 0.0Gb] (these can optionally be applied in a subsequent CubiCal run)
- 08:58:35 - main [0.2/0.2 5.4/5.4 0.0Gb] single-process, single-thread mode
- 08:58:35 - main [0.2/0.2 5.4/5.4 0.0Gb] defining chunks (time 588, freq 256)
- 08:58:36 - data_handler [0.3/0.3 5.5/5.5 0.0Gb] read indexing columns (4583691 total rows)
- 08:58:37 - data_handler [0.4/0.4 5.6/5.6 0.0Gb] built timeslot index (2347 unique timestamps)
- 08:58:37 - data_handler [0.4/0.4 5.6/5.6 0.0Gb] max chunk size is 588 timeslots and/or -- seconds
- 08:58:37 - data_handler [0.4/0.4 5.6/5.6 0.0Gb] computing time rebinning by 4 timeslots
- 08:58:50 - data_handler [0.5/1.4 5.7/6.6 0.0Gb] found 4 time chunks: 0 588 1176 1764 2347
- 08:58:50 - data_handler [0.5/1.4 5.7/6.6 0.0Gb] will rebin into 1146411 rows (587 rebinned timeslots)
- 08:58:50 - data_handler [0.5/1.4 5.7/6.6 0.0Gb] WARNING: output columns will be upsampled from time-binned data!
- 08:58:50 - data_handler [0.6/1.4 5.8/6.6 0.0Gb] generated 4 row chunks based on time and DDID
- 08:58:50 - data_handler [0.6/1.4 5.8/6.6 0.0Gb] row chunks yield 4 potential tiles
- 08:58:51 - data_handler [0.6/1.4 5.9/6.6 0.0Gb] coarsening this to 4 tiles (max 4 chunks per tile, based on 1/1 requested)
- 08:58:51 - data_handler [0.4/1.4 5.6/6.6 0.0Gb] tile #1/4: reading MS rows 0~1148363
- 09:01:46 - data_handler [55.1/59.3 88.9/93.3 0.0Gb] applying solvable baseline cutoff deselects 100019 rows
- 09:01:46 - data_handler [55.1/59.3 88.9/93.3 0.0Gb] 8.71% visibilities deselected via specificed subset and/or baseline cutoffs
- 09:01:50 - data_handler [58.0/59.3 106.4/106.4 0.0Gb] auto-filling BITFLAG/BITFLAG_ROW of shape (1148364, 1024, 4) from FLAG/FLAG_ROW
- 09:03:38 - main [67.5/89.4 115.2/137.1 0.0Gb] Exiting with exception: UnboundLocalError(local variable 'flag_arr' referenced before assignment)
Traceback (most recent call last):
File "/users/ianh/Software/CubiCal/cubical/main.py", line 360, in main
stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts)
File "/users/ianh/Software/CubiCal/cubical/workers.py", line 207, in run_process_loop
return _run_single_process_loop(ms, load_model, single_chunk, solver_type, solver_opts, debug_opts)
File "/users/ianh/Software/CubiCal/cubical/workers.py", line 316, in _run_single_process_loop
tile.load(load_model=load_model)
File "/users/ianh/Software/CubiCal/cubical/data_handler/ms_tile.py", line 735, in load
flag_arr[invalid] |= FL.INVALID
UnboundLocalError: local variable 'flag_arr' referenced before assignment
ianh@helo:~/Software/CubiCal$ git branch
* issue-222
Silly typo. I pushed a fix, try again please.
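(For anyone following along: the traceback above is the classic "assigned in only one branch" pattern. A toy reproduction, not the actual ms_tile.py code:)

```python
import numpy as np

def load_flags_buggy(use_bitflag, nvis=8):
    if use_bitflag:
        flag_arr = np.zeros(nvis, dtype=np.int32)
    # When use_bitflag is False, flag_arr was never assigned, so this line
    # raises UnboundLocalError -- the same failure as in the log above.
    flag_arr |= 1
    return flag_arr

def load_flags_fixed(use_bitflag, nvis=8):
    # Fix: make sure flag_arr exists on every code path before it is used.
    flag_arr = np.zeros(nvis, dtype=np.int32)
    if use_bitflag:
        flag_arr |= 2    # e.g. seed it from BITFLAG instead
    flag_arr |= 1
    return flag_arr
```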
It's running now, cheers.
OK, I have implemented the strategy suggested here: https://github.com/ratt-ru/CubiCal/issues/221#issuecomment-418191622
There is now a pre-init stage where it fills BITFLAGs if needed. This doesn't even take all that long: a full MK64 MS took a minute or so, and it's a one-off operation.
Hopefully, this solves the silliness reported in #211 properly and for good. @IanHeywood would you mind testing?
Not sure if this is related to the BITFLAG fix, but:
- 08:54:04 - main [1.4/2.6 4.9/5.0 8.0Gb] finished processing tile #34/62
- 08:54:04 - main [1.4/2.6 4.9/5.0 8.0Gb] waiting for I/O on tile #35/62
- 08:54:04 - main [1.4/2.6 4.9/5.0 8.0Gb] submitting solver jobs for tile #35/62
- 08:54:04 - main [io] [26.4/38.9 28.9/43.8 8.0Gb] saving tile #34/62
- 08:54:04 - data_handler [io] [26.4/38.9 28.9/43.8 8.0Gb] tile #34/62: saving MS rows 4907889~4950854
- 08:54:04 - data_handler [io] [26.4/38.9 29.3/43.8 8.0Gb] writing CORRECTED_DATA column
- 08:54:05 - data_handler [io] [26.9/38.9 29.4/43.8 8.0Gb] 30.59% visibilities flagged by solver: saving to BITFLAG and FLAG columns
- 08:54:05 - data_handler [io] [26.9/38.9 29.4/43.8 8.0Gb] updated BITFLAG column (30.59% visibilities flagged)
- 08:54:05 - data_handler [io] [26.9/38.9 29.4/43.8 8.0Gb] updated BITFLAG_ROW column (29.39% rows flagged)
- 08:54:05 - data_handler [io] [26.9/38.9 29.4/43.8 8.0Gb] updated FLAG column (30.59% visibilities flagged)
- 08:54:05 - data_handler [io] [26.9/38.9 29.4/43.8 8.0Gb] updated FLAG_ROW column (29.39% rows flagged)
- 08:54:07 - main [io] [26.4/38.9 28.9/43.8 6.4Gb] loading tile #36/62
- 08:54:07 - data_handler [io] [26.4/38.9 28.9/43.8 6.4Gb] tile #36/62: reading MS rows 5200839~5241851
- 08:54:08 - data_handler [io] [26.6/38.9 29.1/43.8 6.4Gb] applying solvable baseline cutoff deselects 3444 rows
- 08:54:08 - data_handler [io] [26.6/38.9 29.1/43.8 6.4Gb] 8.40% visibilities deselected via specificed subset and/or baseline cutoffs
- 08:54:10 - data_handler [io] [26.8/38.9 29.3/43.8 6.4Gb] 3.10% input visibilities flagged as invalid (0/inf/nan)
- 08:54:10 - data_handler [io] [26.8/38.9 29.4/43.8 6.4Gb] 11.50% input visibilities flagged and/or deselected
- 08:54:10 - data_handler [io] [27.5/38.9 30.4/43.8 7.1Gb] reading MODEL_DATA for model 0 direction 0
- 08:54:12 - data_handler [io] [27.9/38.9 30.4/43.8 7.5Gb] 10.58% visibilities flagged due to 0/inf/nan model
- 08:54:12 - main [io] [26.6/38.9 29.1/43.8 7.5Gb] I/O job(s) complete
- 08:56:01 - solver [x15] [9.4/12.7 12.8/16.0 7.5Gb] Solver for tile 34 chunk D0T34F1 failed with exception: float division by zero
- 08:56:01 - solver [x15] [9.4/12.7 12.8/16.0 7.5Gb] Traceback (most recent call last):
File "/home/ianh/Software/CubiCal/cubical/solver.py", line 783, in run_solver
corr_vis, stats = solver(vdm, soldict, label, sol_opts)
File "/home/ianh/Software/CubiCal/cubical/solver.py", line 555, in solve_and_correct
_, stats = _solve_gains(vdm.gm, vdm.weighted_obser, vdm.weighted_model, vdm.flags_arr, sol_opts, label=label)
File "/home/ianh/Software/CubiCal/cubical/solver.py", line 265, in _solve_gains
madmax.beyond_thunderdome(resid_arr, obser_arr, model_arr, flags_arr, thr1, thr2, "{} iter {} ({})".format(label, num_iter, gm.jones_label))
File "/home/ianh/Software/CubiCal/cubical/madmax/flagger.py", line 235, in beyond_thunderdome
"baseline-based Mad Max ({} sigma)".format(threshold), max_label)
File "/home/ianh/Software/CubiCal/cubical/madmax/flagger.py", line 156, in report_carnage
figure = plots.make_dual_absres_plot(absres, flags_arr!=0, baddies, p, q, self.metadata, subplot_titles)
File "/home/ianh/Software/CubiCal/cubical/madmax/plots.py", line 27, in make_dual_absres_plot
pylab.colorbar()
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/matplotlib/pyplot.py", line 2328, in colorbar
ret = gcf().colorbar(mappable, cax = cax, ax=ax, **kw)
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/matplotlib/figure.py", line 2071, in colorbar
cb = cbar.colorbar_factory(cax, mappable, **cb_kw)
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/matplotlib/colorbar.py", line 1399, in colorbar_factory
cb = Colorbar(cax, mappable, **kwargs)
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/matplotlib/colorbar.py", line 945, in __init__
ColorbarBase.__init__(self, ax, **kw)
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/matplotlib/colorbar.py", line 327, in __init__
self.draw_all()
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/matplotlib/colorbar.py", line 349, in draw_all
self._process_values()
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/matplotlib/colorbar.py", line 705, in _process_values
b = self.norm.inverse(self._uniform_y(self.cmap.N + 1))
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/matplotlib/colors.py", line 1048, in inverse
return vmin * np.ma.power((vmax / vmin), val)
ZeroDivisionError: float division by zero
- 08:57:40 - solver [x18] [8.9/13.5 12.2/16.4 7.5Gb] D0T34F3 G: 20 iters, conv 0.00%, d/fl 3.23%, PGE 0.127, stall 5.40%, chi^2 1.459 -> 1.089 (1.094), noise 0.566 -> 0.566
- 08:57:40 - solver [x18] [8.9/13.5 12.2/16.4 7.5Gb] D0T34F3 Pretend-Mad Max took out 6160410 visibilities: 0 (0.00%) new data flags
- 08:57:53 - solver [x11] [8.8/13.5 12.2/16.4 9.0Gb] D0T34F2 G: 20 iters, conv 0.00%, d/fl 3.23%, PGE 0.107, stall 13.40%, chi^2 1.604 -> 1.116 (1.128), noise 0.204 -> 0.203
- 08:57:53 - solver [x11] [8.8/13.5 12.2/16.4 9.0Gb] D0T34F2 Pretend-Mad Max took out 548568 visibilities: 0 (0.00%) new data flags
- 08:57:57 - solver [x13] [8.9/12.7 12.2/16.0 9.0Gb] D0T34F0 G: 20 iters, conv 0.00%, d/fl 3.23%, PGE 0.0723, stall 30.23%, chi^2 4.462 -> 4.463 (3.8), noise 92.4 -> 92.4
- 08:57:57 - solver [x13] [8.9/12.7 12.2/16.0 9.0Gb] D0T34F0 Pretend-Mad Max took out 1537042 visibilities: 0 (0.00%) new data flags
- 08:58:06 - main [1.4/2.6 4.8/5.0 9.9Gb] Exiting with exception: ZeroDivisionError(float division by zero)
Traceback (most recent call last):
File "/home/ianh/Software/CubiCal/cubical/main.py", line 364, in main
stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts)
File "/home/ianh/Software/CubiCal/cubical/workers.py", line 205, in run_process_loop
return _run_multi_process_loop(ms, load_model, solver_type, solver_opts, debug_opts)
File "/home/ianh/Software/CubiCal/cubical/workers.py", line 274, in _run_multi_process_loop
stats = future.result()
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/concurrent/futures/_base.py", line 455, in result
return self.__get_result()
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/concurrent/futures/_base.py", line 414, in __get_result
raise exception_type, self._exception, self._traceback
ZeroDivisionError: float division by zero
Gah. Falling over inside matplotlib. Its log-scale image plots are proving to be a very fragile business. It was barfing over all-flagged or all-zero images before, so I put a check in for protection. But it looks like there's another failure mode still to be discovered...
Can I quickly log in and run it somewhere?
Sure, if you log in to Nash, and cd to:
/home/ianh/MIGHTEE/COSMOS_SPLIT/LO
you can run:
python run_cubical.py phasecal.parset pcal False
I set --madmax-plot = 0
and it ran to completion, so I don't think there's anything more sinister going on.
Nothing sinister, 100%. It's clearly the plotting code falling over on some odd bit of data somewhere.
I just need to put an exception handler around the plotting code so that cubical can carry on even if matplotlib shits the bed.
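Something along these lines, i.e. a catch-all wrapper so a plotting failure only costs us the diagnostic plot, not the run (a sketch only; the real call sits in cubical/madmax/flagger.py):

```python
import logging

log = logging.getLogger("madmax")

def safe_plot(plot_func, *args, **kwargs):
    """Run a plotting function, but never let a plotting failure kill the solver."""
    try:
        return plot_func(*args, **kwargs)
    except Exception as exc:
        # matplotlib can fail on degenerate data (all-flagged, all-zero, etc.);
        # log it and carry on rather than aborting the whole run.
        log.warning("diagnostic plot failed (%s); continuing without it", exc)
        return None
```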
Are you done with this test, chief? Can I proceed with it (plots disabled if necessary...)?
Cheers.
I think some piece of flag machinery somewhere continues to misbehave. Still on the issue-222 branch...
Here's the state of the parent MS:
I averaged 4x in frequency, imaged DATA to make a model, ran phase-only selfcal, imaged CORRECTED_DATA, saw that the map looked like a dog's breakfast, plotted the same baseline as above and here's what's there:
For now, I'm going to revert to master and disable bells and whistles.
OK, clearly it's cleared some flags (no pun intended). This is not the intended behaviour (no pun... oh nevermind.) Can I please repeat this experiment somewhere?
Sure, same steps as:
https://github.com/ratt-ru/CubiCal/issues/221#issuecomment-428390409
should do it. Although if you want I can delete the MS as they are and re-make the averaged ones. The current ones will AFAIK have the broken flags.
Yeah if you can re-make an averaged one, that'd be great. And give me a command for replicating the plot above.
Thanks, remaking now, will comment again when the new MS are cooked.
Raised by @IanHeywood here: https://github.com/ratt-ru/CubiCal/issues/211#issuecomment-417711651
I can see part of the problem. When auto-initializing BITFLAG from FLAG, I mark the BITFLAG column with an "in progress" keyword, which is cleared once a run completes successfully. The overly bright idea was that, if a run is interrupted while the column is in a half-filled state, cubical would then know to reinitialize it the next time it is run.
Due to a sheer stupid oversight on my part, the keyword is not cleared when cubical is run in single-CPU mode, so every run proceeds to reinitialize the column all over again. If one of those runs happens to flag 99% of the data (due to user error, or other bugs in madmax or solver flagging, for example), these flags then end up in the "legacy" flagset (rather than "cubical"), at which point we're all screwed.
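(For reference, the "in progress" marker is just a column keyword, so the mechanics are roughly as below. A sketch with python-casacore; the keyword name here is made up and not necessarily what CubiCal uses.)

```python
from casacore.tables import table

tab = table("my.ms", readonly=False)        # placeholder path

# Mark the column so an interrupted fill can be detected on the next run...
tab.putcolkeyword("BITFLAG", "AUTO_INIT_IN_PROGRESS", True)

# ... (the per-chunk fill of BITFLAG's legacy bit from FLAG happens here) ...

# ...and clear the marker only once the fill has completed successfully.
# The bug: this clearing step never ran in single-CPU mode, so every subsequent
# run thought the column was half-filled and re-initialized it from scratch.
tab.removecolkeyword("BITFLAG", "AUTO_INIT_IN_PROGRESS")
tab.close()
```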
OK, so fixing the bug and clearing the right keyword is simple enough, but I wonder if this whole design isn't misguided on my part. There's still the possibility of the user Ctrl+C'ing the first auto-initializing run, and leaving BITFLAG in a partly-filled state. What must cubical do on the next run? After the Ctrl+C, the MS is in the following state:
- BITFLAG is only partially filled: in the part that is filled, the "legacy" bit contains a copy of FLAG/FLAG_ROW as it was before the run, while the rest of the column is in an undefined state.
- The start of the FLAG column may be overflagged (since it contains a combination of "legacy" and "cubical" flags), while the rest of the column is still in its original state.
At this point, throwing away BITFLAG and re-populating it from FLAG (which the auto-init code does) is clearly the wrong thing to do. But what's the right thing?
- Bomb out with a fat error message and let the user recover manually? (But how?)
- Bomb out with a fat error message unless a special "proceed anyway" command-line option is given?
- Try to keep track of how many BITFLAG rows have been initialized, and do something smart? I'd prefer to do this, but trying to do something smart is what got me into this pickle in the first place...
- Prohibit auto-init of BITFLAG and changing of FLAG in the same run? (I.e. only allow it to change FLAGs when it knows a proper BITFLAG column is in place.) That may be the safest, but it does mean running cubical twice (at least the first time 'round) before any solver/madmax flags can kick in.
Thoughts, anyone?