Prohibit auto-init of BITFLAG and changing of FLAG in the same run? (I.e. only allow it to change FLAGs when it knows a proper BITFLAG column is in place.) That may be the safest, but it does mean running cubical twice (at least the first time 'round) before any solver/madmax flags can kick in.
Actually, how about a compromise? Clearly, the dangerous part is initializing BITFLAG and fiddling with FLAG in the same run, as this can leave the MS in a funky state if interrupted.
So let's forget this on-the-fly bitflag auto-init shizzle. Worst idea since Dave & Georgie got pissed and decided to throw a referendum.
Instead: if the MS is missing a BITFLAG column, the first thing it does is sit down and initialize it from FLAG. Only if this is successful does it proceed to solve and write flags in the normal way.
What this means is a bit of delay during the first run of a virgin MS, because it has to cycle through the MS twice. But then e.g. backing up the original FLAG column would incur the same kind of delay anyway.
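Roughly, the idea boils down to something like this (a minimal sketch using python-casacore, not the actual CubiCal code; `add_bitflag_column` is a hypothetical helper standing in for the column-creation step):

```python
from casacore.tables import table

def ensure_bitflag(ms_path, chunk=100000):
    """Sketch: make sure BITFLAG exists and is fully seeded from FLAG
    before any solving or flag-writing is allowed to start."""
    tab = table(ms_path, readonly=False)
    try:
        if "BITFLAG" not in tab.colnames():
            add_bitflag_column(tab)   # hypothetical helper that adds the integer column
            # First pass over the MS: seed the "legacy" bit from FLAG, chunk by chunk.
            for row0 in range(0, tab.nrows(), chunk):
                nrow = min(chunk, tab.nrows() - row0)
                legacy = tab.getcol("FLAG", row0, nrow).astype("int32")
                tab.putcol("BITFLAG", legacy, row0, nrow)
    finally:
        tab.close()
    # Only once this has completed do we proceed to solve and write flags as usual.
```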
BTW @IanHeywood, if you run in single-CPU mode, can you do us a favour, and routinely run it as
python -m cProfile -o <YOUROUTPUTNAME>.prof `which gocubical` <parameters...>
It shouldn't affect anything, apart from making it dump a profile file at the end. These are very useful to look at, to see where the CPU time is going in your particular setup, and if anything can be speeded up.
(Less useful in multi-CPU mode, I'm afraid...)
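(To read one of those dumps back afterwards, something along these lines with the standard-library pstats module does the trick; the filename is a placeholder:)

```python
import pstats

# Load the dump written by cProfile and show the 20 biggest cumulative-time entries.
p = pstats.Stats("YOUROUTPUTNAME.prof")
p.sort_stats("cumulative").print_stats(20)
```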
OK, just as a note to myself for now. This:
[tab.iscelldefined("BITFLAG",i) for i in xrange(nrow)]
is very quick (2s on a MeerKAT-sized MS), so it really costs nothing to check if BITFLAG is properly filled upfront. So the proposal in https://github.com/ratt-ru/CubiCal/issues/221#issuecomment-418191622 will incur relatively little overhead.
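(For reference, the timing above came from something along these lines with python-casacore; the MS path is a placeholder:)

```python
import time
from casacore.tables import table

tab = table("my.ms")                        # placeholder path
t0 = time.time()
# True only if every row of BITFLAG actually has a cell stored in it.
filled = all(tab.iscelldefined("BITFLAG", i) for i in range(tab.nrows()))
print("BITFLAG fully defined: %s (checked in %.1fs)" % (filled, time.time() - t0))
tab.close()
```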
Note that the issue-222 branch seemed to work just fine when I tried it on a different MS to the one I've been using for this thread. I'll try it with the original test case as soon as the IDIA systems get resurrected. Cheers.
Seems to be working fine on my original problem case, although the max post and prior error parameters are still set high (see https://github.com/ratt-ru/CubiCal/issues/222).
I tried to move on to the dE calibration and was met with this:
- 08:58:35 - data_handler [0.2/0.2 5.4/5.4 0.0Gb] Using 1 model(s) for 5 directions(s)
- 08:58:35 - data_handler [0.2/0.2 5.4/5.4 0.0Gb] model 0 (weight WEIGHT_SPECTRUM):
- 08:58:35 - data_handler [0.2/0.2 5.4/5.4 0.0Gb] direction 0: MODEL_DATA
- 08:58:35 - data_handler [0.2/0.2 5.4/5.4 0.0Gb] direction 1: DIR1
- 08:58:35 - data_handler [0.2/0.2 5.4/5.4 0.0Gb] direction 2: DIR2
- 08:58:35 - data_handler [0.2/0.2 5.4/5.4 0.0Gb] direction 3: DIR3
- 08:58:35 - data_handler [0.2/0.2 5.4/5.4 0.0Gb] direction 4: DIR4
- 08:58:35 - main [0.2/0.2 5.4/5.4 0.0Gb] subtraction directions set to slice(None, None, None)
- 08:58:35 - param_db [0.2/0.2 5.4/5.4 0.0Gb] creating 1529209968_sdp_l0_1284.full_pol_fg_wtspec_47Tuc_pcal.ms/G-field_0-ddid_None.parmdb in fragmented mode
- 08:58:35 - gain_machine [0.2/0.2 5.4/5.4 0.0Gb] G solutions will be saved to 1529209968_sdp_l0_1284.full_pol_fg_wtspec_47Tuc_pcal.ms/G-field_0-ddid_None.parmdb
- 08:58:35 - param_db [0.2/0.2 5.4/5.4 0.0Gb] creating 1529209968_sdp_l0_1284.full_pol_fg_wtspec_47Tuc_pcal.ms/de-field_0-ddid_None.parmdb in fragmented mode
- 08:58:35 - gain_machine [0.2/0.2 5.4/5.4 0.0Gb] de solutions will be saved to 1529209968_sdp_l0_1284.full_pol_fg_wtspec_47Tuc_pcal.ms/de-field_0-ddid_None.parmdb
- 08:58:35 - param_db [0.2/0.2 5.4/5.4 0.0Gb] creating 1529209968_sdp_l0_1284.full_pol_fg_wtspec_47Tuc_pcal.ms/BBC-field_0-ddid_None.parmdb in fragmented mode
- 08:58:35 - gain_machine [0.2/0.2 5.4/5.4 0.0Gb] will compute & save suggested baseline-based corrections (BBCs) to 1529209968_sdp_l0_1284.full_pol_fg_wtspec_47Tuc_pcal.ms/BBC-field_0-ddid_None.parmdb
- 08:58:35 - gain_machine [0.2/0.2 5.4/5.4 0.0Gb] (these can optionally be applied in a subsequent CubiCal run)
- 08:58:35 - main [0.2/0.2 5.4/5.4 0.0Gb] single-process, single-thread mode
- 08:58:35 - main [0.2/0.2 5.4/5.4 0.0Gb] defining chunks (time 588, freq 256)
- 08:58:36 - data_handler [0.3/0.3 5.5/5.5 0.0Gb] read indexing columns (4583691 total rows)
- 08:58:37 - data_handler [0.4/0.4 5.6/5.6 0.0Gb] built timeslot index (2347 unique timestamps)
- 08:58:37 - data_handler [0.4/0.4 5.6/5.6 0.0Gb] max chunk size is 588 timeslots and/or -- seconds
- 08:58:37 - data_handler [0.4/0.4 5.6/5.6 0.0Gb] computing time rebinning by 4 timeslots
- 08:58:50 - data_handler [0.5/1.4 5.7/6.6 0.0Gb] found 4 time chunks: 0 588 1176 1764 2347
- 08:58:50 - data_handler [0.5/1.4 5.7/6.6 0.0Gb] will rebin into 1146411 rows (587 rebinned timeslots)
- 08:58:50 - data_handler [0.5/1.4 5.7/6.6 0.0Gb] WARNING: output columns will be upsampled from time-binned data!
- 08:58:50 - data_handler [0.6/1.4 5.8/6.6 0.0Gb] generated 4 row chunks based on time and DDID
- 08:58:50 - data_handler [0.6/1.4 5.8/6.6 0.0Gb] row chunks yield 4 potential tiles
- 08:58:51 - data_handler [0.6/1.4 5.9/6.6 0.0Gb] coarsening this to 4 tiles (max 4 chunks per tile, based on 1/1 requested)
- 08:58:51 - data_handler [0.4/1.4 5.6/6.6 0.0Gb] tile #1/4: reading MS rows 0~1148363
- 09:01:46 - data_handler [55.1/59.3 88.9/93.3 0.0Gb] applying solvable baseline cutoff deselects 100019 rows
- 09:01:46 - data_handler [55.1/59.3 88.9/93.3 0.0Gb] 8.71% visibilities deselected via specificed subset and/or baseline cutoffs
- 09:01:50 - data_handler [58.0/59.3 106.4/106.4 0.0Gb] auto-filling BITFLAG/BITFLAG_ROW of shape (1148364, 1024, 4) from FLAG/FLAG_ROW
- 09:03:38 - main [67.5/89.4 115.2/137.1 0.0Gb] Exiting with exception: UnboundLocalError(local variable 'flag_arr' referenced before assignment)
Traceback (most recent call last):
File "/users/ianh/Software/CubiCal/cubical/main.py", line 360, in main
stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts)
File "/users/ianh/Software/CubiCal/cubical/workers.py", line 207, in run_process_loop
return _run_single_process_loop(ms, load_model, single_chunk, solver_type, solver_opts, debug_opts)
File "/users/ianh/Software/CubiCal/cubical/workers.py", line 316, in _run_single_process_loop
tile.load(load_model=load_model)
File "/users/ianh/Software/CubiCal/cubical/data_handler/ms_tile.py", line 735, in load
flag_arr[invalid] |= FL.INVALID
UnboundLocalError: local variable 'flag_arr' referenced before assignment
ianh@helo:~/Software/CubiCal$ git branch
* issue-222
Silly typo. I pushed a fix, try again please.
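(For anyone following along: the traceback above is the classic "assigned in only one branch" pattern. A toy reproduction, not the actual ms_tile.py code:)

```python
import numpy as np

def load_flags_buggy(use_bitflag, nvis=8):
    if use_bitflag:
        flag_arr = np.zeros(nvis, dtype=np.int32)
    # When use_bitflag is False, flag_arr was never assigned, so this line
    # raises UnboundLocalError -- the same failure as in the log above.
    flag_arr |= 1
    return flag_arr

def load_flags_fixed(use_bitflag, nvis=8):
    # Fix: make sure flag_arr exists on every code path before it is used.
    flag_arr = np.zeros(nvis, dtype=np.int32)
    if use_bitflag:
        flag_arr |= 2    # e.g. seed it from BITFLAG instead
    flag_arr |= 1
    return flag_arr
```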
It's running now, cheers.
OK, I have implemented the strategy suggested here: https://github.com/ratt-ru/CubiCal/issues/221#issuecomment-418191622
There is now a pre-init stage where it fills BITFLAGs if needed. This doesn't even take all that long: a full MK64 MS took a minute or so, and it's a one-off operation.
Hopefully, this solves the silliness reported in #211 properly and for good. @IanHeywood would you mind testing?
Not sure if this is related to the BITFLAG fix, but:
- 08:54:04 - main [1.4/2.6 4.9/5.0 8.0Gb] finished processing tile #34/62
- 08:54:04 - main [1.4/2.6 4.9/5.0 8.0Gb] waiting for I/O on tile #35/62
- 08:54:04 - main [1.4/2.6 4.9/5.0 8.0Gb] submitting solver jobs for tile #35/62
- 08:54:04 - main [io] [26.4/38.9 28.9/43.8 8.0Gb] saving tile #34/62
- 08:54:04 - data_handler [io] [26.4/38.9 28.9/43.8 8.0Gb] tile #34/62: saving MS rows 4907889~4950854
- 08:54:04 - data_handler [io] [26.4/38.9 29.3/43.8 8.0Gb] writing CORRECTED_DATA column
- 08:54:05 - data_handler [io] [26.9/38.9 29.4/43.8 8.0Gb] 30.59% visibilities flagged by solver: saving to BITFLAG and FLAG columns
- 08:54:05 - data_handler [io] [26.9/38.9 29.4/43.8 8.0Gb] updated BITFLAG column (30.59% visibilities flagged)
- 08:54:05 - data_handler [io] [26.9/38.9 29.4/43.8 8.0Gb] updated BITFLAG_ROW column (29.39% rows flagged)
- 08:54:05 - data_handler [io] [26.9/38.9 29.4/43.8 8.0Gb] updated FLAG column (30.59% visibilities flagged)
- 08:54:05 - data_handler [io] [26.9/38.9 29.4/43.8 8.0Gb] updated FLAG_ROW column (29.39% rows flagged)
- 08:54:07 - main [io] [26.4/38.9 28.9/43.8 6.4Gb] loading tile #36/62
- 08:54:07 - data_handler [io] [26.4/38.9 28.9/43.8 6.4Gb] tile #36/62: reading MS rows 5200839~5241851
- 08:54:08 - data_handler [io] [26.6/38.9 29.1/43.8 6.4Gb] applying solvable baseline cutoff deselects 3444 rows
- 08:54:08 - data_handler [io] [26.6/38.9 29.1/43.8 6.4Gb] 8.40% visibilities deselected via specificed subset and/or baseline cutoffs
- 08:54:10 - data_handler [io] [26.8/38.9 29.3/43.8 6.4Gb] 3.10% input visibilities flagged as invalid (0/inf/nan)
- 08:54:10 - data_handler [io] [26.8/38.9 29.4/43.8 6.4Gb] 11.50% input visibilities flagged and/or deselected
- 08:54:10 - data_handler [io] [27.5/38.9 30.4/43.8 7.1Gb] reading MODEL_DATA for model 0 direction 0
- 08:54:12 - data_handler [io] [27.9/38.9 30.4/43.8 7.5Gb] 10.58% visibilities flagged due to 0/inf/nan model
- 08:54:12 - main [io] [26.6/38.9 29.1/43.8 7.5Gb] I/O job(s) complete
- 08:56:01 - solver [x15] [9.4/12.7 12.8/16.0 7.5Gb] Solver for tile 34 chunk D0T34F1 failed with exception: float division by zero
- 08:56:01 - solver [x15] [9.4/12.7 12.8/16.0 7.5Gb] Traceback (most recent call last):
File "/home/ianh/Software/CubiCal/cubical/solver.py", line 783, in run_solver
corr_vis, stats = solver(vdm, soldict, label, sol_opts)
File "/home/ianh/Software/CubiCal/cubical/solver.py", line 555, in solve_and_correct
_, stats = _solve_gains(vdm.gm, vdm.weighted_obser, vdm.weighted_model, vdm.flags_arr, sol_opts, label=label)
File "/home/ianh/Software/CubiCal/cubical/solver.py", line 265, in _solve_gains
madmax.beyond_thunderdome(resid_arr, obser_arr, model_arr, flags_arr, thr1, thr2, "{} iter {} ({})".format(label, num_iter, gm.jones_label))
File "/home/ianh/Software/CubiCal/cubical/madmax/flagger.py", line 235, in beyond_thunderdome
"baseline-based Mad Max ({} sigma)".format(threshold), max_label)
File "/home/ianh/Software/CubiCal/cubical/madmax/flagger.py", line 156, in report_carnage
figure = plots.make_dual_absres_plot(absres, flags_arr!=0, baddies, p, q, self.metadata, subplot_titles)
File "/home/ianh/Software/CubiCal/cubical/madmax/plots.py", line 27, in make_dual_absres_plot
pylab.colorbar()
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/matplotlib/pyplot.py", line 2328, in colorbar
ret = gcf().colorbar(mappable, cax = cax, ax=ax, **kw)
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/matplotlib/figure.py", line 2071, in colorbar
cb = cbar.colorbar_factory(cax, mappable, **cb_kw)
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/matplotlib/colorbar.py", line 1399, in colorbar_factory
cb = Colorbar(cax, mappable, **kwargs)
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/matplotlib/colorbar.py", line 945, in __init__
ColorbarBase.__init__(self, ax, **kw)
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/matplotlib/colorbar.py", line 327, in __init__
self.draw_all()
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/matplotlib/colorbar.py", line 349, in draw_all
self._process_values()
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/matplotlib/colorbar.py", line 705, in _process_values
b = self.norm.inverse(self._uniform_y(self.cmap.N + 1))
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/matplotlib/colors.py", line 1048, in inverse
return vmin * np.ma.power((vmax / vmin), val)
ZeroDivisionError: float division by zero
- 08:57:40 - solver [x18] [8.9/13.5 12.2/16.4 7.5Gb] D0T34F3 G: 20 iters, conv 0.00%, d/fl 3.23%, PGE 0.127, stall 5.40%, chi^2 1.459 -> 1.089 (1.094), noise 0.566 -> 0.566
- 08:57:40 - solver [x18] [8.9/13.5 12.2/16.4 7.5Gb] D0T34F3 Pretend-Mad Max took out 6160410 visibilities: 0 (0.00%) new data flags
- 08:57:53 - solver [x11] [8.8/13.5 12.2/16.4 9.0Gb] D0T34F2 G: 20 iters, conv 0.00%, d/fl 3.23%, PGE 0.107, stall 13.40%, chi^2 1.604 -> 1.116 (1.128), noise 0.204 -> 0.203
- 08:57:53 - solver [x11] [8.8/13.5 12.2/16.4 9.0Gb] D0T34F2 Pretend-Mad Max took out 548568 visibilities: 0 (0.00%) new data flags
- 08:57:57 - solver [x13] [8.9/12.7 12.2/16.0 9.0Gb] D0T34F0 G: 20 iters, conv 0.00%, d/fl 3.23%, PGE 0.0723, stall 30.23%, chi^2 4.462 -> 4.463 (3.8), noise 92.4 -> 92.4
- 08:57:57 - solver [x13] [8.9/12.7 12.2/16.0 9.0Gb] D0T34F0 Pretend-Mad Max took out 1537042 visibilities: 0 (0.00%) new data flags
- 08:58:06 - main [1.4/2.6 4.8/5.0 9.9Gb] Exiting with exception: ZeroDivisionError(float division by zero)
Traceback (most recent call last):
File "/home/ianh/Software/CubiCal/cubical/main.py", line 364, in main
stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts)
File "/home/ianh/Software/CubiCal/cubical/workers.py", line 205, in run_process_loop
return _run_multi_process_loop(ms, load_model, solver_type, solver_opts, debug_opts)
File "/home/ianh/Software/CubiCal/cubical/workers.py", line 274, in _run_multi_process_loop
stats = future.result()
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/concurrent/futures/_base.py", line 455, in result
return self.__get_result()
File "/home/ianh/venv/meerkat/local/lib/python2.7/site-packages/concurrent/futures/_base.py", line 414, in __get_result
raise exception_type, self._exception, self._traceback
ZeroDivisionError: float division by zero
Gah. Falling over inside matplotlib. Its log-scale image plots are proving to be a very fragile business. It was barfing over all-flagged or all-zero images before, so I put a check in for protection. But it looks like there's another failure mode still to be discovered...
Can I quickly log in and run it somewhere?
Sure, if you log in to Nash, and cd to:
/home/ianh/MIGHTEE/COSMOS_SPLIT/LO
you can run:
python run_cubical.py phasecal.parset pcal False
I set --madmax-plot = 0
and it ran to completion, so I don't think there's anything more sinister going on.
Nothing sinister, 100%. It's clearly the plotting code falling over on some odd bit of data somewhere.
I just need to put an exception handler around the plotting code so that cubical can carry on even if matplotlib shits the bed.
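Something along these lines, i.e. a catch-all wrapper so a plotting failure only costs us the diagnostic plot, not the run (a sketch only; the real call sits in cubical/madmax/flagger.py):

```python
import logging

log = logging.getLogger("madmax")

def safe_plot(plot_func, *args, **kwargs):
    """Run a plotting function, but never let a plotting failure kill the solver."""
    try:
        return plot_func(*args, **kwargs)
    except Exception as exc:
        # matplotlib can fail on degenerate data (all-flagged, all-zero, etc.);
        # log it and carry on rather than aborting the whole run.
        log.warning("diagnostic plot failed (%s); continuing without it", exc)
        return None
```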
Are you done with this test, chief? Can I proceed with it (plots disabled if necessary...)?
Cheers.
I think some piece of flag machinery somewhere continues to misbehave. Still on the issue-222 branch...
Here's the state of the parent MS:
I averaged 4x in frequency, imaged DATA to make a model, ran phase-only selfcal, imaged CORRECTED_DATA, saw that the map looked like a dog's breakfast, plotted the same baseline as above and here's what's there:
For now, I'm going to revert to master and disable bells and whistles.
OK, clearly it's cleared some flags (no pun intended). This is not the intended behaviour (no pun... oh nevermind.) Can I please repeat this experiment somewhere?
Sure, same steps as:
https://github.com/ratt-ru/CubiCal/issues/221#issuecomment-428390409
should do it. Although if you want I can delete the MS as they are and re-make the averaged ones. The current ones will AFAIK have the broken flags.
Yeah if you can re-make an averaged one, that'd be great. And give me a command for replicating the plot above.
Thanks, remaking now, will comment again when the new MS are cooked.
Raised by @IanHeywood here: https://github.com/ratt-ru/CubiCal/issues/211#issuecomment-417711651
I can see part of the problem. When auto-initializing BITFLAG from FLAG, I mark the BITFLAG column with an "in progress" keyword, which is cleared once a run completes successfully. The overly bright idea was that, if a run is interrupted while the column is in a half-filled state, cubical would then know to reinitialize it the next time it is run.
Due to a sheer stupid oversight on my part, the keyword is not cleared when cubical is run in single-CPU mode, so every run proceeds to reinitialize the column all over again. If one of those runs happens to flag 99% of the data (due to user error, or other bugs in madmax or solver flagging, for example), these flags then end up in the "legacy" flagset (rather than "cubical"), at which point we're all screwed.
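(For reference, the "in progress" marker is just a column keyword, so the mechanics are roughly as below. A sketch with python-casacore; the keyword name here is made up and not necessarily what CubiCal uses.)

```python
from casacore.tables import table

tab = table("my.ms", readonly=False)        # placeholder path

# Mark the column so an interrupted fill can be detected on the next run...
tab.putcolkeyword("BITFLAG", "AUTO_INIT_IN_PROGRESS", True)

# ... (the per-chunk fill of BITFLAG's legacy bit from FLAG happens here) ...

# ...and clear the marker only once the fill has completed successfully.
# The bug: this clearing step never ran in single-CPU mode, so every subsequent
# run thought the column was half-filled and re-initialized it from scratch.
tab.removecolkeyword("BITFLAG", "AUTO_INIT_IN_PROGRESS")
tab.close()
```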
OK, so fixing the bug and clearing the right keyword is simple enough, but I wonder if this whole design isn't misguided on my part. There's still the possibility of the user Ctrl+C'ing the first auto-initializing run, and leaving BITFLAG in a partly-filled state. What must cubical do on the next run? After the Ctrl+C, the MS is in the following state:
- BITFLAG is only partially filled: in the part that is filled, the "legacy" bit contains a copy of FLAG/FLAG_ROW as it was before the run, while the rest of the column is in an undefined state.
- The start of the FLAG column may be overflagged (since it contains a combination of "legacy" and "cubical" flags), while the rest of the column is still in its original state.
At this point, throwing away BITFLAG and re-populating it from FLAG (which the auto-init code does) is clearly the wrong thing to do. But what's the right thing?
- Bomb out with a fat error message and let the user recover manually? (But how?)
- Bomb out with a fat error message unless a special "proceed anyway" command-line option is given?
- Try to keep track of how many BITFLAG rows have been initialized, and do something smart? I'd prefer to do this, but trying to do something smart is what got me into this pickle in the first place...
- Prohibit auto-init of BITFLAG and changing of FLAG in the same run? (I.e. only allow it to change FLAGs when it knows a proper BITFLAG column is in place.) That may be the safest, but it does mean running cubical twice (at least the first time 'round) before any solver/madmax flags can kick in.
Thoughts, anyone?