ratt-ru / CubiCal

A fast radio interferometric calibration suite.
GNU General Public License v2.0

Debug verbosity enters the big data era (oms-deploy branch) #122

Open IanHeywood opened 6 years ago

IanHeywood commented 6 years ago

I'm trying to phase-diag cal a MeerKAT MS using just the MODEL_DATA column. It was taking 1-1.5 hours, but I think my parallelism settings were somewhat sub-optimal.

However, having adjusted tile sizes, ncpu, nchunks, etc., now this happens:

 - 16:48:34 - main               [0.9/1.1 2.1/2.2 3.6Gb] waiting for I/O on tile #48
 - 16:55:48 - data_handler       [P1] [41.1/121.7 45.0/133.8 4.7Gb]   error reading BITFLAG column: not fatal, since we'll auto-fill it from FLAG
 - 16:57:00 - data_handler       [P1] [57.2/121.7 61.1/133.8 4.7Gb]     Traceback (most recent call last):
 - 16:57:00 - data_handler       [P1] [57.2/121.7 61.1/133.8 4.7Gb]       File "/home/heywoodl/Software/CubiCal_oms_deploy/CubiCal/cubical/data_handler.py", line 477, in load
 - 16:57:00 - data_handler       [P1] [57.2/121.7 61.1/133.8 4.7Gb]         self.bflagcol = self.handler.fetchslice("BITFLAG", self.first_row, nrows)
 - 16:57:00 - data_handler       [P1] [57.2/121.7 61.1/133.8 4.7Gb]       File "/home/heywoodl/Software/CubiCal_oms_deploy/CubiCal/cubical/data_handler.py", line 1339, in fetchslice
 - 16:57:00 - data_handler       [P1] [57.2/121.7 61.1/133.8 4.7Gb]         return self.data.getcol(column, startrow, nrows)
 - 16:57:00 - data_handler       [P1] [57.2/121.7 61.1/133.8 4.7Gb]       File "/home/heywoodl/venv/cubienv/local/lib/python2.7/site-packages/casacore/tables/table.py", line 951, in getcol
 - 16:57:00 - data_handler       [P1] [57.2/121.7 61.1/133.8 4.7Gb]         return self._getcol (columnname, startrow, nrow, rowincr)
 - 16:57:11 - data_handler       [P1] [73.4/121.7 77.2/133.8 4.7Gb]     RuntimeError: ArrayBase::Array(const IPosition&) - Negative shape shape

Followed by reams and reams of:

[-1116273835, 1029449706, -1101523772, 1028061854, -1096853788, -1092794441, 1049594776, 1025607708, 1042561272, 1027957979, -1097153166, -1111080803, -1116086869, -1090857573, 1051811122, 1046095755, 1051456673, 1004888611, -1101241686, -1106359013, 1048753462, -1101355968, 1032576210, 1035979426, 1047173560, -1109442490, -1103674417, -1103044524, 1049818214, 1004019204, -1130518330, -1113878234, -1114248614, -1099586560, -1102907006, -1103392366, 1048992711, 1015490252, 1024915610, -1102139353, -1123074174, 998972745, -1107030267, -1131504651, 1050055259, 1040965342, -1119130617, -1095653113, 1049448740, 1042866061, 1032891059, 1016882093, 1039034812, 1038641729, -1116411663, -1097118007, 1048401811, 1047026266, 1049575497, 1009169129, -1160076014, 1011659862, -1118989247, -1098196617, 1047871026, 1053409203, 1047943510, 1028832833, 1039575701, 1016848770, -1107982291, -1100508712, 1053176410, 1049086161, 1026229790, 1028870147, 1033250438, -1102447148, 1016730339, 1039085981, 1046650545, -1111456391, 1010041312, 1039907830, -1115884456, -1100794236, 1047222310, 1047138330, 1029887476, -1095866984, -1156912630, 1042796684, -1099406655, -1190581212, 1036856436, 1028031917, 1032981253, -1100583717, -1119837778, 1040467228, -1095242745, -1121862223, -1106888329, -1109691650, -1119103496, 1041966119, etc. etc.

In fact the log file is 17 GB.
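
For anyone who wants to look at a log like this without opening 17 GB in an editor, here is a minimal sketch that copies a log while cutting off oversized lines. It is not part of CubiCal; the function name `truncate_long_lines`, the 400-character limit and the file names in the comment are all made up for illustration. It reads with a bounded `readline` so a single enormous line is never held in memory.

```python
import io


def truncate_long_lines(src, dst, max_chars=400, chunk=65536):
    """Copy text from src to dst, cutting lines longer than max_chars.

    Returns the number of lines that were cut. Hypothetical helper,
    not part of CubiCal.
    """
    truncated = 0
    while True:
        # Read at most max_chars + 1 characters of the next line.
        piece = src.readline(max_chars + 1)
        if not piece:
            break  # end of file
        if piece.endswith("\n") or len(piece) <= max_chars:
            # The line fits (or is a final line without a newline).
            dst.write(piece if piece.endswith("\n") else piece + "\n")
            continue
        # The line is too long: keep the head and skip the remainder.
        dst.write(piece[:max_chars] + " ... [truncated]\n")
        truncated += 1
        while True:
            rest = src.readline(chunk)
            if not rest or rest.endswith("\n"):
                break
    return truncated


# Small in-memory demonstration; for a real log one would pass file objects,
# e.g. open("cubical.log") and open("cubical.trimmed.log", "w")
# (both names are placeholders).
demo_in = io.StringIO("short line\n" + "1029449706, " * 200 + "\nlast line")
demo_out = io.StringIO()
cut = truncate_long_lines(demo_in, demo_out, max_chars=40)
print(cut, "line(s) truncated")
print(demo_out.getvalue())
```

The trimmed copy keeps the timestamps and tracebacks readable while dropping the bulk of the integer dump.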

Anyone interested in this before I take off and nuke the site from orbit?

o-smirnov commented 6 years ago

I'd still like to see the log. Sounds like casacore tables are spewing something at us because of a weirdly shaped BITFLAG column.
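
One way to probe what is being spewed: the sketch below reinterprets a few of the dumped integers as IEEE 754 float32 bit patterns. This is only a guess at what the numbers are, and nothing in this thread confirms it; the helper name `int32_bits_to_float32` is made up for illustration. If the results come out as small, sensible-looking floats, that would hint that float data is being printed as 32-bit integers.

```python
import struct


def int32_bits_to_float32(value):
    """Reinterpret the bits of a signed 32-bit integer as a float32.

    Hypothetical helper for inspection only.
    """
    return struct.unpack("<f", struct.pack("<i", value))[0]


# First four values copied from the dump in the original report.
sample = [-1116273835, 1029449706, -1101523772, 1028061854]
decoded = [int32_bits_to_float32(v) for v in sample]

for raw, val in zip(sample, decoded):
    print(raw, "->", val)
```

For these four values the magnitudes all come out below 1, which is at least consistent with the guess, but it does not say which column or buffer they came from.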

Geez, how hard can it be to insert an MS column? :(

Running with --flags-reinit-bitflags 1 ought to set this puppy straight. But before you do, could you please set aside a copy of the MS (and the log file) so I can examine it?

IanHeywood commented 6 years ago

I'm scp-ing the MS and log file to Nash. It's 348 + 17 GB, though, so it might take a while.

I should add that the CubiCal run ends successfully, at least according to the log file. I'll re-run with the reinit-bitflags switch once the copy is complete.

IanHeywood commented 6 years ago

The log and MS are on Nash, in /home/ianh/verbose.