ratt-ru / tricolour

Holds an offline, MS direct version of the SDP online flagger.
Other
8 stars 4 forks source link

Need mutex around table writing #24

Closed bennahugo closed 5 years ago

bennahugo commented 5 years ago

Think the current locking system does not work properly with multiple scans running in parallel. It looks like there is a race on the lock acquisition:

tricolour - 2019-05-15 13:09:52,980 INFO - Adding field 'G332.31-0.36' scan 228 to compute graph for processing
tricolour - 2019-05-15 13:09:53,085 INFO - Adding field 'G332.65+0.35' scan 229 to compute graph for processing
tricolour - 2019-05-15 13:09:53,201 INFO - Adding field 'G331.95+0.35' scan 230 to compute graph for processing
tricolour - 2019-05-15 13:09:53,294 INFO - Adding field 'G331.24+0.35' scan 231 to compute graph for processing
tricolour - 2019-05-15 13:09:53,407 INFO - Adding field 'G330.89-0.36' scan 234 to compute graph for processing
tricolour - 2019-05-15 13:09:53,502 INFO - Adding field 'G331.59-0.36' scan 235 to compute graph for processing
tricolour - 2019-05-15 13:09:53,608 INFO - Adding field 'G332.31-0.36' scan 236 to compute graph for processing
tricolour - 2019-05-15 13:09:53,716 INFO - Adding field 'G332.65+0.35' scan 237 to compute graph for processing
tricolour - 2019-05-15 13:09:53,808 INFO - Adding field 'G331.95+0.35' scan 238 to compute graph for processing
tricolour - 2019-05-15 13:09:53,883 INFO - Adding field 'G331.24+0.35' scan 239 to compute graph for processing
tricolour - 2019-05-15 13:09:53,986 INFO - Adding field 'G330.89-0.36' scan 242 to compute graph for processing
[                                        ] | 2% Completed | 18min 51.6s/home/hugo/tricolour/tricolour/flagging.py:956: RuntimeWarning: Mean of empty slice
  avgvis = np.nanmean(vis_scratch[:, :, corr], axis=0)
[##                                      ] | 6% Completed | 22min 31.1sSuccessful read/write open of user-locked table msdir/1524852817.1GC.ms: 23 columns, 975697 rows
[#################################       ] | 84% Completed |  1hr 28min 34.2s
Unexpected error. Dropping you into pdb for a post-mortem.
Traceback (most recent call last):
  File "/home/hugo/venv/bin/tricolour", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/home/hugo/tricolour/tricolour/scripts/tricolour", line 5, in <module>
    tricolour.main()
  File "/home/hugo/tricolour/tricolour/__init__.py", line 495, in main
    dask.compute(write_computes)
  File "/home/hugo/venv/local/lib/python2.7/site-packages/dask/base.py", line 398, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/hugo/venv/local/lib/python2.7/site-packages/dask/threaded.py", line 76, in get
    pack_exception=pack_exception, **kwargs)
  File "/home/hugo/venv/local/lib/python2.7/site-packages/dask/local.py", line 462, in get_async
    raise_exception(exc, tb)
  File "/home/hugo/venv/local/lib/python2.7/site-packages/dask/local.py", line 230, in execute_task
    result = _execute_task(task, data)
  File "/home/hugo/venv/local/lib/python2.7/site-packages/dask/core.py", line 119, in _execute_task
    return func(*args2)
  File "/home/hugo/venv/local/lib/python2.7/site-packages/xarrayms/xarray_ms.py", line 245, in _chunk_putcols_np
    startrow=rs, nrow=rl)
  File "/home/hugo/venv/local/lib/python2.7/site-packages/xarrayms/table_proxy.py", line 71, in __call__
    return getattr(table, fn)(*args, **kwargs)
  File "/home/hugo/venv/local/lib/python2.7/site-packages/casacore/tables/table.py", line 1157, in putcol
    self._putcol(columnname, startrow, nrow, rowincr, value)
RuntimeError: ColumnSet::doLock: table /scratch/bhugo/G330_2/msdir/1524852817.1GC.ms should be locked when using UserLocking
> /home/hugo/venv/local/lib/python2.7/site-packages/casacore/tables/table.py(1157)putcol()
-> self._putcol(columnname, startrow, nrow, rowincr, value)

The fix should be simple - ensure only one process can acquire the lock from casacore? @sjperkins

bennahugo commented 5 years ago

This is still with the stable 0.2.0 branch

bennahugo commented 5 years ago

This is probably a bit more of a xarray-ms bug

bennahugo commented 5 years ago

I think you may have different threads of the same field attempting to write simultaneously because the scans are short

bennahugo commented 5 years ago

Alternative is to put it in a queue and have xarrayms have a single io thread?

o-smirnov commented 5 years ago

Alternative is to put it in a queue and have xarrayms have a single io thread?

Totally the way to go, at least for the writing....

sjperkins commented 5 years ago

@bennahugo Did the fix in https://github.com/ska-sa/xarray-ms/pull/31 succeed?