ratt-ru / dask-ms

Implementation of a dask/xarray dataset backed by a CASA MS
https://dask-ms.readthedocs.io
Other
19 stars 7 forks source link

Improve multi-ms functionality #121

Open JSKenyon opened 4 years ago

JSKenyon commented 4 years ago

Description

This is very much a non-urgent, wish list feature. As it stands, dask-ms will treat a multi-ms as a single monolithic ms. This means it is interpreted as having only a single main table which in turn sacrifices some potential parallelism. It would be ideal if dask-ms understood the multi-ms and handled some of its intricacies for the user. The reason why I believe this belongs in dask-ms is that doing this in an application then requires having an additional layer of pyrap.table calls in order to disentangle the internal partitioning. This just seems error prone/messy on the user end. That being said, it is possible to do this on the application side, so if that is going to be the paradigm, I can make it happen.

sjperkins commented 4 years ago

As it stands, dask-ms will treat a multi-ms as a single monolithic ms. This means it is interpreted as having only a single main table which in turn sacrifices some potential parallelism. It would be ideal if dask-ms understood the multi-ms and handled some of its intricacies for the user.

I understand that your perspective here is from a performance POV, especially in light of https://github.com/casacore/casacore/issues/1038 and https://github.com/casacore/python-casacore/pull/209.

However, I do want to point out that (as I understand it) the entire point of the Multi-MS is to present multiple MS's as a monolithic MS. This is also related to the virtual tables created by TAQL queries -- they're actually reference tables pointing back to an original table.

The reason why I believe this belongs in dask-ms is that doing this in an application then requires having an additional layer of pyrap.table calls in order to disentangle the internal partitioning. This just seems error prone/messy on the user end.

I'm concerned that this request is asking dask-ms to reverse engineer the intended virtual interface provided by CASA Tables in order to resolve performance problems. This would likely involve reproducing substantial internal CASA table logic, correctly.

Put differently, a multi-ms is baking a cake. To unbake the cake would be a Herculean task. I'd be interested in seeing the casacore maintainer's response to https://github.com/casacore/python-casacore/pull/209.

sjperkins commented 4 years ago

/cc @IanHeywood who was also interested in multi-ms's

JSKenyon commented 4 years ago

However, I do want to point out that (as I understand it) the entire point of the Multi-MS is to present multiple MS's as a monolithic MS. This is also related to the virtual tables created by TAQL queries -- they're actually reference tables pointing back to an original table.

I don't 100% agree on this point. The aim was to allow for MMS aware casa tasks to make use of parallelism whilst everything else could treat it as a monolithic MS. See this PDF on the subject.

Essentially my request would be to make dask-ms MMS aware. I have tried an application-end test (very ugly, not perfect, almost certain there are some threading related things to work out) and see a decent (factor of x3) improvement in performance by reading in the sub-MSs directly.

I don't really believe that this even constitutes unbaking the cake - all it means is that a given row range lies in a specific subms. Whether the backing getcol points at the subms vs the monolithic ms seems like a relatively minor detail. Of course, I know that it is not that simple, and keeping track of all the subms tables could get difficult.

This is by no means a requirement, just thought it was worth putting my thoughts together in an issue.

JSKenyon commented 4 years ago

The performance factor is best taken with a pinch of salt as I think there are some discrepancies in my tests. But there definitely is an improvement.

sjperkins commented 4 years ago

I don't 100% agree on this point. The aim was to allow for MMS aware casa tasks to make use of parallelism whilst everything else could treat it as a monolithic MS. See this PDF on the subject.

Ah thanks for pointing that, I haven't worked with MMS much so my experience of them is limited.

I guess something like the following is currently possible (although python-casacore and the GIL is still an issue...)

from itertools import chain
from daskms import xds_from_ms

mms = ["ms01.ms", "ms02.ms", "ms03.m3"]

datasets = list(chain(map(xds_from_ms, mms)))

What exactly distinguishes a normal MS from a multi-MS?

This is by no means a requirement, just thought it was worth putting my thoughts together in an issue.

Yes, lets keep the discussion going.

sjperkins commented 4 years ago

What exactly distinguishes a normal MS from a multi-MS?

No need to reply, I'm reading the PDF which seems to describe it well.

miguelcarcamov commented 3 years ago

Hi everyone, is there any news on this? I am also interested on using mms files on dask-ms. Although I think that since dask-ms groups the columns by FIELD_ID and DATA_DESC_ID is not much different from using a MMS file. And the "partitioned" dask-ms dataset also can be processed in parallel, so what would be the big difference between reading a mms file and reading the non-mms file using dask-ms?

sjperkins commented 3 years ago

Hi everyone, is there any news on this? I am also interested on using mms files on dask-ms. Although I think that since dask-ms groups the columns by FIELD_ID and DATA_DESC_ID is not much different from using a MMS file. And the "partitioned" dask-ms dataset also can be processed in parallel, so what would be the big difference between reading a mms file and reading the non-mms file using dask-ms?

Hi @miguelcarcamov. I think the issue here is that:

  1. CASA Tables aren't thread safe and must therefore be accessed from a single thread. https://github.com/casacore/casacore/issues/1038
  2. python-casacore doesn't drop the Global Interpreter Lock (GIL) when accessing CASA Table data. https://github.com/casacore/python-casacore/pull/209. This is probably the most painful as it means that no compute happens while waiting for data.

I've done some work on a new pybind11 wrapper for CASA Tables to solve the above issues, but haven't touched it recently. https://github.com/ratt-ru/futurecasapy

In the long term, we're exploring newer formats for the Measurement Set v{2,3} spec. Earlier this year, zarr and parquet support for CASA Table-like data was added to dask-ms master branch -- these have proven highly performant on a supercomputer -- orders of magnitude faster compared to the current CASA Table system, although it is possible that it hasn't been optimally compiled.