JSKenyon opened this issue 4 years ago
As it stands, dask-ms treats a Multi-MS as a single monolithic MS. This means it is interpreted as having only a single main table, which in turn sacrifices some potential parallelism. It would be ideal if dask-ms understood the Multi-MS and handled some of its intricacies for the user.
I understand that your perspective here is from a performance POV, especially in light of https://github.com/casacore/casacore/issues/1038 and https://github.com/casacore/python-casacore/pull/209.
However, I do want to point out that (as I understand it) the entire point of the Multi-MS is to present multiple MSs as a monolithic MS. This is also related to the virtual tables created by TAQL queries -- they're actually reference tables pointing back to an original table.
The reason why I believe this belongs in dask-ms is that doing this in an application then requires an additional layer of pyrap.table calls in order to disentangle the internal partitioning. This just seems error-prone/messy on the user end.
I'm concerned that this request is asking dask-ms to reverse engineer the intended virtual interface provided by CASA Tables in order to resolve performance problems. This would likely involve reproducing substantial internal CASA table logic, correctly.
Put differently, a Multi-MS is a baked cake. Unbaking the cake would be a Herculean task. I'd be interested in seeing the casacore maintainers' response to https://github.com/casacore/python-casacore/pull/209.
/cc @IanHeywood, who was also interested in Multi-MSs
> However, I do want to point out that (as I understand it) the entire point of the Multi-MS is to present multiple MSs as a monolithic MS. This is also related to the virtual tables created by TAQL queries -- they're actually reference tables pointing back to an original table.
I don't 100% agree on this point. The aim was to allow MMS-aware CASA tasks to make use of parallelism, whilst everything else could treat it as a monolithic MS. See this PDF on the subject.
Essentially my request would be to make dask-ms MMS-aware. I have tried an application-end test (very ugly, not perfect, and almost certainly with some threading-related things to work out) and see a decent (roughly 3x) improvement in performance by reading in the sub-MSs directly (see the sketch at the end of this comment).
I don't really believe that this even constitutes unbaking the cake -- all it means is that a given row range lies in a specific sub-MS. Whether the backing getcol points at the sub-MS or the monolithic MS seems like a relatively minor detail. Of course, I know that it is not that simple, and keeping track of all the sub-MS tables could get difficult.
This is by no means a requirement; I just thought it was worth putting my thoughts together in an issue.
The performance factor is best taken with a pinch of salt, as I think there are some discrepancies in my tests. But there is definitely an improvement.
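For concreteness, here is a minimal sketch of that application-end approach, assuming the standard Multi-MS layout in which the sub-MSs live under a SUBMSS subdirectory of the .mms directory. The xds_from_mms helper and the *.ms glob pattern are my own illustration, not dask-ms API:

from itertools import chain
from pathlib import Path

from daskms import xds_from_ms

def xds_from_mms(mms_path):
    # Sub-MSs are ordinary Measurement Sets stored under <mms>/SUBMSS/,
    # so each one can be opened directly with xds_from_ms.
    sub_mss = sorted(str(p) for p in Path(mms_path, "SUBMSS").glob("*.ms"))
    # xds_from_ms returns a list of datasets per (sub-)MS;
    # flatten the per-sub-MS lists into a single list.
    return list(chain.from_iterable(xds_from_ms(s) for s in sub_mss))

datasets = xds_from_mms("obs.mms")

Since each sub-MS is opened as its own table, the reads are independent of the monolithic main table.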
> I don't 100% agree on this point. The aim was to allow MMS-aware CASA tasks to make use of parallelism, whilst everything else could treat it as a monolithic MS. See this PDF on the subject.
Ah, thanks for pointing that out -- I haven't worked with MMSs much, so my experience of them is limited.
I guess something like the following is currently possible (although python-casacore and the GIL are still an issue...):
from itertools import chain

from daskms import xds_from_ms

mss = ["ms01.ms", "ms02.ms", "ms03.ms"]

# xds_from_ms returns a list of datasets per MS, so the per-MS
# lists need flattening with chain.from_iterable.
datasets = list(chain.from_iterable(map(xds_from_ms, mss)))
What exactly distinguishes a normal MS from a multi-MS?
> This is by no means a requirement; I just thought it was worth putting my thoughts together in an issue.
Yes, let's keep the discussion going.
> What exactly distinguishes a normal MS from a multi-MS?
No need to reply, I'm reading the PDF which seems to describe it well.
Hi everyone, is there any news on this? I am also interested in using MMS files with dask-ms, although I think that, since dask-ms groups the columns by FIELD_ID and DATA_DESC_ID, it is not much different from using an MMS file. The "partitioned" dask-ms datasets can also be processed in parallel, so what would be the big difference between reading an MMS file and reading the non-MMS file with dask-ms?
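For context, the partitioning I'm referring to looks roughly like this (a sketch; obs.ms is an illustrative name, and passing group_cols explicitly just spells out the default):

from daskms import xds_from_ms

# xds_from_ms partitions the main table into one dataset per
# (FIELD_ID, DATA_DESC_ID) combination by default.
datasets = xds_from_ms("obs.ms", group_cols=["FIELD_ID", "DATA_DESC_ID"])

for ds in datasets:
    print(ds.attrs["FIELD_ID"], ds.attrs["DATA_DESC_ID"])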
> Hi everyone, is there any news on this? I am also interested in using MMS files with dask-ms, although I think that, since dask-ms groups the columns by FIELD_ID and DATA_DESC_ID, it is not much different from using an MMS file. The "partitioned" dask-ms datasets can also be processed in parallel, so what would be the big difference between reading an MMS file and reading the non-MMS file with dask-ms?
Hi @miguelcarcamov. I think the issue here is that reads go through python-casacore, which holds the GIL (and is subject to the casacore locking issues linked above), so the partitioned datasets from a single monolithic main table don't parallelise as well as independent sub-MSs could.
I've done some work on a new pybind11 wrapper for CASA Tables to solve the above issues, but haven't touched it recently. https://github.com/ratt-ru/futurecasapy
In the long term, we're exploring newer formats for the Measurement Set v{2,3} spec. Earlier this year, zarr and parquet support for CASA Table-like data was added to the dask-ms master branch -- these have proven highly performant on a supercomputer: orders of magnitude faster than the current CASA Table system, although it is possible that the latter wasn't optimally compiled.
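A minimal sketch of the zarr round trip, assuming the experimental module layout currently on master (obs.ms and obs.zarr are illustrative names):

import dask
from daskms import xds_from_ms
from daskms.experimental.zarr import xds_to_zarr, xds_from_zarr

# Read the MS into datasets, write them out as zarr, then read back.
datasets = xds_from_ms("obs.ms")
writes = xds_to_zarr(datasets, "obs.zarr")
dask.compute(writes)
zarr_datasets = xds_from_zarr("obs.zarr")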
Description
This is very much a non-urgent, wish-list feature. As it stands, dask-ms will treat a Multi-MS as a single monolithic MS. This means it is interpreted as having only a single main table, which in turn sacrifices some potential parallelism. It would be ideal if dask-ms understood the Multi-MS and handled some of its intricacies for the user. The reason why I believe this belongs in dask-ms is that doing this in an application then requires an additional layer of pyrap.table calls in order to disentangle the internal partitioning. This just seems error-prone/messy on the user end. That being said, it is possible to do this on the application side, so if that is going to be the paradigm, I can make it happen.