ratt-ru / dask-ms

Implementation of a dask/xarray dataset backed by a CASA MS
https://dask-ms.readthedocs.io

`dask-ms convert` from MS to Parquet is very slow #189

Closed: JSKenyon closed this issue 2 years ago

JSKenyon commented 2 years ago

Description

`dask-ms convert -o test.parquet -f parquet test.ms` can be incredibly slow, even on small datasets. The reason is that some subtables may have thousands of rows, and every subtable is currently grouped by `__row__`, which produces one dataset per row. This can lead to thousands of datasets and very poor performance (hours or more for a dataset of only a few GB).

The specific subtables I have noticed causing problems (though this isn't guaranteed to be a complete list) are HISTORY and POINTING. I do not believe either of these needs to be grouped by `__row__`, as their rows should not have variable shapes. The sketch below illustrates the cost of per-row grouping.
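
To make the cost concrete, here is a minimal sketch using dask-ms's `xds_from_table`. The `test.ms::POINTING` path follows dask-ms's subtable naming convention, and the row count is simply whatever the simulator produced:

```python
from daskms import xds_from_table

# Group the POINTING subtable by __row__, as `dask-ms convert`
# currently does for every subtable. Each row becomes its own
# dataset, so a POINTING table with thousands of rows yields
# thousands of datasets, each of which is written out separately.
per_row = xds_from_table("test.ms::POINTING", group_cols=["__row__"])
print(f"{len(per_row)} datasets")  # one per row

# Without grouping, the same subtable reads as a single dataset.
(single,) = xds_from_table("test.ms::POINTING")
```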

Verifying the slow behaviour is simple. I invoked simms with the following command:

```
simms -dec 30d0m0s -ra 0h0m0s -T vla -dt 2 -st 2 -nc 64 -f0 1712MHz -df 208kHz -pl RR RL LR LL -n test.ms
```

and then ran the conversion command above.

I implemented a rudimentary patch that differentiates between uniform and non-uniform subtables (simply by name), and this seemed to fix the problem. I was hoping to have a discussion on whether or not we can safely assume that subtables like those mentioned above are exempt from grouping by `__row__`. A sketch of the name-based approach follows.
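
For discussion purposes, here is a minimal sketch of the idea, not the actual patch; the set of names and the helper function are illustrative assumptions:

```python
from daskms import xds_from_table

# Hypothetical set of subtables assumed to have uniform row
# shapes, and therefore safe to read as a single dataset.
UNIFORM_SUBTABLES = {"HISTORY", "POINTING", "ANTENNA", "OBSERVATION"}

def read_subtable(ms, subtable):
    """Group by __row__ only for subtables whose rows may vary."""
    if subtable in UNIFORM_SUBTABLES:
        group_cols = None          # one dataset for the whole table
    else:
        group_cols = ["__row__"]   # one dataset per row
    return xds_from_table(f"{ms}::{subtable}", group_cols=group_cols)
```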

JSKenyon commented 2 years ago

I am going to posit a theory: only subtables which include a data description field in the MeasurementSet v2.0 memo need to be grouped by `__row__`. This essentially covers all subtables which self-describe their shapes, i.e. which may have variable shapes per row. If I am correct, the "nonuniform" subtables would be:

These tables shouldn't have a huge number of rows, i.e. grouping them by `__row__` shouldn't be a problem. A mechanical check of the theory is sketched below.
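
Rather than hard-coding names, one could in principle test the theory by inspecting each subtable's column descriptors. Here is a sketch using python-casacore; the assumption that fixed-shape array columns carry a `shape` entry in their descriptor is mine, not something established above:

```python
from casacore.tables import table

def has_variable_shapes(subtable_path):
    """Guess whether a subtable needs per-row (__row__) grouping
    by looking for array columns without a fixed shape."""
    t = table(subtable_path, ack=False)
    try:
        for name in t.colnames():
            desc = t.getcoldesc(name)
            # Array columns have a nonzero 'ndim'; fixed-shape
            # columns additionally record a concrete 'shape'.
            if desc.get("ndim", 0) != 0 and "shape" not in desc:
                return True
        return False
    finally:
        t.close()

print(has_variable_shapes("test.ms::POINTING"))
```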

JSKenyon commented 2 years ago

This should be fixed by #191. There is likely still room for improvement, but at least Parquet conversion now happens in a reasonable amount of time. A quick read-back check is sketched below.
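
For anyone wanting to sanity-check a converted dataset, a minimal sketch assuming dask-ms's experimental arrow API (`xds_from_parquet`):

```python
from daskms.experimental.arrow import xds_from_parquet

# Read the converted output back and inspect the resulting
# datasets (row counts and data variables).
datasets = xds_from_parquet("test.parquet")

for ds in datasets:
    print(ds.sizes.get("row"), list(ds.data_vars))
```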