Closed JSKenyon closed 2 years ago
I am going to posit a theory - only subtables which include a data description field in this memo need to be grouped by __row__
. This essentially includes all subtables which self-describe their shapes i.e. may have variable shapes per row. If I am correct, the "nonuniform" subtables would be:
SPECTRAL_WINDOW
POLARIZATION
FEED
SOURCE
These tables shouldn't have a huge number of rows i.e. grouping by __row__
shouldn't be a problem.
This should be fixed by #191. There is likely still room for improvement but at least parquet conversion happens in a reasonable amount of time now.
Description
dask-ms convert -o test.parquet -f parquet test.ms
can be incredibly slow, even on small datasets. The reason for this is that some of the subtables may have thousands of rows and all subtables are grouped by__row__
. This can lead to thousands of datasets and very poor performance (hours+ for a few GB dataset).The specific subtables which I have noticed cause a problem (but this isn't guaranteed to be a complete list) are
HISTORY
andPOINTING
. I do not believe that either of these need to be grouped by__row__
, as I believe they should not have variable shapes.Verifying the slow behaviour is simple. I invoked
simms
with the following command:and then ran the conversion command above.
I implemented a rudimentary patch to differentiate between uniform and non-uniform subtables (just using their names), and this seemed to fix the problem. I hoped to have a discussion on whether or not we can safely assume subtables like those mentioned above are exempt from ordering by
__row__
.