ratt-ru / xova

dask-ms/codex-africanus MS averaging tool

Prepare 0.1.2 #28

Closed bennahugo closed 1 year ago

bennahugo commented 1 year ago

Fixes many failing test cases and makes the UVW recompute fail gracefully when the test-case ephemeris polynomial is > 1

bennahugo commented 1 year ago

I think I will try to fix the REST issue in another PR... not sure where it comes from exactly -- maybe deep in daskms land

bennahugo commented 1 year ago

Hmm... the tests are not actually triggering because they are Travis remnants... will try to put in an Actions-based CI for this in this PR as well

bennahugo commented 1 year ago

Actually no... the test cases need casacore-data... will configure a Jenkins job for this

bennahugo commented 1 year ago

Struggling to get the tests to pass in a clean environment. The latest failure is:

               # We only need to pass in dimension extent arrays if
                # there is more than one chunk in any of the non-row columns.
                # In that case, we can putcol, otherwise putcolslice is required

                inlinable_arrays = [row_order]

                if (row_order.shape[0] != array.shape[0] or
                        row_order.chunks[0] != array.chunks[0]):
>                   raise ValueError(f"ROWID shape and/or chunking does "
                                     f"not match that of {column}")
E                   ValueError: ROWID shape and/or chunking does not match that of ANTENNA1

Will need to wait for next week while I focus on my PhD.

bennahugo commented 1 year ago

I'm really not sure how I got this passing on my production machine

xova/apps/xova/app.py:107: in execute
    main_writes = xds_to_table(output_ds, args.output, "ALL",
../venvxova/lib/python3.8/site-packages/daskms/dask_ms.py:96: in xds_to_table
    out_ds = write_datasets(table_name, xds, columns,
../venvxova/lib/python3.8/site-packages/daskms/writes.py:725: in write_datasets
    write_datasets = _write_datasets(table, tp, datasets, columns,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

table = '/tmp/tmpud07fd4u_averaged.ms'
table_proxy = TableProxy[/tmp/tmpud07fd4u_averaged.ms](table, /tmp/tmpud07fd4u_averaged.ms, ack=False,readonly=False,lockoptions=user,__executor_key__=/tmp/tmpud07fd4u_averaged.ms)
datasets = []
columns = ['ANTENNA1', 'ANTENNA2', 'ARRAY_ID', 'DATA', 'DATA_DESC_ID', 'EXPOSURE', ...]
descriptor = 'ms(False)', table_keywords = None, column_keywords = None

    def _write_datasets(table, table_proxy, datasets, columns, descriptor,
                        table_keywords, column_keywords):
        _, table_name, subtable = table_path_split(table)
        table_name = '::'.join((table_name, subtable)) if subtable else table_name
        row_orders = []

        # Put table and column keywords
        table_proxy.submit(_put_keywords, WRITELOCK,
                           table_keywords, column_keywords).result()

        # Sort datasets on (not has "ROWID", index) such that
        # datasets with ROWID's are handled first, while
        # those without (which imply appends to the MS)
        # are handled last
        sorted_datasets = sorted(enumerate(datasets),
                                 key=lambda t: ("ROWID" not in t[1].data_vars,
                                                t[0]))

        # Establish row orders for each dataset
        for di, ds in sorted_datasets:
            try:
                rowid = ds.ROWID.data
            except AttributeError:
                # Add operation
                # No ROWID's, assume they're missing from the table
                # and remaining datasets. Generate addrows
                # NOTE(sjperkins)
                # This could be somewhat brittle, but exists to
                # update MS empty subtables once they've been
                # created along with the main MS by a call to default_ms.
                # Users could also it to append rows to an existing table.
                # An xds_append_to_table may be a better solution...
                last_datasets = datasets[di:]
                last_row_orders = add_row_order_factory(table_proxy, last_datasets)

                # We don't inline the row ordering if it is derived
                # from the row sizes of provided arrays.
                # The range of possible dependencies are far too large to inline
                row_orders.extend([(False, lro) for lro in last_row_orders])
                # We have established row orders for all datasets
                # at this point, quit the loop
                break
            else:
                # Update operation
                # Generate row orderings from existing row IDs
                row_order = cached_row_order(rowid)

                # Inline the row ordering in the graph
                row_orders.append((True, row_order))

        assert len(row_orders) == len(datasets)

        datasets = []

        for (di, ds), (inline, row_order) in zip(sorted_datasets, row_orders):
            # Hold the variables representing array writes
            write_vars = {}

            # Generate a dask array for each column
            for column in columns:
                try:
                    variable = ds.data_vars[column]
                except KeyError:
                    log.warning("Ignoring '%s' not present "
                                "on dataset %d" % (column, di))
                    continue
                else:
                    full_dims = variable.dims
                    array = variable.data

                if not isinstance(array, da.Array):
                    raise TypeError("%s on dataset %d is not a dask Array "
                                    "but a %s" % (column, di, type(array)))

                args = [row_order, ("row",)]

                # We only need to pass in dimension extent arrays if
                # there is more than one chunk in any of the non-row columns.
                # In that case, we can putcol, otherwise putcolslice is required

                inlinable_arrays = [row_order]

                if (row_order.shape[0] != array.shape[0] or
                        row_order.chunks[0] != array.chunks[0]):
>                   raise ValueError(f"ROWID shape and/or chunking does "
                                     f"not match that of {column}")
E                   ValueError: ROWID shape and/or chunking does not match that of ANTENNA1

It seems the error stems from deep within daskms, possibly due to dask mishandling chunk shapes. The issue is not the same as with previous dask[array] versions, where a reduction over axis 1 of UVW fails (the uv distance computation).

Bit at wits' end with this, so I will commit what I have right now and get back to it when I get to my desktop -- maybe a pip freeze will show which dask versions to pin to.
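
For reference, here is a minimal sketch (using plain dask arrays rather than the daskms API; the column names and chunk sizes are made up) of the consistency check that raises the ValueError above, and the rechunking that would satisfy it:

import dask.array as da

# Hypothetical ROWID and ANTENNA1 columns of equal length but different row chunking.
rowid = da.arange(1000, chunks=250)                    # 4 row chunks of 250
antenna1 = da.zeros(1000, chunks=500, dtype="int32")   # 2 row chunks of 500

# Mirrors the check in daskms' _write_datasets shown above:
# equal shapes are not enough, the row chunking must match too.
if (rowid.shape[0] != antenna1.shape[0]
        or rowid.chunks[0] != antenna1.chunks[0]):
    print("would raise: ROWID shape and/or chunking does not match that of ANTENNA1")

# Rechunking the data column onto the ROWID chunking satisfies the check.
antenna1 = antenna1.rechunk(rowid.chunks)
assert rowid.chunks[0] == antenna1.chunks[0]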

bennahugo commented 1 year ago

Actually, I have it working. There is breakage that either @sjperkins or @JSKenyon introduced inside daskms since version 0.2.6 was released (I note there have been a lot of changes to MS output and chunking which could have caused this). I don't really have time to dig into daskms right now, but it suffices to work around the upstream issue by pinning to dask-ms==0.2.6.

bennahugo commented 1 year ago

Working daskms versions: 0.2.6, 0.2.7, 0.2.8, 0.2.9, 0.2.10, 0.2.11

Specifically, the breakage started appearing in 0.2.12.
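
For reference, a sketch of what the workaround pin might look like in xova's setup.py (the exact specifier and surrounding entries are assumptions, not the actual file):

# Hypothetical install_requires excerpt: pin dask-ms to the known-good range,
# since 0.2.12 introduces the ROWID chunking breakage described above.
install_requires = [
    "dask-ms >= 0.2.6, <= 0.2.11",
    # ... other dependencies unchanged ...
]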

bennahugo commented 1 year ago

retest this please

bennahugo commented 1 year ago

Alright, as discussed with @sjperkins, we are going to keep only Jenkins testing for now. I will open a separate PR just to test the install.

The long-term plan is to put in full qualification testing on this with real and simulated data (essentially automating what I've done for the memo) on both axes.