ratt-ru / dask-ms

Implementation of a dask/xarray dataset backed by a CASA MS
https://dask-ms.readthedocs.io

what's the best way to apply a channel slice? #109

Open · o-smirnov opened this issue 4 years ago

o-smirnov commented 4 years ago

Been asking elsewhere, but it occurs to me it's a general enough question to be usefully asked here....

So, what's the best way to read a subset of channels? I can apply a slice to the array objects of course, but is there a way to do this up front? Or a better way?

sjperkins commented 4 years ago

To do this with maximal efficiency, you'd have to use a pre-processing step to figure out the optimal chunking strategy for the channel dimension of each dataset:

from daskms import xds_from_ms, xds_from_table

# The desired channel selection, e.g.
chanslice = slice(37, 56)

# Initial dataset partition on FIELD_ID and DATA_DESC_ID
ddids = [ds.DATA_DESC_ID for ds in xds_from_ms("3C286.ms")]
# Read the very small DATA_DESCRIPTION table into memory
ddid = xds_from_table("3C286.ms::DATA_DESCRIPTION")[0].compute()
# Create a dataset per row of SPECTRAL_WINDOW
spws = xds_from_table("3C286.ms::SPECTRAL_WINDOW", group_cols="__row__")
# Number of channels for each dataset's spectral window
# (each per-row dataset has CHAN_FREQ of shape (1, nchan))
nchan = [spws[ddid.SPECTRAL_WINDOW_ID.data[i]].CHAN_FREQ.shape[1]
         for i in ddids]
# Channel chunking schema for each dataset:
# (channels before the slice, the slice itself, channels after).
# Drop any zero-sized entries if the slice touches an edge.
chan_chunks = [(chanslice.start - 0, chanslice.stop - chanslice.start, nc - chanslice.stop)
               for nc in nchan]

# Chunking schema for each dataset
chunks = [{'row': 100000, 'chan': cc} for cc in chan_chunks]

# Re-open the exact same datasets with a different chunking strategy
datasets = xds_from_ms("3C286.ms", chunks=chunks)

# This should slice the channel selection optimally, without dask block overlap
datasets[0].DATA[:, chanslice, :]

I typed this out without running it, but it should illustrate the idea.

The above is clunky; I'm thinking about general improvements to the process in https://github.com/ska-sa/dask-ms/pull/86
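
For reference, the recipe condenses to the following runnable sketch for the common single spectral window case. The MS name test.ms, the 64-channel assumption and the row chunking are illustrative, not taken from an actual measurement set:

import dask.array as da
from daskms import xds_from_ms

chanslice = slice(37, 56)
nchan = 64  # assumed: a single spectral window with 64 channels
chan_chunks = (chanslice.start, chanslice.stop - chanslice.start, nchan - chanslice.stop)

# A single chunking dict applies to every dataset in the MS
datasets = xds_from_ms("test.ms", chunks={'row': 100000, 'chan': chan_chunks})

# Slicing along the chunk boundary touches only the middle channel chunk
sel = datasets[0].DATA.data[:, chanslice, :]
mean_amp = da.absolute(sel).mean().compute()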

Mulan-94 commented 4 years ago

@sjperkins, I'm unable to follow properly. I usually just do dataset[0].sel(chan=slice(start, end)), and the chunking for the channel seems to be reconstructed automatically by dask. Is this the wrong way :( ?

sjperkins commented 4 years ago

> @sjperkins, I'm unable to follow properly. I usually just do dataset[0].sel(chan=slice(start, end)), and the chunking for the channel seems to be reconstructed automatically by dask. Is this the wrong way :( ?

It's not wrong, but that way will end up reading more data from the MS than necessary. What the example is trying to demonstrate is that, to prevent this behaviour, the dask chunking must be set up correctly in xds_from_ms(..., chunks={...}).

I agree that it's not necessarily easy to follow; I simply haven't found a reasonable way of making this easy yet. https://github.com/ska-sa/dask-ms/pull/86 should probably form the nucleus for dealing with this.
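
To connect the two approaches: once the datasets are opened with channel-aligned chunks, that same .sel only pulls the selected chunk from disk. A minimal sketch, reusing chanslice and the per-dataset chunks list from the example above, and assuming xarray-backed datasets as in the .sel call:

datasets = xds_from_ms("3C286.ms", chunks=chunks)
# this .sel now maps onto the middle channel chunk, so only that
# chunk's getcolslice runs at compute time
subset = datasets[0].sel(chan=chanslice)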

sjperkins commented 4 years ago

Let's look at this with a concrete example:

nchan = 64
chanslice = slice(37, 56)
chan_chunks = (37 - 0, 56 - 37, 64 - 56)  # == (37, 19, 8)

datasets = xds_from_ms(..., chunks={..., 'chan': chan_chunks})
ds = datasets[0]
ds.DATA.data[:, chanslice, :]

That will create DATA dask arrays with channel chunks of (37, 19, 8). Each channel chunk will be read from the MS exactly, using getcolslice.
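
One way to verify this is to inspect the dask chunk structure; a quick sketch (only the channel axis is shown, since row and correlation chunking depend on the MS):

ds = datasets[0]
print(ds.DATA.data.chunks[1])                   # (37, 19, 8)
print(ds.DATA.data[:, chanslice, :].chunks[1])  # (19,) -- a single channel chunk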

If, by contrast, we don't ask for channel chunking:

datasets = xds_from_ms(...)
ds = datasets[0]
ds.DATA.data[:, chanslice, :]

there'll be a single channel chunk in the dask DATA arrays, containing the full 64 channels. So behind the scenes there'll be a getcol operation that reads all the channel data, followed by a slice that returns the channel subset.
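
Again, the chunk structure makes the difference visible; under the same assumptions as the check above:

ds = datasets[0]
print(ds.DATA.data.chunks[1])                   # (64,) -- one chunk spanning all channels
# the slice still yields a (19,) chunk, but it is computed from
# a getcol that has already read all 64 channels
print(ds.DATA.data[:, chanslice, :].chunks[1])  # (19,)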

sjperkins commented 4 years ago

I also want to say it's good that people are pointing this out as a conceptually difficult thing, because I do want to make things easier and reduce the cognitive overhead.

o-smirnov commented 4 years ago

As an aging professor, I can only applaud the sentiment. My cognitive is all overhead!

Mulan-94 commented 4 years ago

> That will create DATA dask arrays with channel chunks of (37, 19, 8). Each channel chunk will be read from the MS exactly, using getcolslice.

This makes it clearer, thanks :) !