o-smirnov opened 4 years ago
To do this with maximal efficiency, you'd have to use a pre-processing step to figure out the optimal chunking strategy for the channel dimensions:
```python
from daskms import xds_from_ms, xds_from_table

# Desired channel selection
chanslice = slice(37, 56)

# Initial dataset partition on FIELD_ID and DATA_DESC_ID
ddids = [ds.DATA_DESC_ID for ds in xds_from_ms("3C286.ms")]
# Read the very small DATA_DESCRIPTION table into memory
ddid = xds_from_table("3C286.ms::DATA_DESCRIPTION")[0].compute()
# Create a dataset per row of SPECTRAL_WINDOW
spws = xds_from_table("3C286.ms::SPECTRAL_WINDOW", group_cols="__row__")
# Number of channels for each dataset
nchan = [spws[ddid.SPECTRAL_WINDOW_ID.values[i]].CHAN_FREQ.shape[1]
         for i in ddids]
# Channel chunking schema for each dataset:
# everything before the slice, the slice itself, everything after
chan_chunks = [(chanslice.start,
                chanslice.stop - chanslice.start,
                nc - chanslice.stop)
               for nc in nchan]
# Chunking schema for each dataset
chunks = [{'row': 100000, 'chan': cc} for cc in chan_chunks]
# Re-open the exact same datasets with a different chunking strategy
datasets = xds_from_ms("3C286.ms", chunks=chunks)
# This should slice the channel selection optimally, without dask block overlap
datasets[0].DATA[:, chanslice, :]
```
I typed this out without running it, but it should illustrate the idea.
The above is clunky; I'm thinking about general improvements to this process in https://github.com/ska-sa/dask-ms/pull/86
@sjperkins, I'm unable to follow properly. I usually just do `dataset[0].sel(chan=slice(start, end))`, and the chunking for the channel dimension seems to be reconstructed automatically by dask. Is this the wrong way :( ?
> @sjperkins, I'm unable to follow properly. I usually just do `dataset[0].sel(chan=slice(start, end))`, and the chunking for the channel dimension seems to be reconstructed automatically by dask. Is this the wrong way :( ?
It's not wrong, but that way will end up reading more data from the MS than necessary. What the example is trying to demonstrate is that, to prevent this behaviour, the dask chunking must be correctly set up in `xds_from_ms(..., chunks={...})`.
I agree that it's not necessarily easy to follow, I simply haven't found a reasonable way of making this easy yet. https://github.com/ska-sa/dask-ms/pull/86 should probably form the nucleus for dealing with this.
Let's look at this with a concrete example:
```python
nchan = 64
chanslice = slice(37, 56)
chan_chunks = (37, 19, 8)  # i.e. (37 - 0, 56 - 37, 64 - 56)
datasets = xds_from_ms(..., chunks={..., 'chan': chan_chunks})
ds.DATA.data[:, chanslice, :]
```
that will create DATA dask arrays with channel chunks of (37, 19, 8). Each channel chunk will be read from the MS exactly, using `getcolslice`.
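Working out a chunk tuple like (37, 19, 8) by hand is easy to get wrong. As a sketch (this helper is hypothetical, not part of dask-ms), the chunk boundaries can be derived from the slice itself:

```python
def chunks_for_slice(nchan, chanslice):
    """Hypothetical helper: channel chunking aligned with a contiguous slice.

    Returns a (before, selected, after) chunk tuple with empty chunks
    dropped, so the slice maps onto exactly one dask chunk.
    """
    start, stop, step = chanslice.indices(nchan)
    assert step == 1, "only contiguous slices are supported"
    return tuple(c for c in (start, stop - start, nchan - stop) if c > 0)

chunks_for_slice(64, slice(37, 56))  # -> (37, 19, 8)
```

The result can then be passed straight to `xds_from_ms(..., chunks={'row': ..., 'chan': chunks_for_slice(nchan, chanslice)})`.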
If, by contrast, we don't ask for channel chunking:
```python
datasets = xds_from_ms(...)
ds.DATA.data[:, chanslice, :]
```
there'll be a single channel chunk in the dask DATA arrays containing the full 64 channels. So, behind the scenes, there'll be a getcol operation that reads all the channel data, followed by a slice that returns the channel subset.
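The difference can be demonstrated with plain dask, no MS required. In this sketch, `read_chunk` is a purely illustrative stand-in for an MS read: with a single 64-channel chunk, computing the slice still forces a read of all 64 channels, while with chunks aligned to the slice only the selected channels are read, because dask never executes the tasks for chunks the output doesn't depend on.

```python
import dask
import dask.array as da
import numpy as np

reads = []  # records the channel range of each simulated MS read

def read_chunk(lo, hi):
    reads.append((lo, hi))
    return np.zeros((10, hi - lo))

def make_array(chunking):
    # A (10, 64) dask array whose channel chunks each trigger a
    # simulated read when computed
    offsets = np.cumsum((0,) + chunking)
    blocks = [da.from_delayed(dask.delayed(read_chunk)(int(lo), int(hi)),
                              shape=(10, int(hi - lo)), dtype=float)
              for lo, hi in zip(offsets[:-1], offsets[1:])]
    return da.concatenate(blocks, axis=1)

# Single 64-channel chunk: slicing still reads everything
make_array((64,))[:, 37:56].compute()
single_chunk_reads = list(reads)

# Chunks aligned with the slice: only the middle chunk is read
reads.clear()
make_array((37, 19, 8))[:, 37:56].compute()
aligned_reads = list(reads)

print(single_chunk_reads)  # [(0, 64)]
print(aligned_reads)       # [(37, 56)]
```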
I also want to say it's good that people are pointing this out as a conceptually difficult thing, because I do want to make things easier and reduce the cognitive overhead.
As an aging professor, I can only applaud the sentiment. My cognitive is all overhead!
> that will create DATA dask arrays with channel chunks of (37, 19, 8). Each channel chunk will read that channel chunk from the MS exactly using `getcolslice`.
This makes it clearer, thanks :) !
Been asking elsewhere, but it occurs to me it's a general enough question to be usefully asked here...
So, what's the best way to read a subset of channels? I can apply a slice to the array objects of course, but is there a way to do this up front? Or a better way?