Graph Optimisations - Githubissues

ratt-ru / dask-ms

Implementation of a dask/xarray dataset backed by a CASA MS


[x] Tests added / passed
```
$ py.test -v -s daskms/tests
```
If the pep8 tests fail, the quickest way to correct this is to run autopep8 and then flake8 and pycodestyle to fix the remaining issues.
```
$ pip install -U autopep8 flake8 pycodestyle
$ autopep8 -r -i daskms
$ flake8 daskms
$ pycodestyle daskms
```
[x] Fully documented, including HISTORY.rst for all changes and one of the docs/*-api.rst files for new API

To build the docs locally:
```
$ pip install -r requirements.readthedocs.txt
$ cd docs
$ READTHEDOCS=True make html
```

Prior to this PR, the getcol operations for a chunk of rows shared a number of common ancestors. In the case where grouping is performed on columns:

- A single common ancestor represents the rowids for the entire group.
- The getcol operations for each array share row runs for the same range of rowids.

[Figure: original graph structure]

This graph structure meant that dask did not view each getcol as an independent graph root, so it tended to traverse the graph in a breadth-first pattern. In the predict case, where large parallel reductions over sources are performed, this would lead to out-of-memory errors. See, for example, https://github.com/paoloserra/crystalball/issues/15#issuecomment-557492139 and https://github.com/paoloserra/crystalball/pull/33#issuecomment-559133527.

This Pull Request removes the ROWID and row run calculations as explicit nodes in the graph and replaces them with an internal caching mechanism. This means that individual getcol operations are viewed as independent by the dask scheduler:

[Figure: new graph structure]
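
The change in graph structure can be sketched with plain Python dictionaries laid out in the style of a dask task graph (key mapped to a tuple of callable and arguments). This is only an illustration under stated assumptions: `row_runs`, `getcol`, `getcol_cached`, the key names and the use of `functools.lru_cache` are all hypothetical stand-ins, not the dask-ms implementation.

```python
from functools import lru_cache

CALLS = {"row_runs": 0}  # counts how often the row runs are actually computed


def row_runs(group):
    """Hypothetical stand-in for the ROWID / row run calculation of a group."""
    CALLS["row_runs"] += 1
    return [(group * 100, 10)]  # made-up (start, length) run


def getcol(column, runs):
    """Hypothetical stand-in for reading one column over the given row runs."""
    return (column, tuple(runs))


# Before: the row runs are an explicit node shared by every getcol task.
old_graph = {
    "runs-0": (row_runs, 0),
    "getcol-DATA-0": (getcol, "DATA", "runs-0"),
    "getcol-UVW-0": (getcol, "UVW", "runs-0"),
}


@lru_cache(maxsize=None)
def cached_row_runs(group):
    """Row runs are looked up in an internal cache instead of the graph."""
    return tuple(row_runs(group))


def getcol_cached(column, group):
    return getcol(column, cached_row_runs(group))


# After: each getcol task has no dependencies inside the graph.
new_graph = {
    "getcol-DATA-0": (getcol_cached, "DATA", 0),
    "getcol-UVW-0": (getcol_cached, "UVW", 0),
}


def dependencies(graph, key):
    """Return the graph keys that the task stored at `key` refers to."""
    _, *args = graph[key]
    return {a for a in args if isinstance(a, str) and a in graph}


def roots(graph):
    """Return the tasks that depend on no other task in the graph."""
    return sorted(k for k in graph if not dependencies(graph, k))
```

In this sketch `roots(old_graph)` contains only the shared row run node, while in `new_graph` every getcol task is a root, and the cache ensures the row runs for a group are still computed only once.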

Note that prior to the rewrite in https://github.com/ska-sa/dask-ms/pull/41, the graph structure would have been similar, although the rowids and row runs would have been embedded in the graph as numpy arrays.
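
For readers unfamiliar with the term, a row run can be pictured as a compact description of consecutive rowids. The sketch below assumes a run is a `(start, length)` pair; that representation and the function name are assumptions made for illustration and may differ from what dask-ms does internally.

```python
def row_runs(rowids):
    """Collapse sorted row ids into (start, length) runs of consecutive ids."""
    runs = []
    for rowid in rowids:
        if runs and rowid == runs[-1][0] + runs[-1][1]:
            # rowid directly follows the last run, so extend that run by one
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            # gap in the row ids, so start a new run
            runs.append((rowid, 1))
    return runs


print(row_runs([0, 1, 2, 5, 6, 9]))  # [(0, 3), (5, 2), (9, 1)]
```

Reading a column per run rather than per row keeps the number of getcol calls proportional to the number of contiguous ranges.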

Graph Optimisations #75