ratt-ru / dask-ms

Implementation of a dask/xarray dataset backed by a CASA MS
https://dask-ms.readthedocs.io
Other
19 stars 7 forks source link

Graph Optimisations #75

Closed sjperkins closed 4 years ago

sjperkins commented 4 years ago
sjperkins commented 4 years ago

Prior to this PR, each getcol for a chunk of rows shared a number of common ancestors. In the case where grouping is performed on columns:

  1. A single common ancestor representing the rowids for the entire group.
  2. Each getcol for each array share row runs for the same range of rowids.

original.

This graph structure meant that dask did not view each getcol as an independent, graph root which meant that it tended to traverse the graph in a breadth-first pattern. In the case of the predict, where large parallel reductions over sources are performed, this would lead to OutOfMemory errors. See for e.g.https://github.com/paoloserra/crystalball/issues/15#issuecomment-557492139 and https://github.com/paoloserra/crystalball/pull/33#issuecomment-559133527.

This Pull Request removes the ROWID and row run calculations as explicit nodes in the graph and replaces them with an internal caching mechanism. This means that individual getcol operations are viewed as independent by the dask scheduler:

new

Note that prior to the rewrite in https://github.com/ska-sa/dask-ms/pull/41, the graph structure would have been similar, although the rowid's/row runs would have been embedded in the graph as numpy arrays.