Closed sjperkins closed 4 years ago
Prior to this PR, each getcol for a chunk of rows shared a number of common ancestors. In the case where grouping is performed on columns:
.
This graph structure meant that dask did not view each getcol as an independent, graph root which meant that it tended to traverse the graph in a breadth-first pattern. In the case of the predict, where large parallel reductions over sources are performed, this would lead to OutOfMemory errors. See for e.g.https://github.com/paoloserra/crystalball/issues/15#issuecomment-557492139 and https://github.com/paoloserra/crystalball/pull/33#issuecomment-559133527.
This Pull Request removes the ROWID and row run calculations as explicit nodes in the graph and replaces them with an internal caching mechanism. This means that individual getcol operations are viewed as independent by the dask scheduler:
Note that prior to the rewrite in https://github.com/ska-sa/dask-ms/pull/41, the graph structure would have been similar, although the rowid's/row runs would have been embedded in the graph as numpy arrays.
[x] Tests added / passed
If the pep8 tests fail, the quickest way to correct this is to run
autopep8
and thenflake8
andpycodestyle
to fix the remaining issues.[x] Fully documented, including
HISTORY.rst
for all changes and one of thedocs/*-api.rst
files for new APITo build the docs locally: