hail-is / hail

Cloud-native genomic dataframes and batch computing
https://hail.is
MIT License

MakeNDArray OOM on stream data #14559

Closed ehigham closed 3 months ago

ehigham commented 3 months ago

What happened?

From: https://discuss.hail.is/t/connectionerror-after-mt-aggregate-cols-hl-agg-collect-and-hl-nd-array-in-linear-skat/3839

The following code snippet fails with an out-of-memory (OOM) error. Interestingly, I can only reproduce it with _localize=False, which suggests a problem with our Emit rule for MakeNDArray when the data IR is of type TStream.

import hail as hl

hl.init()

mt = hl.utils.range_matrix_table(n_rows=7944, n_cols=442075)
covariates = [1.0]

mt = mt.select_cols(covariates=covariates)
covmat = mt.aggregate_cols(
    hl.agg.collect(mt.covariates.map(hl.float)),
    _localize=False,
)

hl.nd.array(covmat).show() # boom

Version

0.2.130

Relevant log output

No response

ehigham commented 3 months ago

It's hard to pinpoint the true cause of this bug. There are two candidates:

  1. We're doing an excessive amount of copying and resizing of the output ndarray because we've lost information about the bounds of the ndarray. Arguably that's a harder problem in real-world contexts.
  2. We're computing the aggregation twice owing to a problem in CSE (common subexpression elimination): the same underlying collection IR is used to compute both the shape and the flattened array of data. If you modify the last line of the code above to be:
    hl.bind(hl.nd.array, covmat).show()

    the program completes successfully.
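
To illustrate point 2, here is a toy model in plain Python of what happens when the same expensive subexpression feeds both the shape and the data without being deduplicated. It is only a sketch of the general idea, not Hail's IR or Emit code: `CountingSource`, `make_ndarray_naive` and `make_ndarray_bound` are hypothetical names made up for this example.

```python
# Toy model only: none of these names come from Hail's codebase.


class CountingSource:
    """Stands in for an expensive aggregation that yields a stream of values."""

    def __init__(self, n):
        self.n = n
        self.runs = 0  # how many times the "aggregation" has been executed

    def stream(self):
        # Each call represents re-running the whole aggregation.
        self.runs += 1
        return (float(i) for i in range(self.n))


def make_ndarray_naive(source):
    # Shape and data come from two separate traversals of the same source,
    # mirroring a shared subexpression that was not deduplicated.
    shape = (sum(1 for _ in source.stream()),)
    data = list(source.stream())
    return shape, data


def make_ndarray_bound(source):
    # Evaluate the source once, bind the result to a name, and reuse it for
    # both shape and data (the role a let-binding plays).
    bound = list(source.stream())
    return (len(bound),), bound


naive_src = CountingSource(1000)
bound_src = CountingSource(1000)
naive = make_ndarray_naive(naive_src)
bound = make_ndarray_bound(bound_src)

print("naive runs:", naive_src.runs)
print("bound runs:", bound_src.runs)
```

Both functions build the same result, but the naive one runs the source twice. In this model, binding the collected value first plays the role that hl.bind plays in the workaround.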