lincc-frameworks / tape

[Deprecated] Package for working with LSST time series data
https://tape.readthedocs.io
MIT License

Make head() robust to empty partitions by default #380

Open dougbrn opened 8 months ago

dougbrn commented 8 months ago

With Dask join/filter workflows, empty partitions are a common occurrence. Users will usually want to call head to inspect their result, but this often yields an empty dataframe because head only searches the first partition by default. This is confusing for new users, who may think they have lost all their data. We should update head to account for empty partitions by default and prioritize returning a result. We think adding a check_all=True kwarg to our implementation is appropriate, something like this:

# If check_all=True and the first `npartitions` partitions yield no rows, print an
# info message and then call head(n, npartitions=-1) on the remaining partitions.
def head(self, n=5, npartitions=1, check_all=True):
    result = super().head(n, npartitions=npartitions)
    # A DataFrame has no unambiguous truth value, so test emptiness with len().
    if len(result) == 0 and check_all and npartitions != -1 and npartitions < self.npartitions:
        print(f"The first {npartitions} partition(s) were empty, checking remaining partitions...")
        # npartitions=-1 tells head to search every remaining partition.
        result = self.partitions[npartitions:].head(n, npartitions=-1)
    return result

Alternatively, head was recently added to LSDB, and that implementation is robust to empty partitions, with the additional feature that it searches partition by partition until it has collected the requested number of rows. It may be best to just align with their implementation: https://github.com/astronomy-commons/lsdb/blob/12271382ee6953c32d4422f0e777d05c0d1bd8f0/src/lsdb/catalog/catalog.py#L70
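For reference, here is a minimal sketch of the partition-by-partition idea. It is not LSDB's actual code: the function name head_across_partitions is hypothetical, and plain Python lists stand in for computed partitions so the snippet runs without Dask. In a real implementation each partition would presumably be computed lazily, one at a time.

```python
from typing import Iterable, List, Sequence, TypeVar

Row = TypeVar("Row")


def head_across_partitions(partitions: Iterable[Sequence[Row]], n: int = 5) -> List[Row]:
    """Collect up to n rows, scanning partitions in order and skipping empty ones."""
    rows: List[Row] = []
    if n <= 0:
        return rows
    for partition in partitions:
        remaining = n - len(rows)
        # An empty partition contributes nothing; the loop simply moves on.
        rows.extend(partition[:remaining])
        if len(rows) >= n:
            break  # stop early so later partitions are never touched
    return rows


# Hypothetical partitions left over after a filter: the first two are empty.
partitions = [[], [], ["a", "b"], [], ["c", "d", "e"]]
first_three = head_across_partitions(partitions, n=3)
print(first_three)
```

Compared with the check_all approach, this only touches as many partitions as are needed to fill n rows, at the cost of evaluating partitions sequentially.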

Behavior should be aligned between EnsembleFrame.head and Ensemble.head.