With Dask join/filter workflows, empty partitions are a common occurrence. Users will usually want to call head to inspect their result, but this often yields an empty dataframe because head only searches the first partition by default. This is confusing for new users, who may think they have lost all their data. We should update head to account for potentially empty partitions by default and prioritize returning a result. We think a check_all=True kwarg is appropriate to add to our implementation, something like this:
# If check_all=True and the first npartitions yield an empty result, print an
# info message and then call head(n, npartitions=-1) on the remaining partitions.
def head(self, n, npartitions=1, check_all=True):
    result = super().head(n, npartitions)
    # `not result` is ambiguous for a DataFrame, so test the row count explicitly.
    if len(result) == 0 and npartitions != -1 and check_all and npartitions < self.npartitions:
        print(f"The first {npartitions} partition(s) were empty, checking remaining partitions...")
        # Slice the remaining partitions from self; super() has no partitions to index here.
        result = self.partitions[npartitions : self.npartitions].head(n, -1)
    return result
Alternatively, head was recently added to LSDB, and its implementation is robust to empty partitions, with the additional feature that it searches partition by partition until it has collected the requested number of rows. It may be best to just align with their implementation: https://github.com/astronomy-commons/lsdb/blob/12271382ee6953c32d4422f0e777d05c0d1bd8f0/src/lsdb/catalog/catalog.py#L70
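To make the partition-by-partition idea concrete, here is a minimal sketch of that search strategy. It is not LSDB's code: the function name head_across_partitions is hypothetical, and it assumes the partitions arrive as an iterable of already-computed pandas DataFrames.

```python
import pandas as pd


def head_across_partitions(partitions, n=5):
    """Return the first n rows found while scanning partitions in order.

    Hypothetical helper, not part of Dask or LSDB. `partitions` is any
    iterable of pandas DataFrames; it is consumed lazily, so scanning
    stops as soon as n rows have been collected.
    """
    collected = []
    remaining = n
    template = None  # zero-row frame used if every partition is empty
    for part in partitions:
        if template is None:
            template = part.head(0)
        if len(part) == 0:
            continue  # skip empty partitions instead of giving up
        chunk = part.head(remaining)
        collected.append(chunk)
        remaining -= len(chunk)
        if remaining <= 0:
            break  # enough rows; do not touch later partitions
    if not collected:
        return template if template is not None else pd.DataFrame()
    return pd.concat(collected)
```

With a Dask DataFrame ddf, one way to feed this sketch would be a generator such as (ddf.partitions[i].compute() for i in range(ddf.npartitions)), so that each partition is only computed when the scan actually reaches it.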
The behavior of EnsembleFrame.head and Ensemble.head should be aligned.