lincc-frameworks / nested-dask

Connector project to enable Dask on Nested-Pandas
https://nested-dask.readthedocs.io/en/latest/
MIT License

`dropna` on_nested breaks when rows have duplicate indices. #60

Closed: smcguire-cmu closed this issue 1 week ago

smcguire-cmu commented 2 months ago

Bug report

I noticed this when we changed from _hipscat_index to _healpix_29. When there are duplicate object indices, calling `dropna` with on_nested produces a data frame in which the nested columns of all rows sharing an index value are combined.

Reproducing example:

import nested_pandas as npd
import numpy as np
import nested_dask as nd
import pandas as pd
import pyarrow as pa

a = npd.NestedFrame({"a": [1, 2, 3, np.nan, 5]}, index=[0, 0, 1, 1, 2])
b = npd.NestedFrame({"b": [1,2,3], "index": [0,1,1]}, index=[0,1,2])

ndf = b.add_nested(a, name="test")
ndf = ndf.set_index("index")
ndf.dropna(on_nested="test").iloc[1]["test"]

In the case above, before `dropna` is called, ndf has two rows with index 1. After the `dropna` call, the nested columns of these two rows are combined, and the combined result appears in both rows.

In lsdb, I worked around this by calling reset_index before dropna, then setting the index again afterwards.
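A minimal sketch of that workaround, not the actual lsdb code: the helper name `dropna_with_temporary_index` is hypothetical, and the demo uses a plain pandas DataFrame so it runs without nested-pandas installed. It assumes the frame supports `reset_index`, `dropna`, and `set_index` the way a pandas DataFrame does.

```python
import pandas as pd


def dropna_with_temporary_index(frame, **dropna_kwargs):
    """Hypothetical helper: run dropna while the frame has a unique index.

    The (possibly duplicated) index is moved into a column, dropna runs
    against the resulting unique RangeIndex, and the original index is
    then restored from that column.
    """
    # reset_index() names the new column after the index, or "index"
    # when the index is unnamed.
    index_name = frame.index.name if frame.index.name is not None else "index"
    flat = frame.reset_index()
    cleaned = flat.dropna(**dropna_kwargs)
    return cleaned.set_index(index_name)


# Plain-pandas demo with a duplicated index value (1 appears twice).
df = pd.DataFrame(
    {"value": [1.0, None, 3.0]},
    index=pd.Index([0, 1, 1], name="index"),
)
result = dropna_with_temporary_index(df, subset=["value"])
print(result)
```

For the NestedFrame in the reproducing example, the equivalent call would presumably be `dropna_with_temporary_index(ndf, on_nested="test")`, on the assumption that `reset_index` and `set_index` return a NestedFrame there. If the frame already has a column named "index" and an unnamed index, pandas names the reset column differently and this sketch would need adjusting.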


wilsonbb commented 1 week ago

This looks fixed to me in the newest nested-* releases.

[Screenshot attached: 2024-11-18, 3:53:55 PM]