I noticed this when we changed from _hipscat_index to _healpix_29. When the frame has duplicate index values, calling dropna with on_nested returns a data frame in which the nested columns of all rows sharing an index value are combined.
Reproducing example:
import nested_pandas as npd
import numpy as np
import nested_dask as nd
import pandas as pd
import pyarrow as pa
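# Nested rows; the NaN belongs to index 1 (np.NAN was removed in NumPy 2.0, so use np.nan)
a = npd.NestedFrame({"a": [1, 2, 3, np.nan, 5]}, index=[0, 0, 1, 1, 2])
b = npd.NestedFrame({"b": [1, 2, 3], "index": [0, 1, 1]}, index=[0, 1, 2])
ndf = b.add_nested(a, name="test")
# Swap the unique index for one with a duplicate value (1 appears twice)
ndf = ndf.set_index("index")
# Inspect the nested frame of the second row after dropping NaNs
ndf.dropna(on_nested="test").iloc[1]["test"]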
In the example above, before dropna is called, ndf has two rows with index 1. After the dropna call, the nested columns of these two rows are combined, and the combined result appears in both rows.
In lsdb, I worked around this by calling reset_index before dropna and then setting the index again afterwards.
Before submitting
Please check the following:
[x] I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
[x] I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.
[x] If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.