Open wilsonbb opened 1 year ago
An example of the implications involved is that there are different ways to build an index on the same column with the same id.
So the user could call
source_frame.set_index("id", sort=False)
while in Ensemble.from_dask_dataframe
we would set
source_frame.set_index("id")
Note that sort=True
is the default, so given the above warning in the current code we would still use the original unsorted index since they share the same name.
Quick notes on priority:
This seems relatively low priority and low effort.
I feel my thoughts on this issue have changed: we now probably only want to call set_index
in from_dask_dataframe
if indices on the id column haven't been set. Otherwise we just always use the provided index (as is the case with dataframes from LSDB) and don't give the user an unneeded warning message.
To address the case in the previous comment, if there is a present index on the id column and the ensemble was constructed with sorted=False
and sort=True
, we might want to have from_dask_dataframe
call reset_index
and then set_index("id", sort=True)
.
Currently if
Ensemble.from_dask_dataframe
loads a dataframe with an index with the same label as the column mapper, the following warning is produced:dask/dataframe/core.py:5251: UserWarning: New index has same name as existing, this is a no-op.
Since computing indices is expensive, this is probably often a good thing. However it may be useful to allow users to choose whether or not an existing index should be overwritten. Alternatively we might just decide to clarify the documentation here.