Allow User to Force Ensemble.from_dask_dataframe to Overwrite Existing Indices

lincc-frameworks / tape

[Deprecated] Package for working with LSST time series data

https://tape.readthedocs.io

MIT License

12 stars 3 forks source link

Allow User to Force Ensemble.from_dask_dataframe to Overwrite Existing Indices #232

Open wilsonbb opened 1 year ago

wilsonbb commented 1 year ago

Currently if Ensemble.from_dask_dataframe loads a dataframe with an index with the same label as the column mapper, the following warning is produced:

dask/dataframe/core.py:5251: UserWarning: New index has same name as existing, this is a no-op.

Since computing indices is expensive, this is probably often a good thing. However it may be useful to allow users to choose whether or not an existing index should be overwritten. Alternatively we might just decide to clarify the documentation here.

wilsonbb commented 1 year ago

An example of the implications involved is that there are different ways to build an index on the same column with the same id.

So the user could call

source_frame.set_index("id", sort=False)

while in Ensemble.from_dask_dataframe we would set

source_frame.set_index("id")

Note that sort=True is the default, so given the above warning in the current code we would still use the original unsorted index since they share the same name.

wilsonbb commented 10 months ago

Quick notes on priority:

This seems relatively low priority and low effort.

I feel my thoughts on this issue have changed: we now probably only want to call set_index in from_dask_dataframe if indices on the id column haven't been set. Otherwise we just always use the provided index (as is the case with dataframes from LSDB) and don't give the user an unneeded warning message.

To address the case in the previous comment, if there is a present index on the id column and the ensemble was constructed with sorted=False and sort=True, we might want to have from_dask_dataframe call reset_index and then set_index("id", sort=True).