Open cbeaujoin-stellar opened 4 months ago
What is the bug?
DataFrame.to_pandas returns duplicate rows when os_index_field is set to a value other than "_doc".
How can one reproduce the bug?
import opensearch_py_ml as oml

# client, index and columns are defined elsewhere
opensearch_df = oml.DataFrame(client, index, columns=columns, os_index_field="@timestamp")
index_df = opensearch_df.to_pandas(True)
dup = index_df[index_df.duplicated(keep=False)]
print(len(dup))
=> Loading index:
2024-03-01 16:02:58.179774: read 10000 rows
2024-03-01 16:03:07.520786: read 14895 rows
4930
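To make the duplicate count easier to interpret, here is a minimal pandas-only sketch of the same check. The frame is a made-up stand-in for the result of to_pandas(), not data from the affected index. It shows that duplicated(keep=False) flags every member of a duplicate group, and that the check compares column values only, so the index has to be moved into a column if it should be compared too.

```python
import pandas as pd

# Toy stand-in for the frame returned by to_pandas(); all values are made up.
index_df = pd.DataFrame(
    {"host": ["a", "b", "a", "c", "b"], "status": [200, 404, 200, 500, 404]},
    index=["id1", "id2", "id1", "id3", "id2"],
)

# keep=False marks every member of a duplicate group, not only the repeats.
dup = index_df[index_df.duplicated(keep=False)]
print(len(dup))

# duplicated() ignores the index; reset_index() turns it into a regular
# column so that it takes part in the comparison as well.
with_index = index_df.reset_index()
dup_with_index = with_index[with_index.duplicated(keep=False)]

# keep="first" counts only the surplus copies, i.e. the rows that
# should not have been returned.
extra_rows = int(index_df.duplicated(keep="first").sum())
print(extra_rows)
```

With keep=False, the reported number includes the first occurrence of each repeated row, so the number of surplus rows is smaller than the number printed in the reproduction.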
What is the expected behavior?
opensearch_py_ml/operations.py:1229
def to_pandas(
    self, query_compiler: "QueryCompiler", show_progress: bool = False
) -> pd.DataFrame:
    ...
    for df in self.search_yield_pandas_dataframes(query_compiler=query_compiler):
search_yield_pandas_dataframes should be called with the sort_index parameter set to the os_index_field value defined in the oml.DataFrame.
What is your host/environment?
Hi @cbeaujoin-stellar, thanks for creating the issue. Please feel free to raise a PR if you want.
Any update?