opensearch-project / opensearch-py-ml

Apache License 2.0

[BUG] DataFrame.to_pandas generates duplicates #382

Open cbeaujoin-stellar opened 4 months ago

cbeaujoin-stellar commented 4 months ago

What is the bug?

DataFrame.to_pandas returns duplicate rows when os_index_field is set to a value other than "_doc".

How can one reproduce the bug?

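        import opensearch_py_ml as oml

        # client, index and columns are defined elsewhere
        opensearch_df = oml.DataFrame(client, index, columns=columns, os_index_field="@timestamp")
        index_df = opensearch_df.to_pandas(show_progress=True)
        # Select every row that has at least one exact copy
        dup = index_df[index_df.duplicated(keep=False)]
        print(len(dup))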

=> Loading index:

        2024-03-01 16:02:58.179774: read 10000 rows
        2024-03-01 16:03:07.520786: read 14895 rows
        4930
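For reference, the last number printed is the count of rows that have at least one exact copy, because duplicated(keep=False) flags every member of a duplicate group rather than only the extra copies. Here is a minimal standard-library sketch of that counting rule. The helper name and the sample rows are made up for illustration and are not taken from the index in this report.

```python
from collections import Counter


def count_duplicated_rows(rows):
    """Count rows that occur more than once, counting every copy.

    This follows the same rule as pandas' duplicated(keep=False):
    all members of a duplicate group are counted, not just the extras.
    """
    counts = Counter(rows)
    return sum(n for n in counts.values() if n > 1)


# Made-up rows of (timestamp, value); hypothetical data for illustration only.
sample_rows = [
    ("2024-03-01T16:00:00", "a"),
    ("2024-03-01T16:00:01", "b"),
    ("2024-03-01T16:00:00", "a"),  # exact copy of the first row
    ("2024-03-01T16:00:02", "c"),
]

print(count_duplicated_rows(sample_rows))
```

With this rule, a row that was read twice contributes 2 to the total, so the printed count is an upper bound on the number of surplus rows.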

What is the expected behavior?

opensearch_py_ml/operations.py:1229

    def to_pandas(
        self, query_compiler: "QueryCompiler", show_progress: bool = False
    ) -> pd.DataFrame:
...
        for df in self.search_yield_pandas_dataframes(query_compiler=query_compiler):

to_pandas should call search_yield_pandas_dataframes with the sort_index parameter set to the os_index_field value defined on the oml.DataFrame.
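To make the suggestion concrete, here is a rough sketch of the call I have in mind. It uses stand-in classes, not the library's code. StubQueryCompiler, StubOperations, the index_field attribute and the "_doc" default for sort_index are all assumptions made for illustration, and pages are plain lists of dicts instead of pandas DataFrames so the sketch runs without a cluster.

```python
from typing import Iterator, List, Optional


class StubQueryCompiler:
    """Hypothetical stand-in that only carries the index field name."""

    def __init__(self, index_field: str) -> None:
        self.index_field = index_field


class StubOperations:
    """Hypothetical stand-in for the class that owns to_pandas."""

    def __init__(self) -> None:
        self.seen_sort_index: Optional[str] = None

    def search_yield_pandas_dataframes(
        self, query_compiler: StubQueryCompiler, sort_index: Optional[str] = "_doc"
    ) -> Iterator[List[dict]]:
        # The real method pages through search results. This stub only
        # records the sort key it was given and yields one fake page.
        self.seen_sort_index = sort_index
        yield [{"@timestamp": "2024-03-01T16:00:00"}]

    def to_pandas(self, query_compiler: StubQueryCompiler) -> List[dict]:
        rows: List[dict] = []
        # Proposed change: forward the DataFrame's index field as the sort key
        # instead of relying on the default.
        for page in self.search_yield_pandas_dataframes(
            query_compiler=query_compiler,
            sort_index=query_compiler.index_field,
        ):
            rows.extend(page)
        return rows


ops = StubOperations()
result = ops.to_pandas(StubQueryCompiler(index_field="@timestamp"))
print(ops.seen_sort_index)
```

Whether this alone removes the duplicates is untested here. The sketch only shows where the sort key would be passed through.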

What is your host/environment?

dhrubo-os commented 4 months ago

Hi @cbeaujoin-stellar, thanks for creating the issue. Please feel free to raise a PR if you want.

cbeaujoin-stellar commented 1 month ago

Any update?