modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0
9.74k stars 651 forks source link

Improve the user facing DataFrame and Series constructor by hide the query_compiler parameter #7366

Open sfc-gh-yzou opened 1 month ago

sfc-gh-yzou commented 1 month ago

Today, the constructor for dataframe looks like the following

def __init__(
        self,
        data=None,
        index=None,
        columns=None,
        dtype=None,
        copy=None,
        query_compiler: BaseQueryCompiler = None,
    ) -> None:

which contains a parameter query_compiler, and occurs in the generated documentation. However, we don't really want user to use this parameter, it is just for our internal construction usage.

In pandas, they allow construction of dataframe/series directly from BlockManager class, and the BlockManager is a property on NDFrame class https://github.com/pandas-dev/pandas/blob/main/pandas/core/generic.py#L257. Pandas provides an internal method _from_mgr https://github.com/pandas-dev/pandas/blob/main/pandas/core/generic.py#L309 to allow construction directly from BlockManager. In the frontend, pandas actually handles when data is a BlockManager directly, but it is not documented in the API.

In modin, maybe we can do something similar, where we can push the query_compiler construction to the BasePandasDataset, and provide an internal constructer _from_query_compiler to allow direct creation of the class from query compiler. In that way, we will also be consistent with the pandas constructor definition.

sfc-gh-yzou commented 1 month ago

cc @devin-petersohn @noloerino Do you think we can potentially improve this user facing behavior?

noloerino commented 1 month ago

I would slightly prefer changing data to accept BaseQueryCompiler and performing an isinstance(data, BaseQueryCompiler) check, since we already do isinstance checks to see if the data is an array or other modin frontend object. I'd also be fine with adding a DataFrame/Series._from_query_compiler static method for internal use like you suggested.

sfc-gh-yzou commented 1 month ago

@noloerino we can let data to accept the query compiler interface, but if we do that, let's make sure the QueryCompiler doesn't occur in any of the user documentation, so that we can keep that completely hidden from the user.