modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

FEAT-#4605: Adding small query compiler #7259

Open arunjose696 opened 1 month ago

arunjose696 commented 1 month ago

What do these changes do?

arunjose696 commented 1 month ago

Great start on solving this problem! Is it possible to avoid so many of the test changes?

Most of the changes in the tests disable a few checks, since those won't be supported without partitions and the current changes don't yet support IO like pd.read_csv(). Is there something specific that should be avoided?

devin-petersohn commented 1 month ago

is there something specific that should be avoided?

Nothing specific, I was just trying to understand context. Thanks!

anmyachev commented 3 weeks ago

@arunjose696 please rebase on main

arunjose696 commented 2 weeks ago

With the introduction of the small query compiler, we need to test the interoperability between DataFrames using different query compilers. One example is a binary operation between a DataFrame using the small query compiler and another using the Pandas query compiler. (Note: This feature is not yet included in this PR.)

This will require modifying or adding new tests. In the current tests in the modin/modin/tests/pandas/dataframe folder, we have the following scenarios where two DataFrames interact:

1) Derived DataFrames: In tests where the second DataFrame is created or derived from the first, e.g. test_join_empty, we need to refactor these tests so that the second DataFrame is created separately from the first and with MODIN_NATIVE_DATAFRAME_MODE set.

2) Lambda Functions: In tests where the other DataFrame is created within a lambda function, e.g. test_divmod, we need to refactor these tests to either create the second DataFrame in the test definition itself or provide an additional wrapper for the lambda functions, so that the DataFrame is created with a different query compiler.

3) Separate DataFrames: In tests where two separate DataFrames are used, e.g. test_where, we need to refactor these tests so that MODIN_NATIVE_DATAFRAME_MODE is flipped between None and Native_pandas when creating the first and second DataFrames. This ensures that each query compiler is tested as both the left and the right operand. The same flipping is also required in cases 1 and 2 once the DataFrames are created separately.
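To make the flipping concrete, here is a minimal sketch of how the operands could be built independently, each under its own mode. It assumes the mode is read from the MODIN_NATIVE_DATAFRAME_MODE environment variable at the moment a DataFrame is created. If Modin caches the value or exposes it through a config object instead, the context manager body would need to change. The names `dataframe_mode`, `build_operands` and `MODE_PAIRS` are hypothetical helpers, not part of Modin or of this PR.

```python
import os
from contextlib import contextmanager
from itertools import permutations

ENV_VAR = "MODIN_NATIVE_DATAFRAME_MODE"  # name taken from the comment above
MODES = (None, "Native_pandas")          # values as written in the comment above


@contextmanager
def dataframe_mode(mode):
    """Temporarily set the mode variable (unset it for None), then restore it."""
    previous = os.environ.get(ENV_VAR)
    if mode is None:
        os.environ.pop(ENV_VAR, None)
    else:
        os.environ[ENV_VAR] = mode
    try:
        yield
    finally:
        # Put the environment back exactly as it was before the block.
        if previous is None:
            os.environ.pop(ENV_VAR, None)
        else:
            os.environ[ENV_VAR] = previous


def build_operands(make_left, make_right, left_mode, right_mode):
    """Create each operand separately, each under its own mode.

    make_left and make_right are zero-argument callables supplied by the
    test, e.g. a lambda that constructs a DataFrame from test data.
    """
    with dataframe_mode(left_mode):
        left = make_left()
    with dataframe_mode(right_mode):
        right = make_right()
    return left, right


# Both orderings, so each mode appears once on the left and once on the right.
MODE_PAIRS = list(permutations(MODES, 2))
```

A test for case 3 could then loop over `MODE_PAIRS` (or feed it to a parametrized test) and call `build_operands` with the two constructors. For cases 1 and 2, the derived or lambda-created DataFrame would first be turned into a `make_right` callable.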

Upon reviewing the modin/modin/tests/pandas/dataframe folder, I found approximately 100 tests involving scenarios where two DataFrames interact. These tests may need to be refactored in place, or copied to a different directory and updated to specifically test interoperability.

@YarShev @anmyachev @devin-petersohn, could you please provide suggestions on how to approach testing the interoperability?