modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

Interoperability between DataFrames using different query compilers #7308

Open YarShev opened 2 weeks ago

YarShev commented 2 weeks ago

Originally posted by @arunjose696 in #7259.

With the introduction of the small query compiler, we need to test the interoperability between DataFrames using different query compilers. For example, performing a binary operation between a DataFrame with the small query compiler and another with the Pandas query compiler. (Note: This feature is not yet included in this PR.)

This will require modifying or adding new tests. In the current tests in the modin/modin/tests/pandas/dataframe folder, we have the following scenarios where two DataFrames interact:

1) Derived DataFrames: In tests where the second DataFrame is created or derived from the first, e.g., test_join_empty, we need to refactor these tests so that the second DataFrame is created separately from the first and with MODIN_NATIVE_DATAFRAME_MODE set.

2) Lambda Functions: In tests where the other DataFrame is created within a lambda function, e.g., test_divmod, we need to refactor these tests to either create the second DataFrame in the test definition itself or provide an additional wrapper for the lambda functions to ensure the DataFrame is created with a different query compiler.

3) Separate DataFrames: In tests where two separate DataFrames are used, e.g., test_where, we need to refactor these tests to flip MODIN_NATIVE_DATAFRAME_MODE between None and Native_pandas when creating the first and second DataFrames. This ensures that both the left and right operands are tested with different query compilers for interoperability. The same flipping would also be required in cases 1 and 2 once the DataFrames are separated.
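
To make the "flipping" in scenario 3 concrete, here is a rough, standard-library-only sketch of how a test helper might create the left and right operands under different modes. It is a sketch under assumptions, not Modin's actual test utilities: `native_df_mode`, `mode_pairs`, `create_operands`, and the `make_df` factory are hypothetical names; the mode values are taken as written above, with None treated as "variable unset"; and it assumes Modin reads the environment variable at DataFrame creation time, which may not hold once the config has been loaded.

```python
import itertools
import os
from contextlib import contextmanager

ENV_VAR = "MODIN_NATIVE_DATAFRAME_MODE"
# Mode names as written in this thread; the exact accepted values are an assumption.
# None is interpreted here as "environment variable not set".
MODES = [None, "Native_pandas"]


@contextmanager
def native_df_mode(mode):
    """Temporarily set (or unset, for None) the env var, then restore it."""
    previous = os.environ.get(ENV_VAR)
    try:
        if mode is None:
            os.environ.pop(ENV_VAR, None)
        else:
            os.environ[ENV_VAR] = mode
        yield
    finally:
        if previous is None:
            os.environ.pop(ENV_VAR, None)
        else:
            os.environ[ENV_VAR] = previous


def mode_pairs(include_same=False):
    """Return (left_mode, right_mode) pairs; mixed pairs only by default."""
    pairs = itertools.product(MODES, repeat=2)
    return [(left, right) for left, right in pairs if include_same or left != right]


def create_operands(make_df, left_mode, right_mode):
    """Build each operand under its own mode using a hypothetical factory."""
    with native_df_mode(left_mode):
        left = make_df()
    with native_df_mode(right_mode):
        right = make_df()
    return left, right
```

A test such as test_where could then be parametrized over `mode_pairs()` so that each query compiler appears once as the left operand and once as the right operand.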

Upon reviewing the modin/modin/tests/pandas/dataframe folder, I found approximately 100 tests involving scenarios where two DataFrames interact. These tests may need refactoring or copying to a different directory and updating to specifically test interoperability.

@YarShev @anmyachev @devin-petersohn, could you please provide suggestions on how to approach testing the interoperability?

YarShev commented 2 weeks ago

@arunjose696, thanks for your research. I think we should copy those tests to a different directory (e.g., modin/tests/pandas/native_df_mode) and update them to specifically test interoperability. This way, we would not bloat up existing tests and would make navigation for interoperability tests easier.

devin-petersohn commented 2 weeks ago

In terms of testing, I think unit tests are the way to go. We don't need to test every combination of APIs, as long as the conversion is working properly. We can add some canary tests on one or two APIs to ensure that everything works end to end. Does this make sense?

YarShev commented 2 weeks ago

@devin-petersohn, thanks for the suggestion! It does make sense. We can start with unit tests to verify that each API works with a single set of parameters.

arunjose696 commented 2 weeks ago

After the first implementation of the small QC is done, I will open a PR for interoperability with unit tests to verify that the API works with a single set of parameters.

For the first implementation (https://github.com/modin-project/modin/pull/7259), would it suffice for now to run the tests in the modin/modin/tests/pandas/dataframe folder with MODIN_NATIVE_DATAFRAME_MODE set, to verify that the query compiler works for DataFrames, or should we add unit tests even for the initial implementation?

YarShev commented 2 weeks ago

It seems to me we could have unit tests even for the first implementation. Just copy the tests from the dataframe folder to another one and leave a single set of parameters for every test. @devin-petersohn, what do you think?

devin-petersohn commented 1 week ago

That makes sense to me. Thanks @YarShev and @arunjose696 !