lvgig / tubular

Python package implementing transformers for pre processing steps for machine learning.
https://tubular.readthedocs.io/en/latest/index.html
BSD 3-Clause "New" or "Revised" License
38 stars 14 forks source link

Testing strategy for Narwhalified transformers #318

Open davidhopkinson26 opened 1 week ago

davidhopkinson26 commented 1 week ago

What

Identify the appropriate testing strategy for transformers which use narwhals for pandas/polars compatibility.

Why?

We currently test using pandas dataframes as inputs across the package. The behaviour this asserts is to be preserved with the shift to using Narwhals/ polars but we will additionally need to check expected outputs for polars dataframes.

This is a spike ticket to investigate the most appropriate way to do this.

How?

Do we need to test both or is this adequately covered by the narwhals testing suite?

If we do need to test both how can this be done in such a way that development overhead is minimised?

Can we simply parameterise expected outputs tests with polars/pandas dataframes?

limlam96 commented 1 week ago

some initial thoughts:

I can POC this on the open BaseTransformer PR and get feedback

limlam96 commented 1 week ago

that's up here https://github.com/lvgig/tubular/pull/328

davidhopkinson26 commented 1 week ago

some initial thoughts:

  • can use pytest.mark.parametrise to pass an arg to minimal_dataframe_lookup in order to test both polars/pandas
  • may need to parametrise all test_data functions in the same way
  • and tests that interact with dfs, e.g. inserting bad values to raise errors, would need to be rewritten in narwhals to be flexible to both inputs

Think we will definitely need to do this last point at minimum. Possible one could make a case that unit testing of the to/from_native funcitons in polars would be enough to consider the pandas case covered if we unit test behaviour with polars inputs. Don't think the same argument can be made for the other way round.

Ideal case is to natively unit test functionality with pandas and polars but wary of the overhead this could create. POC will hopefully be quite revealing!