lvgig / tubular

Python package implementing transformers for pre processing steps for machine learning.
https://tubular.readthedocs.io/en/latest/index.html
BSD 3-Clause "New" or "Revised" License
38 stars 14 forks source link

Consistent args and generic tests for columns: str or list transformers #150

Open davidhopkinson26 opened 9 months ago

davidhopkinson26 commented 9 months ago

What? This is a ticket for homogenising and implementing the generic tests for the transformers which take columns: str or list. These makes up the majority of transformers. A list of the transformers can be read from the attached spreadsheet (see end of description).

Why? Have identified as part of scoping #138 and #68 that for usability it makes sense to keep some inconsistency in transformers columns arguments. However for majority of transformers where columns can be a string or list of column names we can create the generic testing framework discussed in #138 in place of testing a base transformer and then implicitly testing functionality by testing calls to super() methods in child classes.

Other transformers can be grouped together in another issue, made consistent and tested together separately (e.g transformers which take two column inputs, transformers which take only a list)

How? For all cases where it makes sense, ensure the argument defining which columns the transformer operates on is called 'columns' and is able to take a str or list input.

We will also include those which have columns: str, list or None. For now It seems best to remove the 'None' option as it is only included sporadically. (In future we may want to add in functionality to check column types and compatibility with transformer and run on all columns which pass these criteria for columns=None.)

Replace implementation tests in each transformer with a set of tests which loop over the identified transformers testing functionality defined in BaseTransformer.

Transformer Consistent Args Audit.xlsx

TommyMatthews commented 9 months ago

For reference, there are 29 transformers in the columns str or list group, 17 of which need None removing and 7 of which will need more significant refactoring.

TommyMatthews commented 7 months ago

Overview:

This details the steps required to refactor the testing in the tubular package so that we test behaviour, not implementation.

In order to do this, we have created a tree of inheritable test classes. Between them these test classes contain tests for the full suite of behaviours that will be inherited when you inherit from the transformer in question, allowing these behaviours to be tested explicitly, rather than implicitly through implementation tests such as test_super_transform_called .

Refactoring a test file to bring it into the new framework mainly consists of selecting the correct classes to inherit from when writing the unit tests. The inherited tests will all run automatically and test for the inherited behaviours, thus the only tests that need to explicitly be written into the test file are the tests specific to that transformer (e.g. expected output tests). The full process for bringing a transformer into this framework is detailed in the sub-page "Test refactor PR process".

When working through this checklist it is important to check that test classes have been written for the relevant inherited transformers before beginning the refactor of a child transformer, for example make sure that CappingTransformer has been done before it's child class OutOfRangeNullTransformer.

This work looks at every transformer and so provides a good opportunity to audit Tubular! If you see anything you think could be improve, or spot any redundancy or duplicity in the package, please raise a github issue and give David H a msg (smile)

Checklist (tickets to be added):

Base:

BaseTransformer DataFrameMethodTransformer

Capping:

CappingTransformrmer OutOfRangeNullTransformer (do Capping first!)

Comparison

EqualityChecker

Dates:

These need a bit more thought - BaseDateTransformer only provides helper functions - not obvious how to check that the desired behaviour is obtained in child classes without repeating test code.

BaseDateTransformer DateDiffLeapYearTransformer DateDifferenceTransformer ToDatetimeTransformer SeriesDtMethodTransformer BetweenDatesTransformer DatetimeInfoExtractor

Imputers:

BaseImputer ArbitraryImputer MedianImputer MeanImputer ModeImputer NearestMeanResponseImputer

Mapping:

BaseMappingTransformer BaseMappingTransformMixin MappingTransformer CrossColumnMappingTransformer CrossColumnMultiplyTransfornmer CrossColumnAddTransformer

Misc:

SetValueTransformer SetColumnDtype

Nominal

BaseNominalTransformer NominalToIntegerTransformer GroupRareLevelsTransformer MeanResponseTransformer OrdinalEncoderTransformer OneHotEncodingTransformer

Numeric

LogTransformer CutTransformer TwoColumnOperatorTransformer ScalingTransformer InteractionTransformer PCATransformer

Strings

SeriesStrMethodTransformer StringConcatenator

TommyMatthews commented 7 months ago

Test Refactor PR Process:

This is the procedure for ticking a single sub group of transformers from the checklist in the overview page. Included are example PRs for each category of transformer.

Contents:

Process overview: step by step approach Example PRs: example PRs covering the different scenarios Consistent arg standards Inheriting the appropriate test classes

Process overview:

For each transformer you need to:

Example PRs: (TO ADD)

Str or list: Two column: Columns from dict: Special case:

Consistent arg standards:

Columns str or list:

Transformers taking columns as a single string or list should have a columns and, where relevant, new_column_names arguments. The transformer should operate on each column in columns (performing the same operation each time). Remove None as an option for columns - we are no longer keeping this functionality.

Two column:

Transformers taking two columns should have an upper_column and lower_column argument - there is no need for operating on multiple pairs of columns at once at this stage.

From dict:

These transformers read the columns in from e.g. a mapping dictionary. They can be left as they are - they're already all consistent.

Other cases:

Bring them inline with the above as close as possible but don't worry too much.

Inheriting the appropriate test classes:

Some thought is required to inherit the appropriate test classes but if should be fairly clear by looking at the inheritance tree of the transformer you are testing - the only quirks are:

If the transformer inherits directly from BaseTransformer, you will have to pick the correct interface class given how the transformer takes the columns input. (e.g. ColumnsFromDictInitTests) If your transformer is not a direct child of BaseTransformer, but the intermediate class only affects the e.g. transform method, the transform class may inherit from the direct parent transform test class but the other methods will inherit from the BaseTransformer test classes.

TommyMatthews commented 7 months ago

Please make a Jira ticket for each transformer when you start work

limlam96 commented 5 months ago

Capturing state-of-play

NOTE: this work is now tracked by this milestone

Base: BaseTransformer DataFrameMethodTransformer UpperColumnLowerColumnInitTests PR

Capping: BaseCappingTransformer PR CappingTransformer PR OutOfRangeNullTransformer (do Capping first!) PR

Comparison: EqualityChecker PR

Dates (grouping these in larger issue initially as more complicated (Issue) ) Refactor module in prep for test refactors PR BaseDateTransformer #273 DateDiffLeapYearTransformer DateDifferenceTransformer ToDatetimeTransformer SeriesDtMethodTransformer BetweenDatesTransformer DatetimeInfoExtractor

Imputers:

BaseImputer ArbitraryImputer PR MedianImputer PR MeanImputer PR ModeImputer PR NearestMeanResponseImputer Issue

Mapping: BaseMappingTransformer BaseMappingTransformMixin MappingTransformer Base CrossColumnMapping Test Classes PR1, PR2 CrossColumnMappingTransformer PR CrossColumnMultiplyTransfornmer PR CrossColumnAddTransformer PR

Misc: SetValueTransformer PR SetColumnDtype PR

Nominal: BaseNominalTransformer PR NominalToIntegerTransformer Issue GroupRareLevelsTransformer Issue MeanResponseTransformer PR OrdinalEncoderTransformer Issue OneHotEncodingTransformer Issue

Numeric: LogTransformer PR CutTransformer #272 TwoColumnOperatorTransformer #274 ScalingTransformer #275 InteractionTransformer #276 PCATransformer #277

Strings: SeriesStrMethodTransformer #278 StringConcatenator #279

Total complete = 26/43

limlam96 commented 1 week ago

@davidhopkinson26 same Q here - I think this is now fully captured by individual issues, am I okay to just close?