Open davidhopkinson26 opened 9 months ago
For reference, there are 29 transformers in the columns str or list group, 17 of which need None removing and 7 of which will need more significant refactoring.
Overview:
This details the steps required to refactor the testing in the tubular package so that we test behaviour, not implementation.
In order to do this, we have created a tree of inheritable test classes. Between them these test classes contain tests for the full suite of behaviours that will be inherited when you inherit from the transformer in question, allowing these behaviours to be tested explicitly, rather than implicitly through implementation tests such as test_super_transform_called .
Refactoring a test file to bring it into the new framework mainly consists of selecting the correct classes to inherit from when writing the unit tests. The inherited tests will all run automatically and test for the inherited behaviours, thus the only tests that need to explicitly be written into the test file are the tests specific to that transformer (e.g. expected output tests). The full process for bringing a transformer into this framework is detailed in the sub-page "Test refactor PR process".
When working through this checklist it is important to check that test classes have been written for the relevant inherited transformers before beginning the refactor of a child transformer, for example make sure that CappingTransformer has been done before it's child class OutOfRangeNullTransformer.
This work looks at every transformer and so provides a good opportunity to audit Tubular! If you see anything you think could be improve, or spot any redundancy or duplicity in the package, please raise a github issue and give David H a msg (smile)
Checklist (tickets to be added):
Base:
BaseTransformer DataFrameMethodTransformer
Capping:
CappingTransformrmer OutOfRangeNullTransformer (do Capping first!)
Comparison
EqualityChecker
Dates:
These need a bit more thought - BaseDateTransformer only provides helper functions - not obvious how to check that the desired behaviour is obtained in child classes without repeating test code.
BaseDateTransformer DateDiffLeapYearTransformer DateDifferenceTransformer ToDatetimeTransformer SeriesDtMethodTransformer BetweenDatesTransformer DatetimeInfoExtractor
Imputers:
BaseImputer ArbitraryImputer MedianImputer MeanImputer ModeImputer NearestMeanResponseImputer
Mapping:
BaseMappingTransformer BaseMappingTransformMixin MappingTransformer CrossColumnMappingTransformer CrossColumnMultiplyTransfornmer CrossColumnAddTransformer
Misc:
SetValueTransformer SetColumnDtype
Nominal
BaseNominalTransformer NominalToIntegerTransformer GroupRareLevelsTransformer MeanResponseTransformer OrdinalEncoderTransformer OneHotEncodingTransformer
Numeric
LogTransformer CutTransformer TwoColumnOperatorTransformer ScalingTransformer InteractionTransformer PCATransformer
Strings
SeriesStrMethodTransformer StringConcatenator
Test Refactor PR Process:
This is the procedure for ticking a single sub group of transformers from the checklist in the overview page. Included are example PRs for each category of transformer.
Contents:
Process overview: step by step approach Example PRs: example PRs covering the different scenarios Consistent arg standards Inheriting the appropriate test classes
Process overview:
For each transformer you need to:
Example PRs: (TO ADD)
Str or list: Two column: Columns from dict: Special case:
Consistent arg standards:
Columns str or list:
Transformers taking columns as a single string or list should have a columns and, where relevant, new_column_names arguments. The transformer should operate on each column in columns (performing the same operation each time). Remove None as an option for columns - we are no longer keeping this functionality.
Two column:
Transformers taking two columns should have an upper_column and lower_column argument - there is no need for operating on multiple pairs of columns at once at this stage.
From dict:
These transformers read the columns in from e.g. a mapping dictionary. They can be left as they are - they're already all consistent.
Other cases:
Bring them inline with the above as close as possible but don't worry too much.
Inheriting the appropriate test classes:
Some thought is required to inherit the appropriate test classes but if should be fairly clear by looking at the inheritance tree of the transformer you are testing - the only quirks are:
If the transformer inherits directly from BaseTransformer, you will have to pick the correct interface class given how the transformer takes the columns input. (e.g. ColumnsFromDictInitTests) If your transformer is not a direct child of BaseTransformer, but the intermediate class only affects the e.g. transform method, the transform class may inherit from the direct parent transform test class but the other methods will inherit from the BaseTransformer test classes.
Please make a Jira ticket for each transformer when you start work
Capturing state-of-play
NOTE: this work is now tracked by this milestone
Base:
BaseTransformer
DataFrameMethodTransformer
UpperColumnLowerColumnInitTests PR
Capping:
BaseCappingTransformer PR
CappingTransformer PR
OutOfRangeNullTransformer (do Capping first!) PR
Comparison:
EqualityChecker PR
Dates (grouping these in larger issue initially as more complicated (Issue) )
Refactor module in prep for test refactors PR
BaseDateTransformer #273
DateDiffLeapYearTransformer
DateDifferenceTransformer
ToDatetimeTransformer
SeriesDtMethodTransformer
BetweenDatesTransformer
DatetimeInfoExtractor
Imputers:
BaseImputer
ArbitraryImputer PR
MedianImputer PR
MeanImputer PR
ModeImputer PR
NearestMeanResponseImputer Issue
Mapping:
BaseMappingTransformer
BaseMappingTransformMixin
MappingTransformer
Base CrossColumnMapping Test Classes PR1, PR2
CrossColumnMappingTransformer PR
CrossColumnMultiplyTransfornmer PR
CrossColumnAddTransformer PR
Misc:
SetValueTransformer PR
SetColumnDtype PR
Nominal:
BaseNominalTransformer PR
NominalToIntegerTransformer Issue
GroupRareLevelsTransformer Issue
MeanResponseTransformer PR
OrdinalEncoderTransformer Issue
OneHotEncodingTransformer Issue
Numeric:
LogTransformer PR
CutTransformer #272
TwoColumnOperatorTransformer #274
ScalingTransformer #275
InteractionTransformer #276
PCATransformer #277
Strings:
SeriesStrMethodTransformer #278
StringConcatenator #279
Total complete = 26/43
@davidhopkinson26 same Q here - I think this is now fully captured by individual issues, am I okay to just close?
What? This is a ticket for homogenising and implementing the generic tests for the transformers which take columns: str or list. These makes up the majority of transformers. A list of the transformers can be read from the attached spreadsheet (see end of description).
Why? Have identified as part of scoping #138 and #68 that for usability it makes sense to keep some inconsistency in transformers columns arguments. However for majority of transformers where columns can be a string or list of column names we can create the generic testing framework discussed in #138 in place of testing a base transformer and then implicitly testing functionality by testing calls to super() methods in child classes.
Other transformers can be grouped together in another issue, made consistent and tested together separately (e.g transformers which take two column inputs, transformers which take only a list)
How? For all cases where it makes sense, ensure the argument defining which columns the transformer operates on is called 'columns' and is able to take a str or list input.
We will also include those which have columns: str, list or None. For now It seems best to remove the 'None' option as it is only included sporadically. (In future we may want to add in functionality to check column types and compatibility with transformer and run on all columns which pass these criteria for columns=None.)
Replace implementation tests in each transformer with a set of tests which loop over the identified transformers testing functionality defined in BaseTransformer.
Transformer Consistent Args Audit.xlsx