koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License

`HierarchicalPredictor` and `HierarchicalTransformer` #620

Closed FBruzzesi closed 8 months ago

FBruzzesi commented 9 months ago

Description

As noted in #616 and discussed on private channels, the GroupedPredictor class has its shortcomings.

I am opening this issue to keep track of how we could expand on what we learned. Here is a list of features that, in my opinion, HierarchicalPredictor and HierarchicalTransformer should have:

  1. Given an input group, all levels should be fitted, including a global model
    • Example: groups = ["a", "b"], groups to fit are: ["global"], ["global", "a"], ["global", "a", "b"]
    • This implies that the order now always matters, not only when shrinkage is not None.
    • This is quite opinionated, but also guarantees that 3.1 (see later) always works.
  2. Shrinkage available for both regression and classification tasks (tbd how it could be extended to outlier detection)
  3. Fallback strategy to be either "next" or "raise", meaning:
    • "next": if a group value is not found at prediction/transformation time, it falls back to the first available group in the hierarchy.
    • "raise": if a group value is not found at prediction/transformation time, an error is raised.
    • ~Disclaimer: better value instead of "next" can be found 😂~ what about "parent"?
  4. Allow extra params in fit (e.g. sample_weight), which is currently not possible.
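To make points 1 and 3 concrete, here is a minimal sketch of the hierarchy expansion and the "next"/"raise" fallback lookup. The names `expand_hierarchy`, `resolve_group` and the `GLOBAL` sentinel are purely illustrative, not the scikit-lego API:

```python
# Hypothetical sketch of hierarchy expansion (point 1) and fallback
# resolution (point 3). Names are illustrative, not scikit-lego's API.

GLOBAL = "__global__"

def expand_hierarchy(groups):
    """["a", "b"] -> [(GLOBAL,), (GLOBAL, "a"), (GLOBAL, "a", "b")]"""
    levels = [(GLOBAL,)]
    for i in range(1, len(groups) + 1):
        levels.append((GLOBAL, *groups[:i]))
    return levels

def resolve_group(key, fitted_keys, fallback="next"):
    """Walk up the hierarchy until a fitted group is found."""
    candidate = (GLOBAL, *key)
    while candidate not in fitted_keys:
        if fallback == "raise" or len(candidate) == 1:
            raise KeyError(f"No fitted model for group {key}")
        candidate = candidate[:-1]  # fall back to the parent level
    return candidate
```

With `fallback="next"` an unseen value like `("a", "zzz")` would resolve to the `("__global__", "a")` model; with `fallback="raise"` it would error instead.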

A first draft of the implementation was developed in #618; however, keeping the current behavior of GroupedPredictor while also expanding its functionality was becoming a headache to both develop and maintain. This is why an implementation from scratch makes sense.
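Point 2's shrinkage, for a regression task, amounts to blending each level's prediction into a convex combination; the same idea would apply to a classifier's `predict_proba`. A minimal sketch, with illustrative names and a hand-picked weight scheme:

```python
# Minimal sketch of shrinkage across hierarchy levels (point 2).
# `shrink_predictions` and the weighting are illustrative assumptions,
# not scikit-lego's API.
import numpy as np

def shrink_predictions(level_preds, weights):
    """Weighted average of per-level predictions (global -> most specific)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise so the blend is convex
    return sum(w * np.asarray(p) for w, p in zip(weights, level_preds))
```

For classification the blend would be applied to each class's probability column, which keeps rows summing to one as long as the weights are normalised.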

Edit

A few considerations regarding the hypothetical HierarchicalTransformer:

koaning commented 9 months ago

> Shrinkage doesn't really make sense, or at least not for all transformers. In fact the current GroupedTransformer doesn't support it.

Agree!

> Should the group columns be returned as they are from the .transform(X) operation?

It feels like it would be fine to default to not returning the group columns, mainly because users could select them separately in the pipeline if they were so inclined. Might make for a good parameter though.

> Should the .transform(X) output maintain the input type? Hence if pandas dataframe maintain the index as well?

I would prefer not to rely on anything pandas specific with the advent of possible polars support. Is there a use-case you had in mind where that's required?

FBruzzesi commented 9 months ago

Ok, these will definitely be (at least) two separate PRs, one for the predictor and one for the transformer (the predictor one could land late this week 😁)

> Might make for a good parameter though

"passthrough" vs "drop" seems suitable
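A rough sketch of what a "passthrough" vs "drop" parameter could mean for the group columns; the function and parameter names here are hypothetical, not a committed API:

```python
# Illustrative sketch of "passthrough" vs "drop" for the group columns
# in transform. `transform_groups` and `group_cols` are hypothetical names.
import pandas as pd

def transform_groups(X, groups, group_cols="drop"):
    """Return the non-group features, optionally keeping the group columns."""
    features = X.drop(columns=groups)  # the actual transform would act on these
    if group_cols == "passthrough":
        return pd.concat([X[groups], features], axis=1)
    return features
```

With the "drop" default a user who still wants the group columns downstream can select them separately in the pipeline, which matches the behavior discussed above.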

> Is there a use-case you had in mind where that's required?

Absolutely not. I have started to rely on indexes as little as possible, but since pandas and polars input/output are supported by scikit-learn, I am wondering what behavior one should expect.