Refactor existing code to use Sklearn's pipeline feature

yonromai commented 1 year ago

As a matter of personal preference: I find sklearn's Pipeline feature very convenient for experimenting with features and models.

This PR refactors existing features and models to comply with Sklearn's pipeline feature.

(@dhimmel: This PR is part of the text embedding code change)

ravwojdyla commented 1 year ago

🔥 stuff as always!

+1 to sklearn's Pipeline!

I can see that you are splitting categorical and numerical features and storing them in separate pandas DataFrames. I don't know how important it is for you to keep them separate, but you may be interested in:

from typing import Any

from sklearn.pipeline import Pipeline

class PandasPipeline(Pipeline):
    """
    Pipeline that returns a pandas dataframe instead of a numpy array.

    NOTE: https://github.com/scikit-learn/scikit-learn/issues/25287
    """

    def __init__(self, steps: Any, *, memory: Any = None, verbose: Any = False):
        super().__init__(steps, memory=memory, verbose=verbose)
        self.set_output(transform="pandas")

yonromai commented 1 year ago

Cool! I'll take a look. I mostly separate num & categorical features because CatBoost handles them separately so it's convenient to simply separate them when adding features.

The single DataFrame approach might turn out simpler & cleaner!
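
A rough sketch of how the single DataFrame approach could still feed CatBoost: keep one DataFrame through the pipeline and split it by dtype only at the very end. The helper name split_num_cat and the column names are hypothetical, and catboost itself is deliberately not imported here.

```python
from typing import Tuple

import numpy as np
import pandas as pd


def split_num_cat(features: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
    """Split one feature frame into numerical and categorical arrays."""
    num_df = features.select_dtypes(include="number")
    # Everything that is not numerical is treated as categorical here.
    cat_df = features.drop(columns=num_df.columns)
    num_data = num_df.to_numpy(dtype=np.float32)
    cat_data = cat_df.astype(str).to_numpy(dtype=object)
    return num_data, cat_data


# Hypothetical features for three nodes.
features = pd.DataFrame(
    {
        "depth": [1, 2, 3],
        "n_children": [0.0, 2.0, 5.0],
        "prefix": ["DOID", "EFO", "MONDO"],
    }
)
num_data, cat_data = split_num_cat(features)
print(num_data.shape, cat_data.shape)
```

The two arrays are the kind of input one would then hand to the model-specific container in a final step.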

yonromai commented 1 year ago

class PandasPipeline(Pipeline):

I looked into this a little. I would like to use something like this ^ in the future, but I might delay it a little, for the following 2 reasons:

The pipeline takes a List[str] as input (the ids of the graph nodes) and ends with a catboost.FeaturesData (code), which is the input to the CatBoost model.
For the sake of simplicity, I pass not only the 2 dataframes but also the original node inputs (code) to all the steps of the pipeline. It is slightly hacky, but it allows each feature builder to pick what it needs from the nodes (and possibly from previously computed features). I think a pipeline that relies only on a union of dataframe-based features will eventually be cleaner.
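
To make the chained approach concrete, here is a simplified sketch, not the actual code in this PR: a small container carries the node ids and both dataframes through every step of an sklearn Pipeline. The names NodeFeatures, PrepareNodeFeatures, IdLengthFeature and PrefixFeature are hypothetical, as are the two toy features.

```python
from dataclasses import dataclass
from typing import List

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


@dataclass
class NodeFeatures:
    """Hypothetical state passed from step to step."""
    nodes: List[str]
    num_features: pd.DataFrame
    cat_features: pd.DataFrame


class PrepareNodeFeatures(BaseEstimator, TransformerMixin):
    """First step: wrap the raw node ids with two empty feature frames."""
    def fit(self, X, y=None):
        return self

    def transform(self, X: List[str]) -> NodeFeatures:
        index = pd.Index(X, name="node_id")
        return NodeFeatures(list(X), pd.DataFrame(index=index), pd.DataFrame(index=index))


class IdLengthFeature(BaseEstimator, TransformerMixin):
    """Adds a toy numerical feature computed from the node ids."""
    def fit(self, X, y=None):
        return self

    def transform(self, X: NodeFeatures) -> NodeFeatures:
        num = X.num_features.assign(id_length=[len(n) for n in X.nodes])
        return NodeFeatures(X.nodes, num, X.cat_features)


class PrefixFeature(BaseEstimator, TransformerMixin):
    """Adds a toy categorical feature computed from the node ids."""
    def fit(self, X, y=None):
        return self

    def transform(self, X: NodeFeatures) -> NodeFeatures:
        cat = X.cat_features.assign(prefix=[n.split(":")[0] for n in X.nodes])
        return NodeFeatures(X.nodes, X.num_features, cat)


pipeline = Pipeline(
    [("prepare", PrepareNodeFeatures()), ("id_length", IdLengthFeature()), ("prefix", PrefixFeature())]
)
out = pipeline.fit_transform(["DOID:1234", "EFO:1"])
print(out.num_features)
print(out.cat_features)
```

Every step sees the nodes and all previously computed features, at the cost of a custom object flowing through the pipeline instead of a plain DataFrame.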

@ravwojdyla if you feel this is too hacky, I can definitely refactor the pipeline to pass a single dataframe and union the feature builders (instead of chaining them).
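
A possible shape for that refactor, sketched with plain pandas and under the assumption that each feature builder only needs the node ids: every builder returns a DataFrame indexed by node id, and the union is a column-wise concat. The builder names and features are hypothetical.

```python
from typing import Callable, List

import pandas as pd

# A feature builder maps node ids to a DataFrame indexed by node id.
FeatureBuilder = Callable[[List[str]], pd.DataFrame]


def id_length_features(nodes: List[str]) -> pd.DataFrame:
    index = pd.Index(nodes, name="node_id")
    return pd.DataFrame({"id_length": [len(n) for n in nodes]}, index=index)


def prefix_features(nodes: List[str]) -> pd.DataFrame:
    index = pd.Index(nodes, name="node_id")
    return pd.DataFrame({"prefix": [n.split(":")[0] for n in nodes]}, index=index)


def union_features(nodes: List[str], builders: List[FeatureBuilder]) -> pd.DataFrame:
    """Run each builder independently and join the results column-wise."""
    return pd.concat([build(nodes) for build in builders], axis=1)


features = union_features(["DOID:1234", "EFO:1"], [id_length_features, prefix_features])
print(features)
```

Builders no longer depend on each other's output, so they can be added, removed or reordered freely; the trade-off is that a feature derived from another feature needs its own composition step.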

related-sciences / nxontology-ml

Refactor existing code to use Sklearn's pipeline feature #10