related-sciences / nxontology-ml

Machine learning to classify ontology nodes
Apache License 2.0

Refactor existing code to use Sklearn's pipeline feature #10

Closed yonromai closed 1 year ago

yonromai commented 1 year ago

As a matter of personal preference: I find sklearn's Pipeline feature very convenient for experimenting with features and models.

This PR refactors existing features and models to comply with Sklearn's pipeline feature.

(@dhimmel: This PR is part of the text embedding code change)
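For readers unfamiliar with the pattern, here is a minimal sketch of what "complying with sklearn's Pipeline" can look like. It is not the code from this PR: `NameLengthFeature`, the `name` column, and the toy data are hypothetical, and `LogisticRegression` stands in for the project's actual model.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class NameLengthFeature(BaseEstimator, TransformerMixin):
    """Hypothetical feature builder: maps a node name to its character count."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        # Return a one-column DataFrame aligned with the input index.
        return pd.DataFrame({"name_length": X["name"].str.len()}, index=X.index)


# Toy data, made up for illustration only.
nodes = pd.DataFrame({"name": ["disease", "flu", "rare genetic disorder", "cold"]})
labels = [1, 0, 1, 0]

# Feature builder and model are chained as named steps.
pipeline = Pipeline(
    steps=[
        ("features", NameLengthFeature()),
        ("model", LogisticRegression()),
    ]
)
pipeline.fit(nodes, labels)
predictions = pipeline.predict(nodes)
```

Swapping a feature builder or the model then only means replacing one named step, which is what makes experimentation convenient.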

ravwojdyla commented 1 year ago

🔥 stuff as always!

+1 to sklearn's Pipeline!

I can see that you are splitting categorical and numerical features and storing them in separate pandas DataFrames. I don't know how important it is for you to keep them separate, but you may be interested in:

from typing import Any

from sklearn.pipeline import Pipeline

class PandasPipeline(Pipeline):
    """
    Pipeline that returns a pandas dataframe instead of a numpy array.

    NOTE: https://github.com/scikit-learn/scikit-learn/issues/25287
    """

    def __init__(self, steps: Any, *, memory: Any = None, verbose: Any = False):
        super().__init__(steps, memory=memory, verbose=verbose)
        # Configure every step to emit pandas DataFrames from transform().
        self.set_output(transform="pandas")

yonromai commented 1 year ago

Cool! I'll take a look. I mostly separate num & categorical features because CatBoost handles them separately so it's convenient to simply separate them when adding features.

The single DataFrame approach might turn out simpler & cleaner!
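One way the single-DataFrame approach could still serve CatBoost is to derive the categorical column names from dtypes rather than keeping two frames. This is a sketch under assumptions: the column names are hypothetical, and it presumes categorical features are stored with the pandas `category` dtype.

```python
import pandas as pd

# Toy feature table, made up for illustration only.
features = pd.DataFrame(
    {
        "depth": [1, 3, 2],
        "n_descendants": [120, 4, 37],
        "source": pd.Series(["mondo", "efo", "mondo"], dtype="category"),
    }
)

# Split column *names* by dtype instead of splitting the data itself.
cat_columns = features.select_dtypes(include="category").columns.tolist()
num_columns = features.select_dtypes(include="number").columns.tolist()

# cat_columns could then be handed to CatBoost through its `cat_features`
# argument (not called here, to keep the example dependency-free).
```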

yonromai commented 1 year ago

> class PandasPipeline(Pipeline):

I looked into this a little: I actually would like to use something like this ^ in the future but I might delay it a little for the following 2 reasons:

@ravwojdyla if you feel this is too hacky, I can definitely refactor the pipeline to pass a single DataFrame and union the feature builders (instead of chaining them).