microsoft / NimbusML

Python machine learning package providing simple interoperability between ML.NET and scikit-learn components.

CV creates incorrect split of user defined transforms. #409

Open pieths opened 4 years ago

pieths commented 4 years ago

When split_start='after_transforms' is passed to CV.fit(), the user-defined transforms are split at the wrong location. See the graph created by the fit() call in the code below.

It seems that if a user-defined transform has presteps, the split location ends up in the wrong place. This might also affect splitting the transforms when split_start is given as an integer value.

from nimbusml import DataSchema, FileDataStream
from nimbusml.datasets import get_dataset
from nimbusml.ensemble import LightGbmRegressor
from nimbusml.model_selection import CV
from nimbusml.preprocessing.missing_values import Indicator, Handler

path = get_dataset("airquality").as_filepath()
schema = DataSchema.read_schema(path)
data = FileDataStream(path, schema)

pipeline_steps = [
    Indicator() << {
        'Ozone_ind': 'Ozone',
        'Solar_R_ind': 'Solar_R'},
    Handler(
        replace_with='Mean') << {
        'Solar_R': 'Solar_R',
        'Ozone': 'Ozone'},
    LightGbmRegressor(
        feature=['Ozone',
                 'Solar_R',
                 'Ozone_ind',
                 'Solar_R_ind',
                 'Temp'],
        label='Wind')]

cv_results = CV(pipeline_steps).fit(data, split_start='after_transforms')

pieths commented 4 years ago

Commit d5c7c828ef820d681e2cf5e38568177200cb3b3c resolves the issue for split_start='after_transforms', but it does not fix the case where the user specifies an integer index as the split_start value.

When a transform has presteps, the integer index the user specified will not correspond to the index of that transform in the pipeline.
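
To illustrate the index mismatch, here is a minimal, self-contained sketch. It does not use NimbusML internals: the `Step` class, the `expand` and `expanded_index` helpers, and the `Handler_prestep` node are all hypothetical stand-ins, and which real transforms carry presteps is an assumption made purely for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """Hypothetical stand-in for a user-defined pipeline step."""
    name: str
    presteps: List[str] = field(default_factory=list)


def expand(steps: List[Step]) -> List[str]:
    """Flatten user steps into a node list, placing presteps before their step."""
    nodes = []
    for step in steps:
        nodes.extend(step.presteps)
        nodes.append(step.name)
    return nodes


def expanded_index(steps: List[Step], user_index: int) -> int:
    """Map a user-facing step index to the index where that step's nodes
    begin in the expanded list (its presteps, if any, come first)."""
    return sum(len(s.presteps) + 1 for s in steps[:user_index])


# Assumption: only the second step has a prestep.
steps = [
    Step("Indicator"),
    Step("Handler", presteps=["Handler_prestep"]),
    Step("LightGbmRegressor"),
]

nodes = expand(steps)
split_start = 2  # the user means "split before LightGbmRegressor"

naive = nodes[split_start]                             # index applied directly
corrected = nodes[expanded_index(steps, split_start)]  # index remapped first

print(nodes)
print("naive split before:", naive)
print("corrected split before:", corrected)
```

In this sketch the naive lookup lands on `Handler` instead of `LightGbmRegressor`, because the prestep shifts every later node by one position. A fix along these lines would translate the user's index into the expanded node list before choosing the split point.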