scikit-learn-contrib / skdag

A more flexible alternative to scikit-learn Pipelines
MIT License

GridSearch and skdag #32

Open TonciG opened 6 months ago

TonciG commented 6 months ago

Hi,

First of all, I think your library is a great add-on to sklearn, especially since it addresses limitations of Pipeline.

Having said that, I tried to use skdag with sklearn's GridSearchCV but ran into a problem. I am using one of the examples from the library docs (https://skdag.readthedocs.io/en/latest/quick_start.html) to grid search for optimal hyperparameter values. To your code I only add the following:

    from sklearn.model_selection import GridSearchCV

    params = {'blood__n_components': [1, 2, 3, 4]}
    grid = GridSearchCV(estimator=dag2, param_grid=params, scoring='accuracy')
    grid.fit(X_train, y_train)

However, when I try to fit the model, I get the following error:

    ValueError: Found input variables with inconsistent numbers of samples: [61, 2]

I would really appreciate it if you could tell me what is going on here.

Regards, Tonci

big-o commented 2 months ago

Hi, can you share your full code? There's no X_train in the quick start guide, so it's hard to recreate the issue from what you've included here.

TonciG commented 2 months ago

Hi, thank you for your response. Below is the full code:

    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from skdag import DAGBuilder
    from sklearn.decomposition import PCA
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression

    X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    dag = (
        DAGBuilder(infer_dataframe=True)
        .add_step("impute", SimpleImputer())
        .add_step("vitals", "passthrough", deps={"impute": ["age", "sex", "bmi", "bp"]})
        .add_step("blood", PCA(n_components=2, random_state=0), deps={"impute": slice(4, 10)})
        .add_step("lr", LogisticRegression(random_state=0), deps=["blood", "vitals"])
        .make_dag()
    )

    from sklearn.ensemble import RandomForestClassifier

    cal = DAGBuilder(infer_dataframe=True).from_pipeline(
        [("rf", RandomForestClassifier(random_state=0))]
    ).make_dag()
    dag2 = dag.join(cal, edges=[("blood", "rf"), ("vitals", "rf")])

    y_pred = dag2.fit_predict(X_train, y_train)
    type(y_pred)

    from sklearn.model_selection import GridSearchCV

    params = {'blood__n_components': [1, 2, 3, 4]}
    grid = GridSearchCV(estimator=dag2, param_grid=params, scoring='accuracy')
    grid.fit(X_train, y_train)

Regards, Tonci
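
For anyone trying to read the numbers in the error message, here is a minimal sketch that uses scikit-learn only, not skdag. It shows that a default 5-fold split of the 309 training rows produces a validation fold of 61 rows, and that scikit-learn's `check_consistent_length` helper produces the same message when it compares 61 true values against something of length 2. The object of length 2 is a hypothetical stand-in: the sketch assumes, without confirmation from this thread, that the joined DAG's prediction has one entry per leaf step (`lr` and `rf`).

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.utils import check_consistent_length

# The diabetes dataset has 442 rows; test_size=0.3 leaves 309 for training.
n_train = 309

# GridSearchCV defaults to 5 folds when cv is not given. Plain KFold is
# assumed here, which is what applies to a continuous target.
dummy_X = np.zeros((n_train, 1))
fold_sizes = [len(test_idx) for _, test_idx in KFold(n_splits=5).split(dummy_X)]
print(fold_sizes)

# Hypothetical stand-ins: 61 true values from one validation fold, and a
# prediction-like object whose length is 2 (one entry per leaf step).
y_true = np.zeros(61)
y_pred_like = ["lr", "rf"]

try:
    check_consistent_length(y_true, y_pred_like)
except ValueError as exc:
    message = str(exc)
    print(message)
```

If that assumption holds, one way to check it would be to inspect `len(y_pred)` and the type printed by `type(y_pred)` in the code above before handing the DAG to a scorer that expects a single array of predictions.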