scikit-learn-contrib / skdag

A more flexible alternative to scikit-learn Pipelines
MIT License

How to get the internally-computed node outputs to be part of the final output? #30

Closed galenseilis closed 10 months ago

galenseilis commented 10 months ago

I am trying to understand how to get skdag to return all the computed columns when the predict method is called.

Here is an example from the documentation:

from skdag import DAGBuilder
from sklearn.compose import make_column_selector
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

dag = (
    DAGBuilder(infer_dataframe=True)
    .add_step(
        "impute",
        SimpleImputer()
        )
    .add_step(
        "vitals",
        "passthrough",
        deps={"impute": ["age", "sex", "bmi", "bp"]}
        )
    .add_step(
        "blood",
        PCA(n_components=2, random_state=0),
        deps={"impute": make_column_selector("s[0-9]+")}
        )
    .add_step(
        "lr",
        LogisticRegression(random_state=0),
        deps=["blood", "vitals"]
        )
    .make_dag()
)
dag.show()

from sklearn import datasets
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
dag.fit_predict(X, y)

I tried just sticking an identity function on the end to collect the results, but it didn't work. I do not understand how things get passed along internally.

from skdag import DAGBuilder
from sklearn.compose import make_column_selector
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import FunctionTransformer

dag = (
    DAGBuilder(infer_dataframe=True)
    .add_step(
        "impute",
        SimpleImputer()
        )
    .add_step(
        "vitals",
        "passthrough",
        deps={"impute": ["age", "sex", "bmi", "bp"]}
        )
    .add_step(
        "blood",
        PCA(n_components=2, random_state=0),
        deps={"impute": make_column_selector("s[0-9]+")}
        )
    .add_step(
        "lr",
        LogisticRegression(random_state=0),
        deps=["blood", "vitals"]
        )
    .add_step(
        "out",
        FunctionTransformer(lambda x: x),
        deps=["inpute", "blood", "vitals", "lr"]
        )
    .make_dag()
)
dag.show()

from sklearn import datasets
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
dag.fit_predict(X, y)

Here is the traceback I got. It suggested some kind of "inconsistency" in what I have coded.

Traceback (most recent call last):
  File "/usr/lib/python3.10/idlelib/run.py", line 578, in runcode
    exec(code, self.locals)
  File "/home/galen/Dropbox/bin/try_skdag.py", line 9, in <module>
    DAGBuilder(infer_dataframe=True)
  File "/home/galen/.local/lib/python3.10/site-packages/skdag/dag/_builder.py", line 120, in add_step
    self._validate_deps(deps)
  File "/home/galen/.local/lib/python3.10/site-packages/skdag/dag/_builder.py", line 158, in _validate_deps
    raise ValueError(f"unresolvable dependencies: {', '.join(sorted(missing))}")
ValueError: unresolvable dependencies: inpute

big-o commented 10 months ago

Looks like a typo in your step names: inpute vs impute.

The output from your dag will be the output of the final step, or if there are multiple endpoints then it will be a dict of step names to step outputs.
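
To make that rule concrete, here is a toy sketch in plain Python. It is not skdag's implementation: `find_leaves`, `collect_output`, and the placeholder output strings are hypothetical names made up for illustration, and the dependency graph only mirrors the step names from the example above.

```python
# Toy model of the output rule described above; NOT skdag's actual code.
# All function names here are hypothetical.

def find_leaves(deps):
    """Return the steps that no other step depends on, in insertion order."""
    used = {dep for step_deps in deps.values() for dep in step_deps}
    return [name for name in deps if name not in used]


def collect_output(deps, step_outputs):
    """Return the single leaf's output, or a dict of outputs for several leaves."""
    leaves = find_leaves(deps)
    if len(leaves) == 1:
        return step_outputs[leaves[0]]
    return {name: step_outputs[name] for name in leaves}


# Dependency graph mirroring the documentation example.
deps = {
    "impute": [],
    "vitals": ["impute"],
    "blood": ["impute"],
    "lr": ["blood", "vitals"],
}
# Placeholder strings stand in for the real arrays or DataFrames.
outputs = {name: f"<output of {name}>" for name in deps}

# Only "lr" is a leaf, so the result is that step's output alone.
single = collect_output(deps, outputs)

# Without "lr", both "vitals" and "blood" are leaves, so the result is a dict.
deps_without_lr = {name: d for name, d in deps.items() if name != "lr"}
multi = collect_output(deps_without_lr, outputs)
```

In this toy model, intermediate steps such as "impute" never appear in the result because other steps depend on them, which is why the original example only returns the predictions from "lr".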

Will this give you what you need? I'm not sure I fully understand what you're trying to achieve.

galenseilis commented 10 months ago

Ah, thank you for spotting the typo.

With the typo corrected, I get the following traceback:

Traceback (most recent call last):
  File "/usr/lib/python3.10/idlelib/run.py", line 578, in runcode
    exec(code, self.locals)
  File "/home/galen/skdag_test.py", line 41, in <module>
    dag.fit_predict(X, y)
  File "/home/galen/.local/lib/python3.10/site-packages/sklearn/utils/_available_if.py", line 31, in __get__
    if not self.check(obj):
  File "/home/galen/.local/lib/python3.10/site-packages/skdag/dag/_dag.py", line 103, in check_leaves
    raise AttributeError(
AttributeError: <class 'sklearn.preprocessing._function_transformer.FunctionTransformer'> object(s) has no attribute 'predict'

I guess I could subclass FunctionTransformer to have a predict method.
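
A minimal sketch of that idea, assuming the only thing the leaf check needs is a `predict` attribute on the final step. The class name `PredictingFunctionTransformer` is made up here, and I have not confirmed that skdag then passes the upstream outputs through the way I want.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


class PredictingFunctionTransformer(FunctionTransformer):
    """FunctionTransformer that also exposes predict (hypothetical helper)."""

    def predict(self, X):
        # Delegate to transform, so predict applies the same function.
        return self.transform(X)


# With func=None (the default) FunctionTransformer acts as the identity,
# so no lambda is needed.
collector = PredictingFunctionTransformer()

# Standalone check on a small array, outside of any DAG.
X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
result = collector.fit(X_demo).predict(X_demo)
```

This could then replace `FunctionTransformer(lambda x: x)` in the "out" step; whether the combined output has the shape I am after is something I would still need to test.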

At this early stage I am just trying to figure out more about how the package works.