scikit-learn-contrib / skdag

A more flexible alternative to scikit-learn Pipelines
MIT License

Question: Why multiple instances of nx.DiGraph in DAG? #29

Closed galenseilis closed 10 months ago

galenseilis commented 10 months ago

I'm really excited by skdag. I was working on a similar project when I realized that it solved all the problems I had or wanted to solve.

I am currently trying to build something on top of this that has access to the underlying DAG structure. I've encountered an ambiguity that I am hoping for technical assistance with.

Suppose I begin with this example from the docs loaded in memory:

from skdag import DAGBuilder
from sklearn.compose import make_column_selector
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

dag = (
    DAGBuilder(infer_dataframe=True)
    .add_step("impute", SimpleImputer())
    .add_step("vitals", "passthrough", deps={"impute": ["age", "sex", "bmi", "bp"]})
    .add_step("blood", PCA(n_components=2, random_state=0), deps={"impute": make_column_selector("s[0-9]+")})
    .add_step("lr", LogisticRegression(random_state=0), deps=["blood", "vitals"])
    .make_dag()
)
dag.show()

I noticed that both dag.graph and dag.graph_ are stored in memory, at different addresses. They seem highly similar when inspecting the nodes and edges. Is one a reference to the other? Or is one a shallow copy of the other? Or is one a deep copy of the other? Or are they fundamentally different?

I found what might be a clue in the source:

    graph_ : :class:`networkx.DiGraph`
        A read-only view of the workflow.

Also, from the same source: "Only defined if all of the underlying root estimators in graph_ expose such an attribute when fit."
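For anyone wanting to probe the reference / shallow copy / deep copy question empirically, here is a minimal sketch using only networkx. The helper name `relationship` and the `payload` node attribute are made up for illustration and are not part of skdag. The check relies on node attributes holding mutable objects; with no attributes, or only immutable ones, a shallow and a deep copy cannot be told apart this way.

```python
import copy

import networkx as nx


def relationship(a, b):
    """Classify how graph b relates to graph a (hypothetical helper)."""
    if a is b:
        return "same object"
    if set(a.nodes) != set(b.nodes) or set(a.edges) != set(b.edges):
        return "different structure"
    # A view reuses the very same per-node attribute dicts.
    if all(a.nodes[n] is b.nodes[n] for n in a.nodes):
        return "view-like (shares attribute dicts)"
    # A shallow copy has its own dicts, but the values are the same objects.
    if all(
        a.nodes[n][k] is b.nodes[n].get(k)
        for n in a.nodes
        for k in a.nodes[n]
    ):
        return "shallow copy (shared attribute values)"
    return "independent copy (attribute values not shared)"


# Small stand-in graph; the list payloads play the role of step objects.
g = nx.DiGraph()
g.add_node("impute", payload=["estimator"])
g.add_node("lr", payload=["estimator"])
g.add_edge("impute", "lr")

results = {
    "alias": relationship(g, g),
    "view": relationship(g, g.copy(as_view=True)),
    "shallow": relationship(g, g.copy()),
    "deep": relationship(g, copy.deepcopy(g)),
}
for name, verdict in results.items():
    print(f"{name}: {verdict}")
```

With the `dag` object from the example above, the same helper could be called as `relationship(dag.graph, dag.graph_)`. The result would only describe the installed skdag version, not a guarantee of the API.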

big-o commented 10 months ago

Really it's just an implementation detail to make sure the DAG object conforms with the sklearn API. In practice though, it's always good to use graph_ if you want to make use of the graph as the DAG itself does, and graph if you want to see the original inputs that were provided to instantiate the object. At the moment there's no real difference, but it's possible that could change in the future if DAG were ever to modify the inputs in any way when you use it.
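To illustrate the sklearn convention being referred to, here is a toy sketch. It is not skdag's actual code: `ToyGraphEstimator` is a hypothetical class, and deriving `graph_` inside `fit` is an assumption made for the example. The general pattern is that constructor arguments are stored untouched under their own name (so `get_params` and `clone` work), while anything the estimator derives for its own use gets a trailing underscore.

```python
import networkx as nx
from sklearn.base import BaseEstimator, clone


class ToyGraphEstimator(BaseEstimator):
    """Hypothetical estimator that takes a graph as a constructor parameter."""

    def __init__(self, graph=None):
        # sklearn convention: store the input exactly as given, no copies,
        # no validation, so that get_params() and clone() round-trip.
        self.graph = graph

    def fit(self, X=None, y=None):
        # Derived state gets a trailing underscore. Here it is a frozen
        # (read-only) copy, so the user's original graph stays mutable.
        self.graph_ = nx.freeze(self.graph.copy())
        return self


g = nx.DiGraph([("impute", "blood"), ("impute", "vitals")])
est = ToyGraphEstimator(graph=g).fit()

print(est.graph is g)            # the constructor input is kept as-is
print(est.graph_ is g)           # the working graph is a separate object
print(nx.is_frozen(est.graph_))  # and it rejects modification

# clone() rebuilds the estimator from its constructor parameters only,
# so the derived graph_ attribute is not carried over.
fresh = clone(est)
print(hasattr(fresh, "graph_"))
```

In this sketch, mutating `est.graph` after fitting would not affect `est.graph_`, which is one way the two attributes could diverge if an estimator ever transformed its inputs.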

To be honest though, I think the whole thing needs to be revamped and simplified so I doubt there will be any changes to this current API until then, and from that point things might work very differently.