mschauer / CausalInference.jl

Causal inference, graphical models and structure learning in Julia
https://mschauer.github.io/CausalInference.jl/latest/
Other
190 stars 24 forks source link

cpdag(g::DiGraph) #83

Closed mschauer closed 1 year ago

mschauer commented 1 year ago

Check if iteration of

for w in inneighbors(g, x)
       label[w => x] == compelled || continue

instead of only iterating over the compelled incoming edges of x could lead to worse asymptotic run-time than O(n+m). @mwien

https://github.com/mschauer/CausalInference.jl/blob/57c716fd5ce84137efb8dd147c698dd5646e91e7/src/cpdag.jl#L110C1-L110C1

mwien commented 1 year ago

It appears this issue is not as bad as I first thought.

In principle, the problem occurs if we have a vertex $x$ with a lot of non-compelled incoming edges and a lot of children $y$ (say each with in-degree 1, that is only having $x$ as parent).

Assume $x$ has $k$ non-compelled incoming edges and there are $l$ children of $x$. Then, we have a cost of $k \cdot l$ and only $k + l$ edges. Thus, its not clear the run-time is bounded by the number of edges anymore.

The thing is to get $k$ non-compelled incoming edges, those $k$ parent vertices have to form a clique. So we get another $O(k^2)$ edges and the above worst-case of doesn't seem to occur.

However, if you choose $k = O(\sqrt{l})$, we have $k\cdot k + k + l = O(l)$ edges, but run-time $k\cdot l = O(l \cdot \sqrt{l})$. So, the run-time can be $O(m \cdot \sqrt{m})$ for some instances, but I guess it can't be much worse than that.

function gen_graph(k)
    g = SimpleDiGraph(k*k + k + 1)
    for i = 1:k, j = (i+1):k
        add_edge!(g, i, j)
    end
    for i = 1:k
        add_edge!(g, i, k+1)
    end
    for i = 1:k*k
        add_edge!(g, k+1, i+k+1)
    end
    return g
end

Called for $k=1000$ it seems noticably slower than an implementation, which only iterates over compelled parents (e.g. alt_cpdag(g) here: https://github.com/mwien/CausalInference.jl/blob/idalgorithms/src/cpdag.jl).

There is another catch though: the real problem is topological_sort_by_dfs from the Julia Graphs Package. Unfortunately, the run-time of this function can be $O(n^2)$ even for sparse graphs. So I had to first replace this with a quick implementation of my own (https://github.com/mwien/CausalInference.jl/commit/e5ee3b033add48b32207236be857e32260a79a30).

mschauer commented 1 year ago

Is this a problem with topological_sort_by_dfs or our use case, in other words should we open an issue on Graphs.jl?

mwien commented 1 year ago

We should open an issue. I'm trying to pinpoint the exact problem at the moment.

function gen_graph_dfs(k)
    g = SimpleDiGraph(2*k+1)
    for i = 1:k
        add_edge!(g, i, 2*k+1)
        add_edge!(g, 2*k+1, k+i)
    end
    return g
end

This is an even simpler example. E.g. if you run with k = 10^5 (that will give a 2*10^5+1 node graph) and then call topological_sort_by_dfs it already takes a few seconds.

mwien commented 1 year ago

I think I got it now. This is the simplest possible example I think is:

function gen_graph_dfs(k)
    g = SimpleDiGraph(k+1)
    for i = 1:k
        add_edge!(g, 1, i+1)
    end
    return g
end

The problem is that this loop (https://github.com/JuliaGraphs/Graphs.jl/blob/e5405d105e034ccb9396d30f86c26641ce94c504/src/traversals/dfs.jl#L97) goes through the neighbors of u over and over everytime starting at the first.

So it will find n with vcolor[n] == 0 and break, then (with some further steps vcolor[n] will be set to 2). Afterwards, again the search over the neighbors of u starts at the first one until it finds one with vcolor 0 and breaks etc.

Hence, for a vertex with $k$ neighbors its $k^2$ and hence in total it can be $O(m^2)$ like for the graph above...

mwien commented 1 year ago

So, in summary topological_sort_by_dfs can be $O(m^2)$, whereas the function cpdag is "only" $O(m\cdot\sqrt{m})$ in the worst-case (assuming an $O(m)$ implementation of topological sort).

mschauer commented 1 year ago

Do you want to open the issue or should I do that?

mwien commented 1 year ago

I'm not so experienced, would be glad to get some help or have you open the issue directly. What label should the issue have? "bug" or something else?

mschauer commented 1 year ago

Write something "Possible issue with topological_sort_by_dfs" and explain what you saw and then they can figure out if

Possible issue with topological_sort_by_dfs

Looks like the topological_sort_by_dfs can have a run-time O(n*n) even for sparse graphs. This is the simplest possible example is

function gen_graph_dfs(k)
    g = SimpleDiGraph(k+1)
    for i = 1:k
        add_edge!(g, 1, i+1)
    end
    return g
end

The problem is that this loop (https://github.com/JuliaGraphs/Graphs.jl/blob/e5405d105e034ccb9396d30f86c26641ce94c504/src/traversals/dfs.jl#L97) goes through the neighbors of u over and over everytime starting at the first.

So it will find n with vcolor[n] == 0 and break, then (with some further steps vcolor[n] will be set to 2). Afterwards, again the search over the neighbors of u starts at the first one until it finds one with vcolor 0 and breaks etc.

Hence, for a vertex with $k$ neighbors its $k^2$ and hence in total it can be $O(m^2)$ like for the graph above...

mwien commented 1 year ago

Thanks, I submitted here: https://github.com/JuliaGraphs/Graphs.jl/issues/263

mschauer commented 1 year ago

Appreciated!