irregular-rhomboid / network-dynamics

Code and Benchmarks for various ways to define Dynamical Systems on Networks using Julia

ND.jl #1

FHell opened this issue 1 year ago

FHell commented 1 year ago

Hi, thanks for bringing this to our attention!

For homogeneous diffusion networks I would always write the RHS directly using the graph Laplacian. This is absolutely the worst use case for ND.jl: it's hard to compete with matrix multiplication.
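For concreteness, here is a minimal sketch of what that Laplacian-based RHS could look like. It assumes a blocked state layout u = [x; y] and the notebook's a_, b_, D_u_, D_v_ and Laplacian L_; note that ND.jl interleaves per-vertex states, so a direct comparison needs consistent ordering:

using LinearAlgebra

function rd_laplacian!(du, u, p, t)
    a, b, L, D_u, D_v = p
    N = size(L, 1)
    x  = view(u, 1:N);   y  = view(u, N+1:2N)
    dx = view(du, 1:N);  dy = view(du, N+1:2N)
    # Brusselator reaction terms
    @. dx = a + x^2 * y - (b + 1) * x
    @. dy = b * x - x^2 * y
    # diffusive coupling, in place: dx .-= D_u .* (L * x)
    mul!(dx, L, x, -D_u, 1.0)
    mul!(dy, L, y, -D_v, 1.0)
    nothing
end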

That said, on my machine this code here is pretty fast:

p_nd = ([a_, b_], L_, [D_u_, D_v_]) # Tuple here for type stability!!

# static edge: the coupling carries the difference of the endpoint states
@inline function rd_edge_2!(e, v_s, v_d, p, t)
    e[1] = v_s[1] - v_d[1]
    e[2] = v_s[2] - v_d[2]
    nothing
end

@inline function rd_vertex_2!(dv, v, edges, p, t)
    # reaction terms: p[1] holds the Brusselator parameters [a_, b_]
    @inline brusselator!(dv, v, p[1], t)
    # diffusive coupling: p[3] holds the diffusion constants [D_u_, D_v_]
    for e in edges
        dv[1] += p[3][1] * e[1]
        dv[2] += p[3][2] * e[2]
    end
    nothing
end

nd_rd_vertex_2 = ODEVertex(; f=rd_vertex_2!, dim=2)
nd_rd_edge_2 = StaticEdge(; f=rd_edge_2!, dim=2)

nd_rd_2! = network_dynamics(nd_rd_vertex_2, nd_rd_edge_2, g)

b_nd_2 = @benchmark nd_rd_2!(du_vec, u_vec, (p_nd, nothing), 0.0)
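For completeness, the snippet assumes the notebook already defines g, u_vec, and du_vec; a hypothetical stand-in setup would be:

using Graphs, NetworkDynamics, BenchmarkTools

N = 1000
g = watts_strogatz(N, 4, 0.1)   # placeholder graph; the notebook's g may differ
u_vec = rand(2N)                # two states per vertex, in ND.jl's ordering
du_vec = similar(u_vec)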

I tried N = 1000. I didn't have time to wait for MTK to build the equation systems, so no MTK numbers, but on my machine the timings are:

ND.jl: 23.107 μs
Dense matrix: 111.850 μs
Sparse matrix: 17.034 μs
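For reference, the two baselines could be timed the same way, reusing the rd_laplacian! sketch from above (hypothetical names):

using SparseArrays, BenchmarkTools

p_dense  = (a_, b_, Matrix(L_), D_u_, D_v_)
p_sparse = (a_, b_, sparse(L_), D_u_, D_v_)

@benchmark rd_laplacian!($du_vec, $u_vec, $p_dense, 0.0)
@benchmark rd_laplacian!($du_vec, $u_vec, $p_sparse, 0.0)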

So the ND.jl code is actually within reach of sparse matrix multiplication, which matches what I recall from our internal benchmarks. In our own examples we sometimes beat hand-rolled sparse-matrix network implementations, e.g. here:

https://github.com/PIK-ICoNe/NetworkDynamics.jl/blob/main/examples/kuramoto_inertial.jl

MTK is brilliant and might one day supersede the need for ND.jl, but if you want to be fair, you should also measure how long it takes to actually get your RHS with it. For large systems this is a concern.

It seems that your way of handling parameters caused a lot of problems for ND.jl; we need to investigate that and write tutorials. We will completely overhaul parameter handling next year anyway. Just changing the parameters to a tuple and passing them using our (p_nodes, p_edges) convention already makes the ND.jl code you wrote as fast as the dense matrix multiplication:

[benchmark results image]
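To spell out the (p_nodes, p_edges) convention: ND.jl splits the parameter argument in two and hands the first element to the vertex functions and the second to the edge functions. Mirroring the call above, that just means:

p_vertices = ([a_, b_], L_, [D_u_, D_v_])  # shared by every vertex function
p_edges    = nothing                       # the edges here take no parameters
nd_rd_2!(du_vec, u_vec, (p_vertices, p_edges), 0.0)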

I also reran your notebook with the parameters-as-a-tuple tweak and the two versions of ND.jl for N = 5 and N = 100, so we can get some MTK numbers.

N = 5

[benchmark results image]

and for N = 100

[benchmark results image]

FHell commented 1 year ago

A few follow-up notes. Optimizing this a bit further, you can see that @propagate_inbounds actually makes a difference now. This is the fastest I could get: about a factor of 2 slower than sparse matrix multiplication, which I consider an excellent result:

p_nd = (a_, b_, D_u_, D_v_, b_ + 1.) # Tuple here for type stability!! Last entry precomputes b + 1.

@inline Base.@propagate_inbounds function rd_edge_2!(e, v_s, v_d, _, t)
    e[1] = v_s[1] - v_d[1]
    e[2] = v_s[2] - v_d[2]
    nothing
end

# brusselator_x(x,y,a,b) = a + x^2 * y - b*x -x
# brusselator_y(x,y,a,b) = b*x - x^2 * y

@inline Base.@propagate_inbounds function rd_vertex_2!(dv, v, edges, p, t)
    # accumulate the diffusive coupling from all incident edges
    for e in edges
        dv[1] += e[1]
        dv[2] += e[2]
    end

    # scale by the diffusion constants D_u = p[3], D_v = p[4]
    dv[1] *= p[3]
    dv[2] *= p[4]

    # Brusselator reaction terms; p[5] = b + 1 is precomputed
    dv[1] += p[1] + v[1]^2 * v[2] - p[5]*v[1]
    dv[2] += p[2] * v[1] - v[1]^2 * v[2]

    nothing
end

nd_rd_vertex_2 = ODEVertex(; f=rd_vertex_2!, dim=2)
nd_rd_edge_2 = StaticEdge(; f=rd_edge_2!, dim=2)

nd_rd_2! = network_dynamics(nd_rd_vertex_2, nd_rd_edge_2, g)

b_nd_2 = @benchmark nd_rd_2!(du_vec, u_vec, (p_nd, nothing), 0.0)

[benchmark results image]
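As a usage note, the callable that network_dynamics returns plugs straight into the DifferentialEquations.jl stack; a minimal sketch:

using OrdinaryDiffEq

prob = ODEProblem(nd_rd_2!, u_vec, (0.0, 10.0), (p_nd, nothing))
sol = solve(prob, Tsit5())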

Some further points:

FHell commented 1 year ago

Finally:

If you notice unexpectedly bad performance in ND.jl, feel free to open an issue on our GitHub!

irregular-rhomboid commented 1 year ago

Thanks for the response! Those are all very interesting points. I definitely should have thought about the tuple thing. I'll update the post based on those comments.

irregular-rhomboid commented 1 year ago

I've updated the code and blog post based on your suggestions. I haven't implemented all of them, as I intend to return to this at a later date with a much more comprehensive set of benchmarks. I'll be sure to report on ND's repo if I find anything.

FHell commented 1 year ago

Awesome, thanks. Two things I didn't make explicit and that we should mention in the documentation:

Broadcasting inside the vertex/edge functions reduces performance for ND.jl. The reason is that internally we map the vertex and edge functions to the appropriate locations in the arrays using views. If you index these views explicitly, the compiler can optimize that down to simple index arithmetic; if you broadcast over them, it can't, for some reason.

This is really unfortunate, but we haven't found a way around it yet.
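To illustrate with a hypothetical two-dimensional edge, the first variant is the fast one inside ND.jl, the second the slow one, even though they compute the same thing:

# explicit indexing: the view access lowers to plain pointer arithmetic
@inline function edge_indexed!(e, v_s, v_d, p, t)
    e[1] = v_s[1] - v_d[1]
    e[2] = v_s[2] - v_d[2]
    nothing
end

# broadcasting over the views: semantically identical, but currently slower
@inline function edge_broadcast!(e, v_s, v_d, p, t)
    e .= v_s .- v_d
    nothing
end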

A second point: you mention that ND.jl is good for heterogeneous networks, and that's absolutely true. But we also use it for networks with complicated couplings on the edges. I linked the example with inertial Kuramoto oscillators above:

https://github.com/PIK-ICoNe/NetworkDynamics.jl/blob/main/examples/kuramoto_inertial.jl

Here we are faster than a hand-written version using sparse matrices.
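For a flavor of what such a coupling looks like, a sketch along the lines of that example (the names and exact signature there may differ):

@inline function kuramoto_edge!(e, v_s, v_d, K, t)
    # the coupling is a nonlinear function of both endpoint states,
    # so it cannot be folded into a single sparse matrix-vector product
    e[1] = K * sin(v_s[1] - v_d[1])
    nothing
end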

More context and where we are planning to go:

We have reached a point where everything is "fast enough"; the serious next step is to make everything run on GPUs. There is an experimental branch, but no serious attempt yet to get the whole package there. Your suggestion to look at LoopVectorization again is probably also a good idea, though unfortunately it's not as simple as just throwing a macro at the right loop.

If you are planning a more extensive benchmark suite for networked systems, we would be more than happy to link it directly from our repo; that would be pretty useful for keeping track of what's going on in the ecosystem!