oxfordcontrol / ClarabelDocs

Documentation for the Clarabel interior point conic solver

Tips on collecting multiple cones #15

Open · ghorn opened this issue 10 months ago

ghorn commented 10 months ago

I was wondering if it would be more efficient to have many small cones or few big cones. We are code-generating MPC problems for control, so we have the opportunity to pre-process the problem at no additional runtime cost.

I think I found the answer in the Clarabel.jl issue tracker https://github.com/oxfordcontrol/Clarabel.jl/issues/86. My understanding is that preprocessing the problem into as few cones as possible would reduce function call overhead from the solver so it should be faster - is that correct? Would it affect the numerical properties in any way, such as significantly changing the factorization? Is the answer the same for Rust and Julia?

I'm asking on the docs issue tracker because I first looked at the docs before finding the answer in the issue trackers. I suggest adding a "tips and tricks" or "frequently asked questions" section in the docs for information such as this.

goulart-paul commented 10 months ago

> I was wondering if it would be more efficient to have many small cones or few big cones. We are code-generating MPC problems for control, so we have the opportunity to pre-process the problem at no additional runtime cost.

It might make a slight difference, particularly in the case that there are a lot of small cones. I would imagine that for an MPC problem initially ordered stagewise in time, with the dynamics (i.e. state-space / equality constraints) interleaved with inequality constraints, you could end up with such a case. I would guess that it would only matter for a very long horizon problem though, with hundreds of cones minimum.
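To make the "lots of small cones" case concrete, here is a minimal sketch in the Rust API. The dimension names (`n_stages`, `nx`, `ncon`) are made up for illustration; the cone variants follow the documented `use clarabel::solver::*;` examples, though exact module paths may vary between crate versions. A stagewise ordering like this ends up with `2 * n_stages` small cones:

```rust
use clarabel::solver::*; // brings SupportedConeT and its variants into scope, as in the documented examples

// Hypothetical MPC dimensions (illustrative only): nx equality rows (dynamics)
// and ncon inequality rows (bounds) per stage, over n_stages stages.
fn stagewise_cones(n_stages: usize, nx: usize, ncon: usize) -> Vec<SupportedConeT<f64>> {
    let mut cones = Vec::with_capacity(2 * n_stages);
    for _ in 0..n_stages {
        cones.push(ZeroConeT(nx));          // dynamics / equality rows for this stage
        cones.push(NonnegativeConeT(ncon)); // inequality rows for this stage
    }
    cones // 2 * n_stages small cones, interleaved in time order
}
```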

> I think I found the answer in the Clarabel.jl issue tracker oxfordcontrol/Clarabel.jl#86. My understanding is that preprocessing the problem into as few cones as possible would reduce function call overhead from the solver so it should be faster - is that correct?

If you merge all of your dynamics/equalities into a single zero cone, and then all inequalities into a single nonnegative cone, then you should end up with lower function call overhead. A lot of those are very small functions though, so I can't exclude the possibility of aggressive compiler inlining relieving some of that.
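For comparison, a minimal sketch of the merged layout, again in Rust. The function and argument names are hypothetical; the sketch assumes the code generator has already stacked all equality rows of $A$ above all inequality rows, with $b$ ordered to match, since the cone list is interpreted against consecutive rows of $A$. Only the cone variants and the `DefaultSolver` call follow the documented Rust examples, and the exact constructor signature may differ between crate versions:

```rust
use clarabel::algebra::*;
use clarabel::solver::*;

// Hypothetical helper: solve min 0.5 x'Px + q'x  s.t.  Ax + s = b,  s in K,
// where the rows of A are grouped as [A_eq; A_ineq] and b as [b_eq; b_ineq].
// m_eq / m_ineq are the equality / inequality row counts.
fn solve_merged(
    p: &CscMatrix<f64>,
    q: &[f64],
    a: &CscMatrix<f64>, // equality rows stacked above inequality rows
    b: &[f64],          // ordered to match the rows of `a`
    m_eq: usize,
    m_ineq: usize,
) {
    // Two cone objects in total, regardless of horizon length:
    // one zero cone covering every equality row, one nonnegative
    // cone covering every inequality row.
    let cones = [ZeroConeT(m_eq), NonnegativeConeT(m_ineq)];

    let settings = DefaultSettings::default();
    let mut solver = DefaultSolver::new(p, q, a, b, &cones, settings);
    solver.solve();
}
```

Either layout describes the same problem; the difference is only in how many cone objects the solver loops over internally.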

However, the code will also internally permute the KKT system to produce sparser factors. For an MPC problem that will put it back into roughly the "natural" stage-wise ordering, because that makes the KKT system banded. So you will only have two sets of function calls, one each for the zero and nonnegative cones, but the elements of the cones those calls operate on will be scattered in the KKT system. That memory access pattern might be a bit worse than having more cones whose memory accesses are more nearly contiguous.

If speed is really an issue then it's also possible in principle to customize some of the internal solver pieces, particularly the KKT solver component, to exploit the structure of your problem. This might help in particular if your state-space matrices are dense, for example. It's also possible in an MPC problem to get the complexity of the linear system factorisation down from $\mathcal{O}(n)$ to $\mathcal{O}(\log_2(n))$ if parallelization is a possibility for you.
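As a rough sketch of where the $\log_2$ comes from (one standard route is block cyclic reduction of the banded KKT system; this is only an illustrative reading, not a statement of exactly which method is intended): eliminating every second stage of the block-tridiagonal system can be done in parallel and halves the number of remaining stages,

$$ n \;\rightarrow\; \tfrac{n}{2} \;\rightarrow\; \tfrac{n}{4} \;\rightarrow\; \cdots \;\rightarrow\; 1, $$

so the factorisation takes roughly $\log_2(n)$ sequential rounds; the total work stays $\mathcal{O}(n)$, but the critical path is $\mathcal{O}(\log_2(n))$ given enough parallel workers.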

> Would it affect the numerical properties in any way, such as significantly changing the factorization?

No, because either way the code is going to reorder your system to make for the best (i.e. sparsest) factorisation anyway, so no matter how you do it you will end up factoring a similar looking system. I would expect that both the iterates and the per-iteration cost will be exactly the same (NB: I haven't tested to confirm).

> Is the answer the same for Rust and Julia?

All of the above should hold for both Julia and Rust, with two exceptions:

> I'm asking on the docs issue tracker because I first looked at the docs before finding the answer in the issue trackers. I suggest adding a "tips and tricks" or "frequently asked questions" section in the docs for information such as this.

Sure, we can consider adding something like that. Would be interested to hear whether this makes much of a performance difference for you though.