niessner / Opt


Trying to assess if opt would provide a significant speedup to my project #85

Open mihaibujanca opened 7 years ago

mihaibujanca commented 7 years ago

I'm currently implementing a paper that requires an optimisation with lots of parameters, running at frame rate. I have already implemented part of the optimisation routine in Ceres and will soon need to add an ARAP regularisation term, so I'm expecting your examples will come in handy if I end up using Opt.

I was wondering

  1. What does Opt use Ceres for (evaluating Jacobians, solving)?
  2. Would it be hard to convert the Ceres code into Opt code, and could I expect a significant speedup based on this breakdown of where Ceres spends its time?
Time (in seconds):
Preprocessor                           1.0788

  Residual evaluation                  0.1961
  Jacobian evaluation                 61.3958
  Linear solver                       34.5845
Minimizer                            115.2516

Postprocessor                          0.0222
Total                                116.3526

Thank you

Mx7f commented 7 years ago

Hello! We're happy you are considering Opt!

  1. Opt does not use Ceres; Ceres is only used as a basis for comparison.
  2. The code seems fairly straightforward to port to Opt, if I understand it correctly: you are generating residuals by doing a little math on the 8 closest vertices found by a kNN search. You can encode this 9-way connectivity in 9-vertex hyperedges, store the vertex data in a CUDA buffer, and write the straightforward energy function (a rough sketch follows below this list). We saw 1-3 orders of magnitude improvement in performance over Ceres on all our test problems; none of those problems had 9-way connectivity, so you are doing something different, but if I had to bet I'd say you'd get a significant speedup.
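
For concreteness, the energy spec might look roughly like this (a minimal sketch only; X, P, and the per-edge math are placeholders, not your actual terms):

local N = Dim("N", 0)                       -- number of source vertices
local D = Dim("D", 1)                       -- number of deformation nodes
local X = Unknown("X", opt_float3, {D}, 0)  -- per-node unknowns (e.g. translation)
local P = Array("P", opt_float3, {N}, 1)    -- per-vertex data
local G = Graph("G", 2,                     -- 9-vertex hyperedge: source vertex + its 8 kNN nodes
                "v",  {N}, 3,
                "n0", {D}, 4, "n1", {D}, 5, "n2", {D}, 6,  "n3", {D}, 7,
                "n4", {D}, 8, "n5", {D}, 9, "n6", {D}, 10, "n7", {D}, 11)
Energy(P(G.v) - X(G.n0))                    -- evaluated once per hyperedge; replace with your per-edge math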

Out of curiosity, what are the unknown and residual counts on your problem?

mihaibujanca commented 7 years ago

Hi, thanks for the clarification. Below is the output from Ceres, but in terms of what will actually be estimated by the optimisation, it's 6 parameters (3 for rotation and 3 for translation) per node, for a few hundred "deformation nodes" (starting with about 300-350 and growing indefinitely over time).

The plan is to have one residual per node and estimate 6 parameters per node, hoping to achieve that at frame rate.

Solver Summary (v 1.13.0-eigen-(3.3.90)-lapack-suitesparse-(4.4.6)-cxsparse-(3.1.4)-openmp)

                                     Original                  Reduced
Parameter blocks                         4767                     4767
Parameters                              28602                    28602
Residual blocks                        108992                   108992
Residual                               326976                   326976

Minimizer                        TRUST_REGION

Sparse linear algebra library    SUITE_SPARSE
Trust region strategy     LEVENBERG_MARQUARDT

                                        Given                     Used
Linear solver                    SPARSE_SCHUR             SPARSE_SCHUR
Threads                                     8                        8
Linear solver threads                       8                        8
Linear solver ordering              AUTOMATIC                 349,4418
Schur structure                         3,6,6                    d,d,d

Cost:
Initial                          3.293113e+06
Final                            2.295137e-15
Change                           3.293113e+06

Minimizer iterations                        4
Successful steps                            4
Unsuccessful steps                          0

mihaibujanca commented 7 years ago

Any particular example I should be looking at to better understand what I have to do to implement my routine using Opt? I'm not great with CUDA, but I'm learning reasonably fast :)

Mx7f commented 7 years ago

Cotangent Mesh Smoothing (examples/cotangent_mesh_smoothing) is an example of a problem that uses hyperedges (of degree 4, representing a triangle wedge; you'll be using degree 9), so it might be useful to look at that.

CombinedSolver.h is what implements the problem-specific CPU-side code in all of the examples, but it leans heavily on code in shared/ (specifically OptImage, OptGraph, and OptUtils.h, which are all part of the helper library I made for the examples and are not part of Opt proper).

If you'd rather look at a minimal use of the C API, take a look at tests/minimal, which shows a Laplacian image example with no wrappers around the Opt and CUDA API calls. The only part that is really different for your use case is needing to copy over the hyperedge buffers as well as the unknowns.

The plan is still to have some small, fixed number "k" of unknowns contributing to each residual, right?

mihaibujanca commented 7 years ago

Thanks, will look into that and come back with further questions if needed :).

Yes, it will always be a small, fixed number of unknowns contributing to each residual, though once I have something that runs fast enough I'll try to vary the number and see how that affects the result.

mihaibujanca commented 7 years ago

Quick question - what does the usePreconditioner function do? Most examples seem to set it to true

Mx7f commented 7 years ago

It enables Jacobi preconditioning for the linear conjugate gradient solver. It can improve numerical stability, but increases per-iteration compute slightly.
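
For reference, it's just a flag near the top of the energy spec:

UsePreconditioner(true)   -- Jacobi-preconditioned CG; set to false to skip the preconditioner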

Mx7f commented 7 years ago

Is this going well? Trying to assess whether I should close this issue.

mihaibujanca commented 7 years ago

Had to take a break from this but now I'm back to it. I'd leave it open for a few more days if that's all right; I might be back with a few more questions.

mihaibujanca commented 7 years ago

Guess this has its answer now. The speedup is significant (about 2-2.5 orders of magnitude); however, the available methods don't seem to work as well for my problem so far. I'll keep fiddling with Opt to try to get it working.

Mx7f commented 7 years ago

What do you mean by available methods? The library functions? Or GN & LM? If so, what makes LM inappropriate?

mihaibujanca commented 7 years ago

Yeah, I mean GN & LM. At the moment using LM converges to a local minimum that is really far from what I previously got with Ceres, but I am looking into a few options to get it working with LM.

mihaibujanca commented 7 years ago

I'm currently looking at the robust non-rigid alignment example, since the problem is similar. Still getting through that code, though it looks a bit more complicated than the rest of the examples.

Mx7f commented 7 years ago

Your Ceres implementation should internally be using LM (or dogleg, which should probably converge to the same minimum). Is the problem that you can't express the same energy function as in Ceres for one reason or another?

mihaibujanca commented 7 years ago

Not sure about the way it works internally in Ceres (will have a look at some point), but setting options.linear_solver_type = ceres::SPARSE_SCHUR; gets me to a low cost (though it takes quite a long time, to be fair).

I'm just trying to get the same values from Opt and Ceres for LM at the moment. The energies seem fairly similar, but I'm sure there must be some difference.

Mx7f commented 7 years ago

That option sets the linear solve to an exact method instead of an iterative one. The outer algorithm is still LM (or dogleg). It is possible for LM with two different linear solvers to converge to different local minima (for highly nonlinear problems), but I don't think there is a general way to determine a priori which one will lead to a lower cost.

Is the initial cost for Opt and Ceres the same? Are you initializing the unknowns to the same values?

mihaibujanca commented 7 years ago

The initial cost is not the same. The only difference I could find so far is that in Ceres I was initialising the unknowns to 0 and then modifying the deformation nodes at the end, whereas with Opt I am just initialising with the deformation nodes.

Mx7f commented 7 years ago

"at the end" referring to before the solve, right?

The initial costs should be identical to several decimal places if the data and energies are the same (the only differences would be due to the precision of intrinsic operations and the non-associativity of floating-point operations, which will be very small).

One mistake I made while working on the initial Opt examples was only partially implementing the cost in Opt and/or Ceres (like just doing the translation term of ARAP). If your energy consists of several terms, you could enable each term one by one to see which term is causing the disparity.
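
For instance, something like this in the energy spec (a sketch only; X, Target, w_fit, w_reg and the terms themselves are made up, not your energy):

local N      = Dim("N", 0)
local w_fit  = Param("w_fit", float, 0)
local w_reg  = Param("w_reg", float, 1)
local X      = Unknown("X", opt_float3, {N}, 2)
local Target = Array("Target", opt_float3, {N}, 3)
local G      = Graph("G", 4, "v0", {N}, 5, "v1", {N}, 6)

Energy(w_fit * (X(G.v0) - Target(G.v0)))   -- data term only, compared against a Ceres run with the same single term
-- Energy(w_reg * (X(G.v0) - X(G.v1)))     -- regularisation term disabled until the data term matches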

mihaibujanca commented 7 years ago

There will be multiple terms (a "data" term using rotation and offset, and an ARAP regularisation), but right now I'm just testing with something really simple (a weighted sum of offsets) on both Ceres and Opt.

What I mean is that at the moment in Opt the unknowns are the deformation nodes (specifically just the offset part), whereas in Ceres they are 0 and I'm adding the deformation node offset to that in the energy function. It shouldn't make a difference, but I'm modifying it to match what I have in Opt just to be sure I didn't mess something up somewhere.

mihaibujanca commented 7 years ago

Guess I'll reopen this since I have some follow-ups. I've created a very simple test and my only reasonable explanation is that somehow the weights must be mismatched - still looking into this. My test is using weights that are dependent on the distance between the neighboring nodes and the source vertex, and the solution is a rigid transformation, so it should be reasonably easy to find a good solution, provided that the weights are indeed matched correctly.

The test works if all weights are set to 1.

So, just to check, am I doing this the right way?

int N = canonical_vertices.size();
m_weights = createEmptyOptImage({N}, OptImage::Type::FLOAT, KNN_NEIGHBOURS, OptImage::GPU, true);
std::vector<float[KNN_NEIGHBOURS]> weights(N);
for (int count = 0; count < N; count++)
{
    m_warp->getKNNWeights(canonical_vertices[count], weights[count]);
    for (int i = 1; i < graph_vector.size(); i++)                       // graph_vector.size() is KNN_NEIGHBOURS+1
        graph_vector[i].push_back((int)m_warp->getRetIndex()->at(i-1)); // 1..size because of how the graph is structured
}
m_weights->update(weights);
m_problemParams.set("Weights", m_weights);

local Weights = Array("Weights", opt_float8, {N}, 6) -- KNN_NEIGHBOURS is 8
local G = Graph("DataG", 7,
                    "v", {N}, 8,
                    "n0", {D}, 9,
                    "n1", {D}, 10,
                    "n2", {D}, 11,
                    "n3", {D}, 12,
                    "n4", {D}, 13,
                    "n5", {D}, 14,
                    "n6", {D}, 15,
                    "n7", {D}, 16)

weightedTranslation = {} -- Lua docs say all values would be initialised to 0
nodes = {0,1,2,3,4,5,6,7}
for _,i in ipairs(nodes) do
    weightedTranslation[G.v] = weightedTranslation[G.v] + Weights(G.v)(i) * TranslationDeform(G["n"..i])
end
Energy(LiveVertices(G.v) - CanonicalVertices(G.v) - weightedTranslation[G.v])

mihaibujanca commented 7 years ago

Again my understanding of Opt is letting me down, but shouldn't all three of these be equivalent?

-- option 1
for _,i in ipairs(nodes) do
    weightedTranslation[G.v] = weightedTranslation[G.v] + Weights(G.v)(i) * TranslationDeform(G["n"..i])
end
Energy(LiveVertices(G.v) - CanonicalVertices(G.v) - weightedTranslation[G.v])

-- option 2
Energy(LiveVertices(G.v) - CanonicalVertices(G.v))
for _,i in ipairs(nodes) do
    Energy( 0 - Weights(G.v)(i) * TranslationDeform(G["n"..i]))
end

-- option 3
for _,i in ipairs(nodes) do
    Energy( (LiveVertices(G.v) - CanonicalVertices(G.v))/8 - Weights(G.v)(i) * TranslationDeform(G["n"..i])) -- divided by the number of nodes
end

The first option results in a larger error than the desired threshold (the error is about 0.007), the second one results in 0 cost change, and the third one actually works (on the simplest test; it still fails on more complex ones). The cost is different for each as well.

mihaibujanca commented 7 years ago

Ah, figured out the above in the meantime; it should have been obvious.

Each energy is squared, so one of them is (r1+r2+...)^2 whereas the other is r1^2 + r2^2 + ...
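
In terms of the snippets above (writing r for LiveVertices(G.v) - CanonicalVertices(G.v) and w_i*t_i for each weighted translation contribution):

-- option 1: Energy(r - w0*t0 - ... - w7*t7)   contributes (r - sum_i w_i*t_i)^2 per hyperedge
-- option 2: Energy(r); Energy(0 - w_i*t_i)    contributes r^2 + sum_i (w_i*t_i)^2 per hyperedge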

Mx7f commented 7 years ago

So are your questions answered now?

There was obviously some mismatch between the behavior you expected and what actually happened, which we should be able to mitigate in the future with a combination of improved documentation/warnings/errors. What do you think would have helped the most?

mihaibujanca commented 7 years ago

Hmm, I'll need to think about it for a bit - you already mention in the documentation that the energies are squared, and that's also obvious from the cost.

The one question that is still floating around is whether or not the following is correct:

local weightedTranslation = {}
for _,i in ipairs(nodes) do
    weightedTranslation[G.v] = weightedTranslation[G.v] + TranslationDeform(G["n"..i]) * Weights(G.v)(i)
end
Energy(LiveVertices(G.v) - CanonicalVertices(G.v) - weightedTranslation[G.v])

i.e. I want to have a weighted translation for each vertex pair.

The above fails to produce the expected result - I tried testing it with a single vertex (and with multiple vertices, separately) and the cost stays reasonably high even though the solution is straightforward.

On the other hand, the following works for a single vertex, but not for multiple vertices:

local weightedTranslation = 0
for _,i in ipairs(nodes) do
    weightedTranslation = weightedTranslation + TranslationDeform(G["n"..i]) * Weights(G.v)(i)
end
Energy(LiveVertices(G.v) - CanonicalVertices(G.v) - weightedTranslation)

Mx7f commented 7 years ago

The two code snippets are essentially identical except for the variable name and the initial value (nil in the first snippet, 0 in the second). I don't actually know the behavior of adding to nil offhand, so I'd always use the second one.

I don't really have enough information to go on to help you with the multiple-vertices issue. What might help is setting Opt's verbosity to 2 on initialization, looking at the schedule for the cost function (noting that there will be one cost function per graph/stencil access domain), and seeing whether it looks plausibly like what you wanted.

mihaibujanca commented 7 years ago

I guess my real question here is whether weightedTranslation will start at 0 for each of the vertices or whether it will accumulate across vertices (in the second case). If the latter, is there a way to mitigate the issue by initialising an array to 0 for each element and computing the weightedTranslation as in the first example?

Mx7f commented 7 years ago

weightedTranslation will start at 0 for every vertex. You are essentially writing an energy template that will execute once per graph edge (or once per unknown in the case of using offset accesses instead of graphs).
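
In other words, annotating your second snippet (only the comments are new):

local weightedTranslation = 0   -- symbolic zero; the template is instantiated fresh for every hyperedge, nothing carries over
for _,i in ipairs(nodes) do
    -- this Lua loop runs once, at problem-definition time, and simply unrolls into one symbolic sum of 8 terms
    weightedTranslation = weightedTranslation + TranslationDeform(G["n"..i]) * Weights(G.v)(i)
end
-- the resulting residual is then evaluated independently for each edge of G (i.e. for each vertex v and its 8 nodes)
Energy(LiveVertices(G.v) - CanonicalVertices(G.v) - weightedTranslation)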