niessner / Opt

Opt DSL

Cost not changing across iterations #100

Closed mihaibujanca closed 7 years ago

mihaibujanca commented 7 years ago

This is probably a mistake of mine coming from not understanding everything well enough yet, but the solver seems to just compute the cost once and then finish. Apologies if this is very specific to my issue and perhaps too open-ended.

I might be missing something very basic in my C code (most likely), but here's the output from Opt:

final cost=44.351135
--------------------------------------------------------
        Kernel        |   Count  |   Total   | Average 
----------------------+----------+-----------+----------
----------------------+----------+-----------+----------
 overall              |      1   |   23.553ms| 23.5534ms
----------------------+----------+-----------+----------
 computeCost_Graph_DataG |      4   |    0.070ms|  0.0175ms
----------------------+----------+-----------+----------
 PCGInit1_Graph_DataG |      3   |    0.097ms|  0.0322ms
----------------------+----------+-----------+----------
 PCGStep1_Graph_DataG |    600   |   12.564ms|  0.0209ms
--------------------------------------------------------
TIMING 23.553375 0.096640 12.563610 
Per-iter times ms (nonlinear,linear):  0.0966   12.5636
===Robust Mesh Deformation===
**Final Costs**
Opt GN,Opt LM,CERES
4.43511352539062500000e+01,,

Also a gist with verbosityLevel set to 2.

Here's my CombinedSolver.h.

mihaibujanca commented 7 years ago

My bad, the issue there was that I had forgotten to assign the parameters. Now there are multiple iterations, but the cost is still not changing. Attached log.

Mx7f commented 7 years ago

See the last comment here: #87

There's an open bug when using solely graph energies. Use the hack there for now; I should have it fixed this week.

mihaibujanca commented 7 years ago

Oh, I missed that one!

Still a little bit unsure of how to use this, as I need to have everything in the same domain.

This is my current code:

local G = Graph("DataG", 7,
                    "v", {N}, 8,
                    "n0", {D}, 9,
                    "n1", {D}, 10,
                    "n2", {D}, 11,
                    "n3", {D}, 12,
                    "n4", {D}, 13,
                    "n5", {D}, 14,
                    "n6", {D}, 15,
                    "n7", {D}, 16)

weightedTranslation = 0

nodes = {0,1,2,3,4,5,6,7}

for _,i in ipairs(nodes) do
    weightedTranslation = weightedTranslation + Weights(G.v)(i) * TranslationDeform(G["n"..i])
end

local cost = LiveVertices(G.v) - CanonicalVertices(G.v) + weightedTranslation
local zeroIm = ComputedImage("zero",{N}, 0.0)

Energy(zeroIm(0) * cost)

Which results in residual contains image reads from multiple domains.

Not exactly sure how to correlate the sumOfParams in the other issue with what I have.

Mx7f commented 7 years ago

The hack cost should be a separate residual term that shouldn't use graphs at all.
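
Something along these lines might work (an untested sketch, assuming TranslationDeform is your Unknown over {D} as in the snippet above):

local cost = LiveVertices(G.v) - CanonicalVertices(G.v) + weightedTranslation
-- the real graph residual, plus a zero-weighted residual that only reads
-- from the plain {D} domain (no graph accesses) to work around the bug
Energy(cost, 0.0 * TranslationDeform(0))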


mihaibujanca commented 7 years ago

Oh alright, now it seems to be working (or at least getting to a low cost; I still need to test whether the values are correct). However, lots of iterations result in 0 cost change and in breaking at iteration x, with x usually < 20. Is there any reason that might often happen?

Mx7f commented 7 years ago

breaking at iteration x happens if the linear system converges quickly. 0 cost change and reverting are a natural part of highly nonlinear solves when using Levenberg-Marquardt. If the first several nonlinear iterations all revert, you could set the initial trust_region_radius to a lower value to skip that part of the solve.

mihaibujanca commented 7 years ago

Covered by #91

mihaibujanca commented 7 years ago

Not sure if I should reopen or open a different issue for this since what happens now may or may not be related - I am now trying to test my program on a proper dataset (it worked correctly on handmade data).

For some reason the cost drops quickly but then stays at a reasonably large value (cost starts at 18376534 and drops to 190 and stays there).

I understand that this would of course be very hard to debug remotely, but any pointers would be appreciated. Worth noting that on the same data, Ceres was converging to a cost of 1e-8, and I tried to reproduce the cost function as closely as I could.

Here's the first iteration. Subsequent iterations stay at the same cost.

//////////// ITERATION0  (Opt(LM)) ///////////////
zeta=-0.000101857571280561388, breaking at iteration: 4
 cost=18376534.000000 
 model_cost=48635.292969 
 model_cost_change=18327898.000000 
zeta=-5.81109043196192943e-06, breaking at iteration: 24
 cost=48635.277344 
 model_cost=234.928070 
 model_cost_change=48400.347656 
 cost=234.928040 
 model_cost=193.840088 
 model_cost_change=41.087952 
 cost=193.840103 
 model_cost=191.666962 
 model_cost_change=2.173141 
 cost=191.666870 
 model_cost=190.530045 
 model_cost_change=1.136826 
zeta=-0.00317946262657642365, breaking at iteration: 50
 cost=190.530014 
 model_cost=190.435852 
 model_cost_change=0.094162 
zeta=-0.00204528076574206352, breaking at iteration: 40
 cost=190.435913 
 model_cost=190.403366 
 model_cost_change=0.032547 
zeta=-0.00184173777233809233, breaking at iteration: 170
 cost=190.403275 
 model_cost=190.334518 
 model_cost_change=0.068756 
 cost=190.334457 
 model_cost=190.279221 
 model_cost_change=0.055237 
zeta=-0.0014018454821780324, breaking at iteration: 100
 cost=190.279312 
 model_cost=190.268097 
 model_cost_change=0.011215 
zeta=-0.000233855316764675081, breaking at iteration: 60
 cost=190.268112 
 model_cost=190.264236 
 model_cost_change=0.003876 
 cost=190.264175 
 model_cost=190.254593 
 model_cost_change=0.009583 
zeta=-0.0033028090838342905, breaking at iteration: 50
 cost=190.254654 
 model_cost=190.252869 
 model_cost_change=0.001785 
 cost=190.252716 
 model_cost=190.251801 
 model_cost_change=0.000916 
zeta=-0.00542288646101951599, breaking at iteration: 90
 cost=190.251755 
 model_cost=190.249207 
 model_cost_change=0.002548 
zeta=-0.0132675953209400177, breaking at iteration: 92
 cost=190.249298 
 model_cost=190.249191 
 model_cost_change=0.000107 
mihaibujanca commented 7 years ago

Oh, worth mentioning that the Ceres version was using SPARSE_SCHUR; this is obviously using LM.

Mx7f commented 7 years ago

The first thing to try is using double precision instead of single precision, to see if it is a precision issue. I don't know if your example or data is sensitive, but I also don't mind fiddling with things to see what the issue is.

Warning: if your GPU is older than the Pascal generation (10XX series), Opt must use slower software implementations of double-precision floating-point atomics, so you may see a significant slowdown.

mihaibujanca commented 7 years ago

I am already using double precision so that wouldn't be the problem, but that's good to know - I only have a GeForce 960M. Single precision should in theory be good enough for what I need, so I might switch to that later.

mihaibujanca commented 7 years ago

Is there any obvious reason why using GN would have NaN cost but LM would work?

Mx7f commented 7 years ago

GN is not guaranteed to even locally converge (and often doesn't if the Jacobian is ill-conditioned). LM is.

Three sanity checks you can do: use the result of your Ceres implementation as the starting values for your Opt implementation, do the reverse, or try Ceres' LM implementation.

mihaibujanca commented 7 years ago

I'll need to look into this in more detail. I did the test with LM on Ceres and it indeed stays at a cost of the same order of magnitude (and indeed not too far off: Ceres' final cost is 165, Opt's is 189). It seems like Ceres starts at a much higher cost for some reason, but converges to 165 after 3 iterations. Opt gets to 191 after 4 iterations and then takes really small steps until reaching a value around 189.

Mx7f commented 7 years ago

Hmm, it sounds likely that there is some difference in the energy function. The costs should at least match initially (and probably at the end too).

Mx7f commented 7 years ago

Of note, the hack energy is no longer needed on the latest version of master (see #91).
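
So an energy that only reads through the graph, like the one from the earlier snippet, should now compile and solve directly without the extra zero-weight term (sketch reusing the names from above):

Energy(LiveVertices(G.v) - CanonicalVertices(G.v) + weightedTranslation)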

mihaibujanca commented 7 years ago

Yay! I can confirm it's working without the hack.