jeff-regier / Celeste.jl

Scalable inference for a generative model of astronomical images

performance regression according to benchmark_elbo.jl #104

Closed: jeff-regier closed this issue 8 years ago

jeff-regier commented 8 years ago

The script benchmark_elbo.jl now takes 165 seconds to complete. Earlier it took 24 seconds to compute the same result. The amount of memory allocated during its execution has also increased tenfold.

A new function named combine_sfs! alone consumes 78 seconds, so that's a good place to start investigating. The implementation of combine_sfs! looks efficient, so presumably the problem is how frequently it's called rather than anything internal to the function. (Though we should check it with @code_warntype.)

@rgiordan -- Any details you can add? Is it plausible that combine_sfs! is getting called much more often than necessary?
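
For example, something like this (the function here is a stand-in; substitute a real combine_sfs! call from the benchmark) flags type instability -- any Any in the printed types is a red flag:

julia> f(x::Vector{Float64}) = sum(x) / length(x)   # stand-in for combine_sfs!
f (generic function with 1 method)

julia> @code_warntype f(zeros(10))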

rgiordan commented 8 years ago

I don't think it's getting called unnecessarily, but there are ways that we could combine certain steps to reduce the number of calls or make certain calls more efficient. Incidentally, it is only used a handful of times, as can be confirmed by a text search.

rgiordan commented 8 years ago

PS I'll take a moment to take a stab at the first one, which is the most complicated. It shouldn't take very long.

rgiordan commented 8 years ago

Ok, calculating the variance directly probably isn't going to save much time. There is no clean way to avoid doing something much like combine_sfs! for combining the Hessian, since each term in the variance Hessian requires terms from the E_G Hessian in a way that isn't easy to include in the current loop through sources. That means another loop through the Hessians afterwards, which is exactly what combine_sfs! is doing.

rgiordan commented 8 years ago

It occurs to me, though, that julia might not be smart about multiplying by 0. If each multiplication by 0 actually performs a floating point calculation, then there are a lot of savings to be made in the variance calculation and in adding E_G.

I'm going to try that very quickly and then cut myself off -- I've already spent too much time on this for pre-quals weeks. :)
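
A sketch of the kind of saving I mean (names invented; the real loops run over sensitive float blocks):

# Invented names: skipping known-zero weights avoids the multiply and,
# more importantly, lets us skip whole blocks of derivative updates.
function accum_dense!(out::Vector{Float64}, w::Vector{Float64}, v::Vector{Float64})
    for i in eachindex(w)
        out[i] += w[i] * v[i]        # multiplies even when w[i] == 0.0
    end
end

function accum_skip_zeros!(out::Vector{Float64}, w::Vector{Float64}, v::Vector{Float64})
    for i in eachindex(w)
        w[i] == 0.0 && continue      # skip structural zeros entirely
        out[i] += w[i] * v[i]
    end
end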

jeff-regier commented 8 years ago

I observed, as you suggested, that the run time of the new code scales superlinearly in the number of sources. I tried the same test with the old code and observed linear scaling. Could any of the issues you mentioned also explain that?

This might be a long shot, but one idea: in the old code, for each tile, the derivatives only have entries for the local sources, not all the sources in the field. Is that still the case for the new code?


rgiordan commented 8 years ago

All the calculations up to E_G and var_G should be aware of the active sources in the tile, and only do the necessary calculations. However, it does appear that when combining these to get the ELBO, combine_sfs! is not aware which sources have nonzero derivatives. Again, perhaps a lot of unnecessary multiplies by zero.
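
A sketch of what I mean, with invented names (the real code combines sensitive floats, not plain matrices):

# Invented names: only touch the derivative columns of sources that are
# active in the current tile; the other columns are known to be zero.
function add_active_sources!(d_total::Matrix{Float64},
                             d_tile::Matrix{Float64},
                             active_sources::Vector{Int})
    for s in active_sources, i in 1:size(d_total, 1)
        d_total[i, s] += d_tile[i, s]
    end
end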

rgiordan commented 8 years ago

Ok, doing special-case versions of combine_sfs! for adding the elbo log term and calculating the variance does make a modest difference. I'll make a pull request once Travis passes.

Given this, I'll bet you're right that we could also get gains by only updating elbo derivatives for sources that are active in the current tile. (That can't be the whole story, though -- it's still slower even with a single source.)

rgiordan commented 8 years ago

It's in your court for a while.

rgiordan commented 8 years ago

One other thing comes to mind -- for some reason, at some point, I decided to keep all the fsXm values in a big vector (elbo_vars.fsXm_vec) rather than overwriting them as I go. That isn't necessary, and it's something that scales with the number of sources. I don't see how it could be causing a problem, but I wanted to point it out.
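
If it ever matters, a sketch of the alternative (sizes and names hypothetical):

# Hypothetical sketch: reuse one preallocated buffer per source instead
# of keeping a length-num_sources vector of buffers alive.
num_sources = 4
fsm_buffer = zeros(5)            # one buffer, reused for every source
for s in 1:num_sources
    fill!(fsm_buffer, 0.0)       # reset in place; no fresh allocation
    # ... populate fsm_buffer for source s, then consume it immediately ...
end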

jeff-regier commented 8 years ago

The last commit where benchmark_elbo.jl runs fast has the commit message "Notes", and the first commit where it's slow has the message "Setting up debugging for comparsion with the old version". There are about 20 commits in between that can't be tested easily. @rgiordan -- any suggestions for which of the commits in between are most likely to have introduced the slowdown? I could use some leads....

rgiordan commented 8 years ago

You're not going to be able to test it this way. I rewrote ElboDeriv.jl from the inside out, starting with the Hessians for the brightness and bivariate normal calculations, and then working out to the full ELBO. As soon as I started down that path, the end product, elbo_likelihood(), wouldn't run until I was finished.

Instead, maybe compare sub-components of the ELBO one at a time. That could lead to insights on its own, and will also allow for bisecting. During debugging, I actually made a script (now deleted) that had commands for all the new and old elbo sub-functions one after another. After coffee I can find it in the log.


rgiordan commented 8 years ago

after coffee => during coffee. :) In commit 86018b, take a look at test/compare_celeste_versions.jl.


rgiordan commented 8 years ago

@kpamnany @kbarbary @jeff-regier

I'm digging into the strangely large amount of memory allocation in the ELBO, and I have a fundamental julia question. I dug into which functions are allocating the most memory and put a @time macro in front of each line. (The --track-allocation flag does not seem to be working in julia 0.5.) No line stands out as allocating an unusual amount of memory, but every single line allocates at least a little -- 64 to 80 bytes or so -- and it all adds up. For example, line 494 of ElboDeriv.jl, which doesn't do much, allocates 64 bytes of memory.

There's nothing special about the ELBO in this regard. Here is something kind of surprising to me:

julia> @time x = zeros(1);
  0.000004 seconds (5 allocations: 240 bytes)
julia> @time x[1] = 5
  0.000004 seconds (4 allocations: 160 bytes)

I expect the first line to allocate memory, since I'm defining a new variable. But why does the second line allocate memory? In C, it would not. Or is this an artifact of the @time macro?

jeff-regier commented 8 years ago

It might be because x is a mutable global variable. The two function calls below use the same amount of memory, even though only one assigns to x[1].

julia> function test_local_assignment()
       x = zeros(1)
       x[1] = 5
       end
test_local_assignment (generic function with 1 method)

julia> function test_local_no_assignment()
       x = zeros(1)
       end
test_local_no_assignment (generic function with 1 method)

julia> test_local_assignment();

julia> test_local_no_assignment();

julia> @time test_local_assignment();
  0.000003 seconds (5 allocations: 240 bytes)

julia> @time test_local_no_assignment();
  0.000003 seconds (5 allocations: 240 bytes)


kbarbary commented 8 years ago

Sounds like there's a type instability. Use @code_typed or @code_warntype with the function call and make sure there are no Anys or ANYs present.

The allocation you're seeing in the last example is a result of running in the REPL, or perhaps of being in global scope (where type inference doesn't work so well). If you put @time inside a function, there's no allocation:

julia> x = zeros(1);

julia> f(x) = @time x[1] = 5
f (generic function with 1 method)

julia> f(x)
  0.000000 seconds
5

rgiordan commented 8 years ago

Cool, thank you. I used --track-allocation with julia 0.4 and indeed there is no allocation in the function for many of the lines, so you're probably right that it's an artifact of running in the REPL or in global scope.

kpamnany commented 8 years ago

The count reported by @time includes allocations made by codegen, so running any new code for the first time can show non-zero allocations.

Line 494 appears to be a reasonably complicated expression, which could be requiring some temporaries to be allocated. Kyle's suggestion to look at the typed AST is probably best to see what's going on.
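
On the first point, a quick sketch: call the function once to compile it before trusting the @time numbers.

julia> g(x) = x .+ 1.0
g (generic function with 1 method)

julia> @time g(zeros(3));    # first call: includes codegen allocations

julia> @time g(zeros(3));    # second call: steady-state allocation only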

rgiordan commented 8 years ago

All right, I'm still a little confused about what's going on. The @time macro is giving results that are inconsistent with --track-allocation.

When I run @time on elbo_likelihood, I see about 87MB of allocation.

@time elbo_lik = ElboDeriv.elbo_likelihood(tiled_blob, mp);
  1.941018 seconds (1.39 M allocations: 86.825 MB, 1.40% gc time)

(This is also true if I wrap it in a function that defines local versions of tiled_blob and mp.) When I look at the memory allocated by each of the sub-calls as measured by @time, it all adds up. For example,

julia> @time get_expected_pixel_brightness!(...args...)
  0.001719 seconds (575 allocations: 36.734 KB)
julia> 36.734 * num_pixels
84488.2
# ...36.734 KB per call times num_pixels calls is ~84,488 KB, so
#    get_expected_pixel_brightness! allocates ~84 MB of the 86 MB from elbo_likelihood
julia> @time ElboDeriv.populate_fsm_vecs!(...args...)
  0.001139 seconds (458 allocations: 24.375 KB)
julia> @time ElboDeriv.combine_pixel_sources!(...args...);
  0.001791 seconds (120 allocations: 12.438 KB)
# ...so the ~36 KB per pixel allocated by get_expected_pixel_brightness! is almost exactly
#    accounted for by the amounts allocated by populate_fsm_vecs! and combine_pixel_sources!, as expected.

However, when I drill down on these functions using memory allocation analysis, I do not see this memory being allocated. For example,

Profile.clear_malloc_data()
Profile.clear()
@profile for i=1:50000
  ElboDeriv.populate_fsm_vecs!(
    elbo_vars, mp, tile_sources, tile, h, w, sbs, gal_mcs, star_mcs)
end

...then the .mem files indicate zero memory being allocated. (Yes, the .mem files are being updated correctly -- I can tell from the timestamps.) Specifically:

using Coverage
res = analyze_malloc("src");
sum([r.bytes for r in res]) / 1e6

...gives 0. I don't know what's going on. Any ideas?

rgiordan commented 8 years ago

One more mystery. With the help of the (very useful) @code_warntype, I've managed to get rid of every line of memory allocation in combine_pixel_sources! and populate_fsm_vecs! as measured by --track-allocation, except for one: the first line of add_sources_sf!, which simply adds the values of the sensitive floats:

sf_all.v[1] = sf_all.v[1] + sf_s.v[1]

No matter what I do, this line allocates a ton of memory. If I eliminate the line (which is, of course, wrong for the model), the memory allocation simply moves to the next line. Furthermore, the time-based profiling with @profile indicates that essentially no time is getting spent on that line of code -- it's all in the actual floating point arithmetic, which makes sense.

I am starting to suspect that this excess memory allocation is a red herring when it comes to the performance discrepancy between the new and old versions.

rgiordan commented 8 years ago

BTW, I just noticed that I have been using --track-allocation=all rather than --track-allocation=user, which seems to be giving different results than above. So maybe that's the cause of the mystery...

rgiordan commented 8 years ago

Also relevant: https://github.com/JuliaLang/julia/issues/11753

rgiordan commented 8 years ago

I'm still seeing some very strange things. The julia docs seem pretty clear -- --track-allocation=all should report strictly more memory allocation than --track-allocation=user, which it demonstrably does not.

Also, the memory allocation results show large amounts of allocation on a line that cannot be allocating memory. It is the first line of a function that gets called a lot (not from the REPL -- from within another function) but that only updates an element of a pre-allocated vector. Furthermore, the following line, which is structurally identical, shows no allocation.

Does anyone see what I might be doing wrong? If not, I'm about ready to file a bug.

From the .mem file:

        - function eval_bvn_pdf_in_place!{NumType <: Number}(
        -     elbo_vars::ElboIntermediateVariables{NumType},
        -     bmc::BvnComponent{NumType}, x::Vector{Float64})
        - 
  1120000   elbo_vars.py1[1] =
        -     bmc.precision[1,1] * (x[1] - bmc.the_mean[1]) +
        -     bmc.precision[1,2] * (x[2] - bmc.the_mean[2])
        0   elbo_vars.py2[1] =
        -     bmc.precision[2,1] * (x[1] - bmc.the_mean[1]) +
        -     bmc.precision[2,2] * (x[2] - bmc.the_mean[2])
        0   elbo_vars.f_pre[1] =
        -     bmc.z * exp(-0.5 * ((x[1] - bmc.the_mean[1]) * elbo_vars.py1[1] +
        -                         (x[2] - bmc.the_mean[2]) * elbo_vars.py2[1]))
        - 
        0   true # return type
        - end

jeff-regier commented 8 years ago

I haven't used the memory profiler extensively, maybe @kbarbary or @kpamnany will weigh in?

Once you've ruled out type issues with @code_warntype, you might consider making a really simple test case, like a 1-pixel image with 1 light source, and comparing its runtime to the version tagged "0.0.1".

kbarbary commented 8 years ago

Sounds like you're just hitting the bug you linked, with --track-allocation misattributing allocations?

If you push what you've done so far, I would be interested in taking a look in detail.

rgiordan commented 8 years ago

Ok, after a painstaking manual search I have tracked down what is going on -- assigning variables implicitly in arguments, including setting keyword arguments (!), has its memory allocation attributed to the line of the first actual computation executed after the function is called. In this case that line was further down the stack than the function with the implicitly defined arguments.

Whether this is a bug is arguable, but it certainly was confusing.
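
A minimal sketch of the pattern (all names invented): the bytes allocated by the keyword-argument call get charged to the first line executed inside the callee.

# Invented names: the kwarg handling in outer! allocates, but
# --track-allocation attributes the bytes to the first line of inner!.
function inner!(out::Vector{Float64}, x::Float64)
    out[1] = x        # the caller's kwarg allocation shows up here
    return nothing
end

function outer!(out::Vector{Float64}; scale::Float64=2.0)
    inner!(out, scale)
end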

rgiordan commented 8 years ago

By the way, I have probability theory until about 3pm, then I'll work on this until I leave for MX this evening. I've pushed to hessian_performance_julia4, though I think I now have a handle on what's going on.

rgiordan commented 8 years ago

Done in https://github.com/jeff-regier/Celeste.jl/pull/124