Improving performace of static trace simulation

deoxyribose commented 3 years ago

Hello,

Thank you for writing this amazing PPL! I'm trying to implement a 2d version of the non-parametric change-point model from the Introduction to Modeling tutorial. The idea is to sample a tree structure that recursively divides an image into rectangles, and then sample pixel values for the whole image given the structure.

The tree generating function is fast enough:

@time trace = simulate(generate_segments, (Rectangle(Pixel(1,1),Pixel(648,1152)),))
  0.007940 seconds (80 allocations: 6.109 KiB)

However, the image sampling one is slow and uses a lot of memory:

@time trace = simulate(screen_model_dynamic, (size(tmp),))
  2.435734 seconds (13.55 M allocations: 459.870 MiB, 46.93% gc time)

@gen function screen_model_dynamic(size::Tuple{Int64,Int64})
    nrows, ncols = size
    screenshot = Array{Float64}(undef, nrows, ncols)
    node = @trace(generate_segments(Rectangle(Pixel(1,1),Pixel(nrows,ncols))))
    noise = @trace(gamma(1, 1), :noise)
    for i in 1:nrows
        for j in 1:ncols
            screenshot[i,j] = @trace(normal(get_value_at(i, j, node),noise), (:img, i, j))
        end
    end
    return screenshot
end

Hoping to speed it up and also make inference more efficient down the line, I tried to use the Map combinator to denote the independence of the pixels given the structure, and use the static modeling language:

@gen function pixel_kernel(i::Int64, j::Int64, tree::Node, noise::Float64)
    pixel = @trace(normal(get_value_at(i, j, tree),noise), :pix)
    return pixel
end 

function get_index_matrices(matrix::Array{Float64})
    indices = CartesianIndices(matrix)
    is = convert(Array{Int64}, getindex.(indices,1))
    js = convert(Array{Int64}, getindex.(indices,2))
    return is, js
end

@gen (static) function screen_model(size::Tuple{Int64,Int64})
    screenshot = Array{Float64}(undef, size)
    nrows,ncols = size
    tree = @trace(generate_segments(Rectangle(Pixel(1,1),Pixel(nrows,ncols))), :tree)
    noise = @trace(gamma(1, 1), :noise)
    is, js = get_index_matrices(screenshot)
    screenshot = @trace(Map(pixel_kernel)(is,js,fill(tree, size), fill(noise, size)), :img)
    return reshape(screenshot,size)
end

This turns out to require more time and memory, not less:

@time trace = simulate(screen_model, (size(tmp),))
  4.636526 seconds (33.38 M allocations: 2.418 GiB, 44.33% gc time)

Profiling shows that a lot of time is spent on garbage collection and dynamic dispatch. I'm new to both Gen and Julia and I don't really know how to improve those. Perhaps it's related to issue #173? Hopefully I'm just doing something wrong :) Here's the full code.

Any help or comments are much appreciated!

georgematheos commented 3 years ago

Hi @deoxyribose, thanks for sharing! As you're noticing, there is still work to do on improving performance in the static DSL. I remember that currently the Map combinator is somewhat inefficient, and usually allocates more memory than it needs to and requires dynamic type inference. Fixing this is a TODO on my mind; however, I would still have expected it to beat the dynamic DSL here.

One thought that may help: do things improve if you make pixel_kernel static as well?

If not, if you share a profile output I can take a look at what's taking the time, and see if anything occurs to me to improve this.

ztangent commented 3 years ago

This sounds like a really neat application of Gen! If you're not planning on constraining individual pixel values, but only the the entire image at one go, another way of writing screen_model_dynamic would be to simultaneously sample all pixel values from a multivariate normal distribution, and then reshape the resulting vector. This might reduce the overhead associated with having a large trace structure with thousands of individual random choices, because instead you'll have a trace with just two random choices: one for the tree, and am vector valued random choice for the image.

deoxyribose commented 3 years ago

Thanks for the answers @georgematheos and @ztangent.

Making pixel_kernel static helps quite a bit:

@time trace = simulate(screen_model, (size(tmp),))
  2.841029 seconds (6.95 M allocations: 1.809 GiB, 62.85% gc time)

Output from Profile.print(format=:tree) here and Profile.print(format=:flat, sortedby=:count) here

Nice idea with sampling the whole image from a multivariate normal! I tried it like this:

@gen function screen_model_dynamic2(size::Tuple{Int64,Int64})
    nrows, ncols = size
    screenshot = Array{Float64}(undef, nrows, ncols)
    tree = @trace(generate_segments(Rectangle(Pixel(1,1),Pixel(nrows,ncols))), :tree)
    noise = @trace(gamma(1, 1), :noise)
    is,js = get_index_matrices(screenshot)
    mean_img = map((i,j) -> get_value_at(i,j,tree), is, js)
    cov_mat = sparse(1:sizeof(mean_img),1:sizeof(mean_img),fill(noise, sizeof(mean_img)))
    screenshot = @trace(mvnormal([mean_img...],cov_mat), :img)
    return reshape(screenshot,size)
end

I was hoping to avoid memory issues by making the covmat sparse, but I guess mvnormal makes it dense, so I get an OutOfMemoryError. Maybe writing an mvnormal for scaled identity covmats could salvage the approach? Thanks again :)

ztangent commented 3 years ago

Oops! Maybe try broadcasted_normal, which does exactly what you asked! Not well documented unfortunately:

https://github.com/probcomp/Gen.jl/blob/master/src/modeling_library/distributions/normal.jl

deoxyribose commented 3 years ago

Fantastic, that works great!

@time trace = simulate(screen_model_dynamic2, (size(tmp),))
  0.282431 seconds (4.89 M allocations: 222.771 MiB, 11.58% gc time)

Thanks a lot!

probcomp / Gen.jl

Improving performace of static trace simulation #388