jump-dev / Convex.jl

A Julia package for disciplined convex programming
https://jump.dev/Convex.jl/stable/

improve performance of IndexAtom a little #615

Closed ericphanson closed 2 months ago

ericphanson commented 2 months ago

This makes the issues highlighted in #614 less bad. It is still suboptimal (and IMO we should address #614 directly by removing our uses of scalar indexing), but it would still be nice if indexing wasn't quite so darn slow.

The issue is that for each individual index we do a sparse matrix multiplication to extract just the rows of the underlying sparse matrix that we want. We can optimize this by indexing at that level, like A[i:i, :], but that is still pretty slow. What seems to be faster is to create an auxiliary variable to represent the output and constrain each entry, via ScalarAffineFunctions at the MOI level, to the value it is supposed to be. This essentially lowers things to MOI more quickly: at the Convex level after indexing, we only have the output variables and have dropped all the inputs, rather than holding onto them until the very end and then lowering everything to MOI.
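To illustrate the first part of the optimization (not Convex.jl internals, just a minimal sketch): extracting a row with a 1xn selector-matrix multiply does a full sparse matmul per index, whereas slicing the sparse matrix directly avoids that; both produce the same 1xn result.

```julia
using SparseArrays

# Illustrative sketch: two ways to extract row i of a sparse coefficient matrix.
A = sprand(1_000, 1_000, 0.01)
i = 7

# Status quo: build a 1 x m selector and multiply -- one sparse matmul per index.
selector = sparse([1], [i], [1.0], 1, size(A, 1))
row_via_mul = selector * A

# Cheaper: index at the sparse-matrix level directly.
row_via_slice = A[i:i, :]
```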

With this, the problem in #254 is much improved over the status quo (~190 GB of allocations, 20min+ runtime, or OOM):

 27.692581 seconds (30.11 M allocations: 1.775 GiB, 7.83% gc time)

although still not as fast as in #613. So IMO this shows that this is a good optimization that makes indexing a lot faster and less allocation-heavy, while #614 is still valuable for this codebase.

~So far in this PR, I have only addressed one of the two forms of indexing, and only for real valued problems. If we like this approach though, it should be somewhat straightforward to extend to the others.~ I did these

Also, some of the tests are failing, since introducing a new auxiliary variable changes the problem formulations. I don't see any failures that look like an implementation issue though.

ericphanson commented 2 months ago

I could use some help figuring out how to update the tests, since I am not familiar with the new test_atoms stuff.

ericphanson commented 2 months ago

I think there will be fewer tests to update once #614 is done, since there will be less indexing in various atoms.

odow commented 2 months ago

I'll take a look after we merge #614.

And yes, the variable order of the test_atoms is finicky :smile:. So moving the order in which atoms are conic_form!ed can make a difference.

odow commented 2 months ago

I'll rebase and take a look.

odow commented 2 months ago

I don't know about this one. It isn't obvious that we want to add every index operation as a scalar variable.

ericphanson commented 2 months ago

Good point, this is probably pushing work to solve time… On a technical level, when we do scalar indexing, we end up with a 1xn matrix. There are two issues: row indexing of CSC matrices is slow, and we'd kind of like to stop composing forward with future operations: do the rest of them, probably ending up with a 1x1 matrix, then compose that with our 1xn, instead of repeatedly updating our 1xn matrix with each operation.

Here, we don’t directly solve the first issue but we at least do the row indexing with minimal allocations, and we solve the second issue by starting a new variable and constraining it to be equal to the output of the old one.

To improve the row indexing thing, we could convert to another representation that has fast row indexing, but that’s super inefficient if we have to do it separately each time we index that matrix. We could probably do some caching though…
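As a sketch of the asymmetry at play (assuming nothing Convex.jl-specific): CSC storage makes column slices cheap but forces row slices to scan every column, so a materialized transpose gives a CSR-like layout with fast "row" access.

```julia
using SparseArrays

# Column slices of a SparseMatrixCSC are contiguous; row slices scan all columns.
A = sprand(10_000, 10_000, 1e-3)
At = copy(transpose(A))   # materialized transpose: columns of At are rows of A

row_slow = A[5:5, :]                # slow path for CSC storage
row_fast = transpose(At[:, 5:5])    # fast path: one column lookup on the transpose
```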

for the second, we could go back to a lazier representation. At first I tried a fully lazy representation where we keep each operation separate and only compose when we have to. That’s why it’s called a SparseTape, originally it had a vector of operations. I think it could make sense to do something in between, like only being lazy when it seems likely to be profitable. Maybe also with a linked list rather than a vector.
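A toy version of that lazy idea might look like the following. The names are illustrative, not Convex.jl's actual types: affine operations (x -> A*x + b) are kept unfused in a vector and only collapsed to a single (A, b) pair on demand.

```julia
# Hypothetical sketch of a lazy affine tape; not Convex.jl's SparseTape.
struct AffineOp
    A::Matrix{Float64}
    b::Vector{Float64}
end

struct LazyTape
    ops::Vector{AffineOp}   # applied in order: ops[1] runs first
end

# Fuse the tape into one op: A2*(A1*x + b1) + b2 == (A2*A1)*x + (A2*b1 + b2)
function collapse(t::LazyTape)
    A, b = t.ops[1].A, t.ops[1].b
    for op in t.ops[2:end]
        A = op.A * A
        b = op.A * b + op.b
    end
    return AffineOp(A, b)
end
```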

if I have time I will give it a try

ericphanson commented 2 months ago

I tried a few things without much luck. The current state now is a fairly simple change, just optimizing the indexing directly without adding any more caches or auxiliary variables etc.

It provides a ~2x speedup for indexing-heavy code and reduces allocations:

julia> using Convex

julia> import MathOptInterface as MOI

julia> let
           m = 10_000
           x = Variable(m, 1)
           constraints = []
           for i in 1:m
               push!(constraints, x[i] >= 0)
           end
           problem = minimize(sum(x), identity.(constraints))
           @time context = Convex.Context(problem, MOI.Utilities.Model{Float64})
       end;
  1.309713 seconds (621.66 k allocations: 2.269 GiB, 28.88% gc time) # master
  0.623630 seconds (431.66 k allocations: 1.513 GiB, 18.64% gc time) # PR

However, if I backport the old/slow ExpAtom definition, I still OOM on the #254 problem:

using Convex, JLD2, MathOptInterface
import MathOptInterface as MOI
file = JLD2.load(joinpath("/Users/eph/Downloads", "cvx_example.jld2"))

function Convex.new_conic_form!(context::Convex.Context{T}, e::Convex.ExpAtom) where {T}
    # exp(x) \leq z  <=>  (x,1,z) \in ExpCone
    x = e.children[1]
    m, n = size(x)
    z = Variable(m, n)
    for i in 1:m, j in 1:n
        f = vcat(x[i, j], 1, z[i, j])
        add_constraint!(context, Convex.GenericConstraint{MOI.ExponentialCone}(f))
    end
    return Convex.conic_form!(context, z)
end

let
    # m = 2000
    # Adj = file["Adj"][1:m, :]
    # N = file["N"][1:m, :, :]

    Adj = file["Adj"]
    N = file["N"]
    x = Variable(2730)
    S, V, T = size(N)
    lN = log.(N)
    M = vec(sum(N[:, :, 2:T]; dims=(2, 3)))
    H = Adj * x
    problem = maximize(-sum(logsumexp(lN[:, v, t] - H) for t = 1:T-1, v = 1:V) - dot(M, H), [x ≥ -1e2, x ≤ 1e2])

    @time context = Convex.Context(problem, MOI.Utilities.Model{Float64});
end;

My goal with this PR was to make it so that if users use indexing the way ExpAtom used to, it would be more tolerable and e.g. not OOM on #254. But the only way I've found to do that so far is with the auxiliary variables, which maybe is just pushing the issue down the line to the solver to handle. Here I still OOM. I also tried an approach with an extra cache, where we cache copy(transpose(A)) inside _index, and then use it as copy(transpose(A_transpose[:, keep_rows])) instead of A[keep_rows, :]. This provides another ~2x speedup on these benchmark problems, but it doesn't really solve the issue (the full #254 problem still OOMs) and I don't think it's worth the downsides (including potentially causing performance issues elsewhere for problems that only index a little bit, but on a lot of different large objects).
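For concreteness, the cache experiment described above could be sketched roughly like this. `_index_rows` and `_TRANSPOSE_CACHE` are hypothetical names for illustration, not the actual `_index` implementation:

```julia
using SparseArrays

# Hedged sketch: memoize a materialized transpose per matrix (keyed by object
# identity) so repeated row indexing of the same matrix pays the transpose
# cost only once.
const _TRANSPOSE_CACHE =
    IdDict{SparseMatrixCSC{Float64,Int},SparseMatrixCSC{Float64,Int}}()

function _index_rows(A::SparseMatrixCSC{Float64,Int}, keep_rows)
    At = get!(_TRANSPOSE_CACHE, A) do
        copy(transpose(A))
    end
    # Column indexing on the transpose is the fast path for CSC storage.
    return copy(transpose(At[:, keep_rows]))
end
```

The downside noted above applies: the cache trades memory for speed, which can hurt problems that index only a few rows of many large matrices.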

I also tried a semi-lazy way, but I realized/remembered the issue with laziness: we have to collapse to a single operation as soon as we want to do something like vcat or + that combines two different branches of the tree, as opposed to just extending the tree with another operation. So any amount of laziness just adds complexity and not much benefit, since we can't defer things very long.

So anyway, I can't really figure out a better approach here. I do think the auxiliary variable thing could be worth it in the end, but we would need to performance test it end-to-end to account for solve time, and that could be a different PR. I think this one is at least a strict improvement over the status quo, but indexing is still very slow.

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 97.80%. Comparing base (9b619a2) to head (0a991f1). Report is 5 commits behind head on master.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #615      +/-   ##
==========================================
- Coverage   97.88%   97.80%   -0.09%
==========================================
  Files          88       88
  Lines        5062     5095      +33
==========================================
+ Hits         4955     4983      +28
- Misses        107      112       +5
```

:umbrella: View full report in Codecov by Sentry.

odow commented 2 months ago

Let me take a look at the test failure.