cossio closed this issue 5 months ago.
@ararslan I'm now seeing the memory spike in the solve phase instead of the formulation. I'll close this unless something similar comes up again. I'll open an issue at ECOS or SCS.
Okay, thanks for reporting back. Did you see the spike only with a particular solver? If so, it would be good to know which.
@ararslan I just tested with ECOS and SCS, both consume > 4GB RAM (at which point I cancelled to avoid having to reboot my laptop again)
If it's both of them, that suggests to me that the issue may still be Convex's fault. Have you by chance done any profiling using the Profile stdlib package?
@ararslan Problem is my laptop dies before I can complete the profiling. Any suggestions?
Oh, that makes sense. Hm, I don't know what to suggest offhand... I wonder if maybe the work could be put into a Task, then after a certain amount of time, send an interrupt to the task. Not sure if that would work though.
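The Task idea might look something like this (a sketch; `run_with_timeout` is a hypothetical helper, and whether the interrupt actually lands depends on the workload hitting a yield point):

```julia
# Sketch: run a workload in a Task and raise an InterruptException in it
# if a time limit elapses first. Note: a pure-compute workload that never
# yields may not receive the interrupt until it hits a yield point.
function run_with_timeout(f, limit_seconds)
    t = Task(f)
    schedule(t)
    timer = Timer(limit_seconds) do _
        istaskdone(t) || schedule(t, InterruptException(); error=true)
    end
    try
        return fetch(t)   # returns the result, or rethrows the interrupt
    finally
        close(timer)
    end
end
```

The intended use here would be wrapping the solve call, e.g. `run_with_timeout(() -> solve!(problem), 60)`.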
I just tried limiting to 10GB ram at a server and it also crashes. Both SCS and ECOS.
I don't know how to do the Task
thing, and I'm really out of ideas on how to diagnose what's going on.
In case it's really an issue with Convex I'll reopen this. Maybe someone else will think of something. But feel free to close it if you think so.
@ararslan I started julia with a memory constraint of 10 GB using:
timeout -m 10000000 julia --track-allocation=user
This tracked memory allocations across all of Convex.jl as the RAM usage increased, until it was killed at 10GB. The .mem files are at https://github.com/cossio/cvx_example_data/blob/master/convex_allocations.tar.gz.
I ran this example with SCS. The .mem files of SCS.jl are at https://github.com/cossio/cvx_example_data/blob/master/SCS_allocation_analysis.tar.gz
Let me know if that helps, or if I can do anything else.
BTW, is there a utility to explore a directory tree like this with many .mem files to find the highest allocs?
Coverage can do this, actually.
julia> using Coverage
julia> mem = analyze_malloc(".");
julia> mem[end-10:end]
11-element Array{Coverage.MallocInfo,1}:
Coverage.MallocInfo(9785470, "./variable.jl.mem", 22)
Coverage.MallocInfo(10224350, "./constant.jl.mem", 45)
Coverage.MallocInfo(14402371, "./constant.jl.mem", 54)
Coverage.MallocInfo(15131666, "./atoms/affine/add_subtract.jl.mem", 124)
Coverage.MallocInfo(22798259, "./atoms/affine/add_subtract.jl.mem", 87)
Coverage.MallocInfo(23872808, "./utilities/show.jl.mem", 20)
Coverage.MallocInfo(31561657, "./constant.jl.mem", 22)
Coverage.MallocInfo(47125096, "./expressions.jl.mem", 65)
Coverage.MallocInfo(120645411, "./conic_form.jl.mem", 112)
Coverage.MallocInfo(145535180, "./expressions.jl.mem", 106)
Coverage.MallocInfo(916530186, "./solution.jl.mem", 12)
Looks like the source of the allocations in solution.jl is:
- function solve!(problem::Problem,
- s::MathProgBase.AbstractMathProgSolver;
- kwargs...)
- # TODO: warn about wiping out old model if warmstart=true?
916530186 problem.model = MathProgBase.ConicModel(s)
0 return solve!(problem; kwargs...)
- end
@ararslan Thanks for the suggestion, Coverage helps.
I inspected the .mem files for MathProgBase (https://github.com/cossio/cvx_example_data/blob/master/mpb_allocation.tar.gz) and they show no allocations. Then I realized the solvers implement their own versions of ConicModel(solver).
In ECOS.jl there is this line that stands out. Seems like this is an issue with the solvers after all.
- function ver()
100391444 ver_ptr = ccall((:ECOS_ver, ECOS.ecos), Ptr{UInt8}, ())
- return unsafe_string(ver_ptr)
- end
It is interesting though; this function is supposed to just return the ECOS version?
Hm, okay, that's bizarre. You said you encountered the same problem with SCS? Do you have memory allocations for that?
@ararslan Yes, https://github.com/cossio/cvx_example_data/blob/master/scs_allocation.tar.gz.
julia> mem[end-10:end]
11-element Array{Coverage.MallocInfo,1}:
Coverage.MallocInfo(0, "./MPBWrapper.jl.mem", 387)
Coverage.MallocInfo(0, "./MPBWrapper.jl.mem", 388)
Coverage.MallocInfo(0, "./MPBWrapper.jl.mem", 389)
Coverage.MallocInfo(0, "./MPBWrapper.jl.mem", 397)
Coverage.MallocInfo(0, "./SCS.jl.mem", 22)
Coverage.MallocInfo(0, "./SCS.jl.mem", 23)
Coverage.MallocInfo(1248, "./MPBWrapper.jl.mem", 36)
Coverage.MallocInfo(1440, "./SCS.jl.mem", 21)
Coverage.MallocInfo(129136, "./MPBWrapper.jl.mem", 31)
Coverage.MallocInfo(4402990, "./SCS.jl.mem", 20)
Coverage.MallocInfo(79146051, "./c_wrapper.jl.mem", 124)
Again a version call:
- function SCS_version()
79146051 return unsafe_string(ccall((:scs_version, SCS.direct), Cstring, ()))
- end
Does julia --track-allocation count the allocations made by ccalls?
I'm not sure, actually, though it would appear so. Are these version checks actually called somewhere during the solving process?
@ararslan I think these solvers print a version banner before starting the iterations. But I'm not sure if that's what we are seeing.
If I'm not mistaken, it looks like printing banners is done by the C code itself rather than by the Julia wrappers, so I don't think that would be reflected in the allocations tracked by Julia for the version querying Julia function.
If you're trying to isolate Convex.jl's processing from the solver, you can call conic_problem to generate the input without solving, e.g., https://github.com/JuliaOpt/ConicBenchmarkUtilities.jl/blob/4267bfedeace99fcc4a03256eef8d44151f38b95/src/convex_to_cbf.jl#L4.
Thanks @mlubin! I'm seeing the allocations again, just from the call to conic_problem
. So we are back to this not being an issue with the solvers. Here is what I got.
timeout.pl -m 10000000 julia --track-allocation=user
julia> include("cvx_issue.jl")
julia> con = Convex.conic_problem(problem) # huge memory alloc
where cvx_issue.jl
is this file:
using Convex, FileIO, JLD2, ECOS, SparseArrays, LinearAlgebra
@load "cvx_example.jld2"
x = Variable(2730)
S, V, T = size(N)
lN = log.(N)
M = vec(sum(N[:,:,2:T]; dims=(2,3)))
H = Adj * x
problem = maximize(-sum(logsumexp(lN[:,v,t] - H) for t = 1:T-1, v = 1:V) - dot(M, H), [x ≥ -1e2, x ≤ 1e2])
and cvx_example.jld2
is here: https://github.com/cossio/cvx_example_data.
Inspecting the resulting .mem files from all my packages (using Coverage) gives this:
julia> analyze_malloc(".")[end-20:end]
21-element Array{Coverage.MallocInfo,1}:
Coverage.MallocInfo(13571105, "./JLD2/KjBIK/src/groups.jl.mem", 92)
Coverage.MallocInfo(15278217, "./Convex/MLj0c/src/atoms/affine/add_subtract.jl.mem", 124)
Coverage.MallocInfo(15407963, "./Convex/MLj0c/src/constant.jl.mem", 54)
Coverage.MallocInfo(15721748, "./Convex/MLj0c/src/constant.jl.mem", 22)
Coverage.MallocInfo(15917995, "./JLD2/KjBIK/src/JLD2.jl.mem", 334)
Coverage.MallocInfo(19283630, "./JLD2/KjBIK/src/loadsave.jl.mem", 2)
Coverage.MallocInfo(20910662, "./JLD2/KjBIK/src/JLD2.jl.mem", 352)
Coverage.MallocInfo(22778265, "./JLD2/KjBIK/src/data.jl.mem", 70)
Coverage.MallocInfo(24140413, "./Convex/MLj0c/src/constraints/constraints.jl.mem", 143)
Coverage.MallocInfo(26433226, "./Convex/MLj0c/src/atoms/affine/add_subtract.jl.mem", 87)
Coverage.MallocInfo(43264136, "./Compat/HVYNa/src/Compat.jl.mem", 310)
Coverage.MallocInfo(46994104, "./Convex/MLj0c/src/expressions.jl.mem", 65)
Coverage.MallocInfo(61731328, "./JLD2/KjBIK/src/data.jl.mem", 252)
Coverage.MallocInfo(79002811, "./ECOS/e0Lr6/src/ECOS.jl.mem", 27)
Coverage.MallocInfo(88962353, "./JLD2/KjBIK/src/JLD2.jl.mem", 399)
Coverage.MallocInfo(95562654, "./JLD2/KjBIK/src/loadsave.jl.mem", 57)
Coverage.MallocInfo(107061059, "./JLD2/KjBIK/src/JLD2.jl.mem", 288)
Coverage.MallocInfo(133353998, "./Convex/MLj0c/src/conic_form.jl.mem", 112)
Coverage.MallocInfo(151016207, "./Convex/MLj0c/src/expressions.jl.mem", 106)
Coverage.MallocInfo(301982888, "./Convex/MLj0c/src/expressions.jl.mem", 120)
Coverage.MallocInfo(351788459, "./JLD2/KjBIK/src/data.jl.mem", 7)
I ignore JLD2 since that's just loading the data (I presume). Here are the most interesting lines from expressions.jl:
- ### User-defined Unions
- const Value = Union{Number, AbstractArray}
- const ValueOrNothing = Union{Value, Nothing}
- const AbstractExprOrValue = Union{AbstractExpr, Value}
-
151016207 convert(::Type{AbstractExpr}, x::Value) = Constant(x)
- convert(::Type{AbstractExpr}, x::AbstractExpr) = x
-
- function size(x::AbstractExpr, dim::Integer)
- if dim < 1
- error("dimension out of range")
- elseif dim > 2
- return 1
- else
- return size(x)[dim]
- end
- end
-
- ndims(x::AbstractExpr) = 2
301982888 get_vectorized_size(x::AbstractExpr) = reduce(*, size(x))
- lastindex(x::AbstractExpr) = get_vectorized_size(x)
and from conic_form.jl:
- function has_conic_form(conic_forms::UniqueConicForms, exp::AbstractExpr)
133353998 return haskey(conic_forms.exp_map, (exp.head, exp.id_hash))
- end
I'm not familiar enough with this code to figure out what the issue might be. @ararslan any ideas?
The top ten allocations just from Convex.jl
:
julia> analyze_malloc("./Convex/")[end-10:end]
11-element Array{Coverage.MallocInfo,1}:
Coverage.MallocInfo(9620766, "./Convex/MLj0c/src/variable.jl.mem", 22)
Coverage.MallocInfo(10386046, "./Convex/MLj0c/src/utilities/show.jl.mem", 20)
Coverage.MallocInfo(15278217, "./Convex/MLj0c/src/atoms/affine/add_subtract.jl.mem", 124)
Coverage.MallocInfo(15407963, "./Convex/MLj0c/src/constant.jl.mem", 54)
Coverage.MallocInfo(15721748, "./Convex/MLj0c/src/constant.jl.mem", 22)
Coverage.MallocInfo(24140413, "./Convex/MLj0c/src/constraints/constraints.jl.mem", 143)
Coverage.MallocInfo(26433226, "./Convex/MLj0c/src/atoms/affine/add_subtract.jl.mem", 87)
Coverage.MallocInfo(46994104, "./Convex/MLj0c/src/expressions.jl.mem", 65)
Coverage.MallocInfo(133353998, "./Convex/MLj0c/src/conic_form.jl.mem", 112)
Coverage.MallocInfo(151016207, "./Convex/MLj0c/src/expressions.jl.mem", 106)
Coverage.MallocInfo(301982888, "./Convex/MLj0c/src/expressions.jl.mem", 120)
So this is interesting:
- function has_conic_form(conic_forms::UniqueConicForms, exp::AbstractExpr)
133353998 return haskey(conic_forms.exp_map, (exp.head, exp.id_hash))
- end
Here conic_forms.exp_map
is an
OrderedDict{Tuple{Symbol,UInt64},
OrderedDict{UInt64,Tuple{Union{AbstractArray,Number},
Union{AbstractArray,Number}}}}
I would not be surprised if haskey
on this type is horrendously inefficient.
@ararslan The fact that the value type of the OrderedDict is not concrete (Union{AbstractArray,Number}) may not be ideal. Can this type be more specific? (Just throwing a random idea out there.)
Unfortunately not without a huge undertaking to restructure the way Convex uses and expresses types. For reference, Value
in Convex is actually Union{AbstractArray,Number}
. (As an aside, because of this, Convex actually overwrites a lot of Base methods, since it defines methods on Value
...)
"I would not be surprised if haskey on this type is horrendously inefficient."
I can imagine it being slow but it absolutely should not be allocating much memory. I recommend benchmarking the functions identified here to check if they are actually problems.
Remember this is total memory allocated, so functions that are called a lot will report higher numbers.
These are also bytes allocated, so 301982888
(the highest number) is ~300 MB, which is not massive.
I'm suspicious that compilation memory usage may be in play.
@iamed2 It seemed strange to me that my system reported more than 10 GB of allocations, but the .mem files were reporting lower values. So julia --track-allocation=user misses all the allocations made by the JIT compiler? If that's so, what can I do to measure compiler allocs?
I'm saying the opposite; JIT allocations are included so you can't draw conclusions about the code in a function from the allocations number unless you filter them out.
From the docs:
In interpreting the results, there are a few important details. Under the user setting, the first line of any function directly called from the REPL will exhibit allocation due to events that happen in the REPL code itself. More significantly, JIT-compilation also adds to allocation counts, because much of Julia's compiler is written in Julia (and compilation usually requires memory allocation). The recommended procedure is to force compilation by executing all the commands you want to analyze, then call Profile.clear_malloc_data() to reset all allocation counters. Finally, execute the desired commands and quit Julia to trigger the generation of the .mem files.
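Concretely, the procedure from the docs looks something like this (a sketch; the placeholder workload stands in for the include("cvx_issue.jl") and conic_problem calls above, and julia must be started with --track-allocation=user):

```julia
# Sketch of the recommended --track-allocation workflow: run everything
# once to pay compilation cost, reset the counters, run again, then quit
# julia so the .mem files get written.
using Profile

workload() = sum(rand(1000) for _ in 1:100)  # placeholder for the real calls

workload()                   # first run: triggers JIT compilation
Profile.clear_malloc_data()  # reset the per-line allocation counters
workload()                   # measured run; exit julia to emit .mem files
```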
I see, I misunderstood then. But this still does not explain why the system reports > 10 GB of allocations.
I ran the code with @profile
and interrupted it after a bit, then leafed through the results. Anecdotally it's spending a fair amount of time on sparse matrix operations, but I'm not sure that's a primary offender.
This bug protects itself. It kills your computer before showing itself. I tried Adj = Matrix(Adj)
(which should avoid sparse operations), but got the same huge allocations.
I tried that as well, but I'll also note that Convex uses sparse matrices in a number of places regardless of whether any particular input is sparse:
$ sift -e sparse src
src/variable.jl:77: return sparse(1.0I, vec_size, vec_size)
src/variable.jl:83: return im*sparse(1.0I, vec_size, vec_size)
src/conic_form.jl:13:# we store the affine functions in this form for efficient manipulation of sparse affine functions
src/atoms/affine/diagm.jl:64: coeff = sparse(I, J, 1.0, sz * sz, sz)
src/atoms/affine/transpose.jl:63: transpose_matrix = sparse(I, J, 1.0)
src/atoms/affine/transpose.jl:126: transpose_matrix = sparse(I, J, 1.0)
src/atoms/affine/partialtrace.jl:47: a = sparse(1.0I, 1, 1)
src/atoms/affine/partialtrace.jl:48: b = sparse(1.0I, 1, 1)
src/atoms/affine/partialtrace.jl:58: a = kron(a, sparse(1.0I, dim, dim))
src/atoms/affine/partialtrace.jl:59: b = kron(b, sparse(1.0I, dim, dim))
src/atoms/affine/partialtrace.jl:90: a = sparse(1.0I, 1, 1)
src/atoms/affine/partialtrace.jl:91: b = sparse(1.0I, 1, 1)
src/atoms/affine/partialtrace.jl:101: a = kron(a, sparse(1.0I, dim, dim))
src/atoms/affine/partialtrace.jl:102: b = kron(b, sparse(1.0I, dim, dim))
src/atoms/affine/index.jl:68: index_matrix = sparse(1:sz, J, 1.0, m, n)
src/atoms/affine/index.jl:70: index_matrix = sparse(1:length(x.inds), x.inds, 1.0, m, n)
src/atoms/affine/multiply_divide.jl:85: objective = kron(sparse(1.0I, x.size[2], x.size[2]), x.children[1].value) * objective
src/atoms/affine/multiply_divide.jl:89: objective = kron(x.children[2].value', sparse(1.0I, x.size[1], x.size[1])) * objective
src/atoms/sdp_cone/operatornorm.jl:73: p = minimize(t, [t*sparse(1.0I, m, m) A; A' t*sparse(1.0I, n, n)] ⪰ 0)
I have a local branch of Convex that uses FillArrays and LazyArrays in place of SparseArrays where it makes sense to do so. Running the example in this issue, I'm seeing a peak memory usage of about 15%, which is down from 98% or so using SparseArrays alone.
Edit: It's still running after 12 minutes, and the memory usage has increased to 16%.
Every time I've interrupted the computation, it seems to be in the middle of this: https://github.com/JuliaOpt/Convex.jl/blob/master/src/conic_form.jl#L62-L71. Note in particular the comment that says it's time consuming.
I also encounter "Out of Memory" errors in many cases, even though I have 32 GB of memory, when I try to solve a quad form with a 40000 x 40000 matrix with density of ~15%.
Is this something I should expect Convex to handle or not?
I tried both with SCS
and Gurobi
and I think it crashes before calling the solver.
It's been a long time since I tried this. But from what I recall, I think the issue was in the problem setup part, so not related to the solver.
@cossio, I agree. I think something in the problem formulation makes it unhappy with medium-sized problems.
I experimented with a more efficient representation in https://github.com/jump-dev/Convex.jl/pull/393 and had some success at least in reducing formulation time (don’t think I measured memory consumption), but I don’t do much optimization myself anymore (and never really needed to solve big problems, just weird ones like mixed-integer SDPs or high precision SDPs), so I don’t know if I’ll ever really have time to work more on that.
Happy to provide code review and guidance if anyone wants to pick that up or try other approaches.
I took a look at this on the branch https://github.com/jump-dev/Convex.jl/pull/504/ but it didn't seem any better. On that branch,
using Convex
m = 33378 # number of rows in Adj
Adj = rand(m, 2730)
x = Variable(2730)
H = Adj * x
problem = minimize(sum(H))
# Load the problem into an MOI model
@time context = Convex.Context(problem, MOI.Utilities.Model{Float64}())
shows the problem already. Reducing m
to a few thousand makes it much faster. It spends most of its time forming the sparse matrices or computing the kron
products.
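To give a sense of why the kron products are costly (a sketch; the dimensions are arbitrary, not taken from the issue):

```julia
using SparseArrays, LinearAlgebra

# Sketch of the kron-with-identity pattern seen in multiply_divide.jl:
# kron(I_k, A) materializes a block-diagonal sparse matrix holding k
# copies of A, so storage grows linearly in k even though I_k is trivial.
A = sparse(rand(100, 50))
k = 20
K = kron(sparse(1.0I, k, k), A)
# size(K) == (2000, 1000); nnz(K) == 20 * 100 * 50
```

For a problem with tens of thousands of rows, repeating this for every product is a lot of allocation even before the solver runs.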
I re-added a problem-specific conic-form cache (which I had dropped in my PR) and switched to SuiteSparseGraphBLAS for sparse arrays, did some optimization, and managed to formulate this in time
1015.189386 seconds (37.91 M allocations: 215.044 GiB, 93.00% gc time)
on 4311f1946a0eedc76868cc607daa64515694acd8 with the script
using Convex, Clarabel, JLD2
import MathOptInterface as MOI
file = JLD2.load(joinpath("/Users/eph/Downloads", "cvx_example.jld2"))
let
# m = 2000
# Adj = file["Adj"][1:m, :]
# N = file["N"][1:m, :, :]
Adj = file["Adj"]
N = file["N"]
x = Variable(2730)
S, V, T = size(N)
lN = log.(N)
M = vec(sum(N[:, :, 2:T]; dims=(2, 3)))
H = Adj * x
problem = maximize(-sum(logsumexp(lN[:, v, t] - H) for t = 1:T-1, v = 1:V) - dot(M, H), [x ≥ -1e2, x ≤ 1e2])
@time context = Convex.Context(problem, MOI.Utilities.Model{Float64}())
end
This is a lot better than before, but it still seems slow and obviously GC heavy. It finished on my 16 GB m1 macbook pro but the memory pressure was high for most of it.
Note also in an earlier run, I ran into GC issues where once I had tried a big run like that once, even small problems took a long time with 99% time in GC. The let
block here is to try to combat that.
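The scoping idea is roughly this (a sketch; the random array is a placeholder for the actual problem data):

```julia
# Keeping large objects inside a let block means they go out of scope when
# the block ends, so the GC can reclaim them before the next run instead of
# them lingering as global bindings.
result = let
    big = rand(10^7)   # placeholder for the large problem data
    sum(big)
end
GC.gc()  # optionally force a full collection once the block exits
```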
edit: In 8b001255778312fc7714aec2c0f3b33ae45ba0ee I switched back to SparseArrays to avoid some GC corruption issues and other bugs, and it regressed to
1303.736478 seconds (11.94 M allocations: 191.588 GiB, 92.49% gc time)
which is still at least "formulatable".
I have an example of a moderately sized problem where problem formulation is consuming massive memory. I have no idea why this happens. Is there something inherently difficult to this kind of problem? Or is this a bug in Convex.jl?
Note that this is not an issue with any solver. The RAM spike occurs at the problem formulation step, the line
problem = maximize(...)
below. To run this code, you need to load the variables N, Adj from the .jld2 file (using JLD2.jl) here: https://github.com/cossio/cvx_example_data. N is a float 3-dimensional Array, and Adj a sparse matrix.
WARNING: The following code took so much memory that I had to reboot my laptop. Run this in a contained environment where you can constrain memory consumption. (For example, using https://github.com/pshved/timeout, start julia with
timeout -m 4000000 julia
to limit memory usage to 4GB.)
UPDATE: I'm now getting the RAM spike at the solve phase, instead of the formulation phase. Both ECOS and SCS have the same problem.