gher-uliege / DIVAnd.jl

DIVAnd performs an n-dimensional variational analysis of arbitrarily located observations
GNU General Public License v2.0

[🐞] High memory consumption and inability to use DIVAndgo or conjugate gradient inversion #117

Closed josselin-aval closed 1 year ago

josselin-aval commented 1 year ago

This request follows a mail exchange with Aida Alvera Azcarate and Charles Troupin.

Please find below the description of our problem:

We would like to estimate a quantity on a spatio-temporal grid. In our latest tests, the grid has dimensions (216, 192, 25), corresponding to longitudes, latitudes, and dates, respectively.

The observations come from satellite images (about 1 million observations). We have observations for the first 14 dates. These observations have holes corresponding to data that we removed from the satellite images (for example, clouds).

To obtain this grid, we need both to interpolate spatially on the known dates and to extrapolate temporally and spatially on the unknown dates. DIVAnd seemed particularly well suited to this problem, in particular the function DIVAndrun.
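For reference, the kind of call we use, reduced to a minimal self-contained 2D sketch with toy data (following the package examples; our actual script is given below):

using DIVAnd

# toy observations: 100 scattered points of a smooth field
x = rand(100); y = rand(100)
f = sin.(x*6) .* cos.(y*6)

# regular 50x50 analysis grid on [0,1]x[0,1]
mask, (pm, pn), (xi, yi) = DIVAnd_rectdom(range(0, stop = 1, length = 50),
                                          range(0, stop = 1, length = 50))

len = 0.1        # correlation length
epsilon2 = 1.0   # noise-to-signal ratio

fi, s = DIVAndrun(mask, (pm, pn), (xi, yi), (x, y), f, len, epsilon2)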

We first performed tests on simulations in order to get used to the tool, then moved on to the real data mentioned above. For both the simulations and the real data, the results seem consistent. However, with the real data we run into memory and computation-time problems. Regarding memory, DIVAndrun consumed up to about 100 GB of RAM with its default settings, while the final spatial coverage is only about 200 km square.

We realized that the DIVAndgo function was specifically designed to handle this type of problem; however, the results obtained differ from those of DIVAndrun. We therefore tested other inversion methods in DIVAndrun. The preconditioned conjugate gradient significantly reduces the memory problem, but the results are different and inconsistent with the standard inversion. The conjugate gradient does not seem to converge (a warning is issued), which may explain these results. The multigrid method with preconditioned gradient, on the other hand, encounters an error at runtime.
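For reference, the inversion backends mentioned here are selected via the inversion keyword of DIVAndrun (a sketch with the grid and data arguments abbreviated as args...):

fi, s = DIVAndrun(args...)                          # default: direct solve (Cholesky factorization)
fi, s = DIVAndrun(args...; inversion = :pcg)        # preconditioned conjugate gradient
fi, s = DIVAndrun(args...; inversion = :cg_amg_sa)  # conjugate gradient with algebraic multigrid preconditioner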

In order to improve these results, do you have any advice on the parameterization of DIVAnd? Do you have any advice on the use of DIVAnd in this context, especially on temporal extrapolation?

We remain at your disposal to give you further information and details.

ctroupin commented 1 year ago

Hello Josselin, could you maybe share with us the script you run to perform the interpolation with DIVAnd? (By email if you prefer.)

Are you running it on a laptop or another machine? Thanks

EDIT: when you say

the results are different and inconsistent with the standard inversion...

can you be more precise, and maybe add a picture of the results? This can be due to the choice of the analysis parameters (correlation length, noise-to-signal ratio).

Alexander-Barth commented 1 year ago

We have observations for the first 14 dates.

Not sure I understand your use case: are you trying to predict the parameter using DIVAnd for the remaining 11 days?

josselin-aval commented 1 year ago

Hello Charles and Alexander, thank you for your support.

could you maybe share with us the script you run to perform the interpolation with DIVAnd?

Please find the script below:

"""
julia "venv" in ~
"""

println("RUN DIVAND...\n")

#################################
########### preamble ###########
#################################

# https://github.com/fhs/NPZ.jl
using NPZ
using PyPlot
using DIVAnd

"""
parameters
"""

# paths
divand_parameters_folder = joinpath(@__DIR__, "divand_parameters")
divand_results_folder = joinpath(@__DIR__, "divand_results")

# parameters

# can sometimes be determined by physical arguments
len = (parse(Float64, ARGS[2]), parse(Float64, ARGS[2]), parse(Float64, ARGS[3]))
# very small error on the data compared to the climatology: we trust the data
epsilon2 = parse(Float64, ARGS[4])

# variable
variable = ARGS[1]

# mode (test or work)
modes = ["test", "work"]

#############################
########### body ###########
#############################

# load the parameters for divand grid
x_min = npzread(joinpath(divand_parameters_folder, "x_min.npy"))
x_max = npzread(joinpath(divand_parameters_folder, "x_max.npy"))
y_min = npzread(joinpath(divand_parameters_folder, "y_min.npy"))
y_max = npzread(joinpath(divand_parameters_folder, "y_max.npy"))

# load the parameters for divandrun
x = npzread(joinpath(divand_parameters_folder, "x_" * string(variable) * ".npy"))
y = npzread(joinpath(divand_parameters_folder, "y_" * string(variable) * ".npy"))
t = npzread(joinpath(divand_parameters_folder, "t_" * string(variable) * ".npy"))
f_minus_climato = npzread(joinpath(divand_parameters_folder, "f_minus_climato_" * string(variable) * ".npy"))

for i in 1:length(modes)

    """
    load parameters
    """

    # load the parameters for divand grid
    t_observations = npzread(joinpath(divand_parameters_folder, "t_observations_" * string(modes[i]) * ".npy"))

    # load the parameters for divandrun
    mask_custom = npzread(joinpath(divand_parameters_folder, "mask_" * string(variable) * "_" * string(modes[i]) * ".npy"))

    """
    compute the final grid
    """

    # default grid is 1km step
    mask, (pm, pn, po), (xi, yi, ti) = DIVAnd_rectdom(
        range(x_min, stop = x_max, step = 1000),
        range(y_min, stop = y_max, step = 1000),
        t_observations
    )

    """
    apply divandrun
    """

    @time fi, s = DIVAndrun(mask_custom, (pm, pn, po), (xi,yi,ti), (x, y, t), f_minus_climato, len, epsilon2)

    npzwrite(joinpath(divand_results_folder, "fi_" * string(modes[i]) * "_" * string(variable) * ".npy"), fi)
    npzwrite(joinpath(divand_results_folder, "xi_" * string(modes[i]) * "_" * string(variable) * ".npy"), xi[:,:,1])
    npzwrite(joinpath(divand_results_folder, "yi_" * string(modes[i]) * "_" * string(variable) * ".npy"), yi[:,:,1])

end

println("RUN DIVAND DONE...\n")

You will need the input data (approximately a few tens of MB) to run the script. How can I send you this data? Directly via this interface?

As you will see, some Julia packages need to be installed first.

Then the script is run with the following command in a terminal: julia apply_divand.jl $variable $len_spatial $len_temporal $epsilon2.

The results shown below have been obtained with the following arguments:

While epsilon2 may still need further optimization, len_spatial and len_temporal were determined by "physical arguments": 10000 m = 10 km and 5 days, corresponding to the desired spatial and temporal correlation lengths. It is however possible to change these correlation lengths somewhat, especially if this improves memory consumption.
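For example, a call could look like julia apply_divand.jl secchi 10000 5 0.01, where the variable name and the epsilon2 value are placeholders.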

Are you running it on a laptop or another machine?

Another machine, a server.

can you be more precise, and maybe add a picture of the results? This can be due to the choice of the analysis parameters (correlation length, noise-to-signal ratio).

After running the script apply_divand.jl, we run a Python script to analyze the results (we usually work with Python). The results presented below are produced by this Python script, based on the output of DIVAnd to which we add the background.

The first figure corresponds directly to the results of the script apply_divand.jl (work mode), while the second figure corresponds to the results of the script apply_divand.jl after replacing the line:

@time fi, s = DIVAndrun(mask_custom, (pm, pn, po), (xi,yi,ti), (x, y, t), f_minus_climato, len, epsilon2)

with the line:

@time fi, s = DIVAndrun(mask_custom, (pm, pn, po), (xi,yi,ti), (x, y, t), f_minus_climato, len, epsilon2, inversion=:pcg)

i.e. with a different inversion approach, aimed in particular at improving memory consumption.

Figure 1 (image: fusion_work_secchi)

Figure 2 (image: fusion_work_secchi)

While Figure 1 shows consistent results given the input data, i.e. spatial gap filling and temporal extrapolation extending past trends, Figure 2 shows inconsistent results: they in fact correspond to the background.

Not sure I understand your use case: are you trying to predict the parameter using DIVAnd for the remaining 11 days?

Not 11 days but 11 times (not 11 dates, sorry for the confusion), which correspond to an oversampling of the following 5 days. In other words, we are trying to predict the parameter using DIVAnd for the next 5 days. In the figures above, the time is shown at the top of each thumbnail (in red, the times where data is missing, i.e. the times we want to extrapolate).

In conclusion, the results obtained with the standard inversion (Cholesky factorization) seem consistent given the input data; however, it is very memory-consuming and will become problematic when we process larger areas.

jmbeckers commented 1 year ago

Random thoughts:

Extrapolating past trends will in all cases be tricky, and for longer lead times DIVAnd will return the background field.

For the computational aspects, pcg might not converge since it uses no preconditioner.

You might try DIVAndjog instead of DIVAndrun, where a preconditioner is used. You can parameterize it via the steps for working on a coarser grid in the preconditioner (which uses the direct inversion) and via the length-scale multipliers.

Example in 2D

DIVAndjog(mask,(pm,pn),(xi,yi),(obslon,obslat),obsval.-mean(obsval),len,epsilon2,[4,4],[4,4];alphabc=2.0);

For your case I would try something like

DIVAndjog(...  epsilon2,[2,2,1],[1,1,0];alphabc=2.0);

That should make the preconditioner much less memory-consuming than the full 3D analysis, by using as a preconditioner a coarser grid with no time correlation.
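Spelled out with the variables of your script (an untested sketch):

# coarsen the preconditioner grid by a factor 2 in space and drop its time correlation
@time fi, s = DIVAndjog(mask_custom, (pm, pn, po), (xi, yi, ti), (x, y, t), f_minus_climato, len, epsilon2, [2, 2, 1], [1, 1, 0]; alphabc=2.0)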

(m2c ...)

I would not use DIVAndgo in this case, because in 3D I think it is hardcoded to interpret the domain as 3D in space, and it will apply windowing in the z direction (your time direction). You definitely need to keep the correlations correct in the time direction.

josselin-aval commented 1 year ago

Hello @jmbeckers and thank you very much for your support and suggestions.

I will try that and keep in touch.

Alexander-Barth commented 1 year ago

Extrapolating past trends will in all cases be tricky, and for longer lead times DIVAnd will return the background field.

I had the same concern. DIVAnd would be somewhere between a persistence forecast and a linear extrapolation. It is really not designed to do a forecast (in time).

The multigrid method with preconditioned gradient, on the other hand, encounters an error at runtime.

Can you give us the full error message (and stack trace), a minimal reproducer (e.g. with random data), and the output of import Pkg; Pkg.status()?

jmbeckers commented 1 year ago

I can confirm the problem with the multigrid version:

using DIVAnd

# observations
x = rand(750);
y = rand(750);
z = rand(750);
f = sin.(x*6) .* cos.(y*6) + sin.(z*6) .* cos.(x*6);

# final grid
#
testsize=300
testsizez=4
xi,yi,zi = ndgrid(range(0,stop=1,length=testsize),range(0,stop=1,length=testsize),range(0,stop=1,length=testsizez));

# reference field
fref = sin.(xi*6) .* cos.(yi*6)+sin.(zi*6) .* cos.(xi*6);

# all points are valid points
mask = trues(size(xi));

# this problem has a simple cartesian metric
# pm is the inverse of the resolution along the 1st dimension
# pn is the inverse of the resolution along the 2nd dimension
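# po is the inverse of the resolution along the 3rd dimension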

pm = ones(size(xi)) / (xi[2,1,1]-xi[1,1,1]);
pn = ones(size(xi)) / (yi[1,2,1]-yi[1,1,1]);
po = ones(size(xi)) / (zi[1,1,2]-zi[1,1,1]);

# correlation length
len = 0.5;

# obs. error variance normalized by the background error variance
epsilon2 = 1.;
fiiter,s = DIVAndrun(mask,(pm,pn,po),(xi,yi,zi),(x,y,z),f,len,epsilon2;inversion=:cg_amg_sa);

leads to

 97  6.90e+02
 98  6.94e+02
 99  7.08e+02
100  7.03e+02

MethodError: Cannot `convert` an object of type Float64 to an object of type Vector{Float64}
Closest candidates are:
  convert(::Type{Array{T, N}}, ::StaticArraysCore.SizedArray{S, T, N, N, Array{T, N}}) where {S, T, N} at C:\Users\jmbeckers\.julia\packages\StaticArrays\jA1zK\src\SizedArray.jl:88
  convert(::Type{Array{T, N}}, ::StaticArraysCore.SizedArray{S, T, N, M, TData} where {M, TData<:AbstractArray{T, M}}) where {T, S, N} at C:\Users\jmbeckers\.julia\packages\StaticArrays\jA1zK\src\SizedArray.jl:82
  convert(::Type{T}, ::LinearAlgebra.Factorization) where T<:AbstractArray at C:\Users\jmbeckers\AppData\Local\Programs\Julia-1.8.5\share\julia\stdlib\v1.8\LinearAlgebra\src\factorization.jl:58
  ...

Stacktrace:
 [1] *(C::CovarIS{Float64, SparseArrays.SparseMatrixCSC{Float64, Int64}}, v::Vector{Float64})
   @ DIVAnd c:\Users\jmbeckers\Documents\GitHub\DIVAnd.jl\src\special_matrices.jl:51
 [2] DIVAnd_solve!(s::DIVAnd.DIVAnd_struct{Float64, Int64, 3, SparseArrays.SparseMatrixCSC{Float64, Int64}}, fi0::Vector{Float64}, f0::Vector{Float64}; btrunc::Vector{Any})
   @ DIVAnd c:\Users\jmbeckers\Documents\GitHub\DIVAnd.jl\src\DIVAnd_solve.jl:32
 [3] DIVAndrun(operatortype::Type, mask::BitArray{3}, pmnin::Tuple{Array{Float64, 3}, Array{Float64, 3}, Array{Float64, 3}}, xiin::Tuple{Array{Float64, 3}, Array{Float64, 3}, Array{Float64, 3}}, x::Tuple{Vector{Float64}, Vector{Float64}, Vector{Float64}}, f::Vector{Float64}, lin::Float64, epsilon2::Float64; velocity::Tuple{}, primal::Bool, factorize::Bool, tol::Float64, maxit::Int64, minit::Int64, constraints::Tuple{}, inversion::Symbol, moddim::Vector{Any}, fracindex::Matrix{Float64}, alpha::Vector{Any}, keepLanczosVectors::Int64, compPC::typeof(DIVAnd_pc_none), progress::DIVAnd.var"#286#288", fi0::Array{Float64, 3}, f0::Vector{Float64}, alphabc::Float64, scale_len::Bool, btrunc::Vector{Any}, MEMTOFIT::Float64, topographyforfluxes::Tuple{}, fluxes::Tuple{}, epsfluxes::Int64, epsilon2forfractions::Int64, RTIMESONESCALES::Tuple{}, QCMETHOD::Tuple{}, coeff_laplacian::Vector{Float64}, coeff_derivative2::Vector{Float64}, mean_Labs::Nothing)
   @ DIVAnd c:\Users\jmbeckers\Documents\GitHub\DIVAnd.jl\src\DIVAndrun.jl:131
 [4] #DIVAndrun#290
   @ c:\Users\jmbeckers\Documents\GitHub\DIVAnd.jl\src\DIVAndrun.jl:298 [inlined]
 [5] top-level scope
   @ In[1]:34

for Julia 1.8.5 and my current DIVAnd version

Status `C:\Users\jmbeckers\.julia\environments\v1.8\Project.toml`
  [2169fc97] AlgebraicMultigrid v0.5.1
  [efc8151c] DIVAnd v2.7.7 `c:/Users/jmbeckers/Documents/GitHub/DIVAnd.jl`
  [7073ff75] IJulia v1.24.0
  [f5a24dde] LimitedLDLFactorizations v0.5.0
  [d330b81b] PyPlot v2.11.0
  [9449cd9e] TSVD v0.4.3
jmbeckers commented 1 year ago

The multigrid method error seems to be in special_matrices.jl, where the output of cg changes depending on the log flag, which needs to be true for the output the code expects.
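For illustration (a minimal standalone example, not the DIVAnd code itself), the return value of IterativeSolvers.cg changes with the log flag:

using SparseArrays, IterativeSolvers

A = spdiagm(0 => fill(2.0, 10))    # a small symmetric positive definite system
b = ones(10)

x = cg(A, b)                       # log = false (default): only the solution vector
x, history = cg(A, b; log = true)  # log = true: a (solution, ConvergenceHistory) tuple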

I will prepare the pull request.

Alexander-Barth commented 1 year ago

Thanks a lot @jmbeckers, indeed, this is the issue. The current version already works if you set ENV["JULIA_DEBUG"] = "DIVAnd"; with the updated version this will no longer be necessary.

Unfortunately, the multigrid method is quite slow to converge (and the norm of the residual does not decrease at every iteration). The number of iterations can be set using maxit:

ENV["JULIA_DEBUG"] = "DIVAnd"
fiiter,s = DIVAndrun(mask,(pm,pn,po),(xi,yi,zi),(x,y,z),f,len,epsilon2;inversion=:cg_amg_sa, maxit = a_large_number);

In any case, it might be good to set ENV["JULIA_DEBUG"] = "DIVAnd" to see the progress. What you see on the screen then is the norm of the residual: https://github.com/JuliaLinearAlgebra/IterativeSolvers.jl/blob/4c4754c05c6c7dbc51a5c934fc2e841405075e97/src/cg.jl#L234

You would typically divide the norm of the residual by the number of grid points, to get the error "per grid cell".
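For example (resnorm being the residual norm printed by the solver; this is only a rough normalization, not a DIVAnd function):

residual_per_cell = resnorm / length(mask)   # length(mask) = total number of grid points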

josselin-aval commented 1 year ago

Multigrid method

The multigrid method works after adding ENV["JULIA_DEBUG"] = "DIVAnd", thanks a lot.

I have tried the multigrid method with the default maxit (100), and it indeed does not seem to be enough. With maxit equal to 300, the norm of the residuals decreases further and the results are more consistent with the standard inversion. I need to increase maxit further and identify a satisfactory value.

The memory consumption is reduced by a factor of 10. Increasing maxit will increase the computation time, but the main issue, memory, should be significantly improved with the multigrid approach. Thank you for your help.
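For reference, the corresponding call is essentially (a sketch with the variables of the script above):

ENV["JULIA_DEBUG"] = "DIVAnd"   # show the iteration progress
@time fi, s = DIVAndrun(mask_custom, (pm, pn, po), (xi, yi, ti), (x, y, t), f_minus_climato, len, epsilon2, inversion=:cg_amg_sa, maxit=300)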

DIVAndjog

In order to keep the correlation in time, and following your suggestions, I have tested the following command:

@time fi, s = DIVAndjog(mask_custom, (pm, pn, po), (xi,yi,ti), (x, y, t), f_minus_climato, len, epsilon2, [2,2,1], [1,1,1], alphabc=2.0)

It seemed more appropriate to me to set lmask to [1,1,1] in order to keep the correlation lengths as defined in the previous DIVAndrun tests. However, I am not sure about the role of lmask. I have the impression, perhaps wrongly, that if the correlation lengths on the coarser grid are too different from the desired final ones, this preconditioner-based approach may give results different from the standard inversion. Can you give me more information on this?

I recall here the results of the standard inversion, for comparison:

(image: fusion_work_secchi)

Then the results of DIVAndjog are presented below:

(image: fusion_work_secchi)

which are significantly different from the standard inversion.

In order to better understand the principle of DIVAndjog, I have looked more closely at the source code, and it seems that another argument, pcmethod, can be specified (https://github.com/gher-uliege/DIVAnd.jl/blob/2f6b5f5f058ab58797979e0dc5b6c2f1343969f3/src/DIVAndjog.jl#L36). Can you give me more information about this argument and the five methods implemented in the code?

I have tested different values of pcmethod to see the impact on the results:

pcmethod=2

(image: fusion_work_secchi)

which are overall consistent with the standard inversion. The memory consumption is reduced by a factor of 5. This indeed seems to be an alternative to the multigrid approach. Once again, thank you for your help.

pcmethod=3 and pcmethod=4 seem to give similar results to pcmethod=2, but with increasing computation time.

pcmethod=5

This case does not seem to work:

ERROR: LoadError: BoundsError: attempt to access (3,)
  at index [2]
Stacktrace:
 [1] getindex(::Tuple, ::Int64) at ./tuple.jl:24
 [2] DIVAndjog(::Array{Bool,3}, ::Tuple{Array{Float64,3},Array{Float64,3},Array{Float64,3}}, ::Tuple{Array{Float64,3},Array{Float64,3},Array{Float64,3}}, ::Tuple{Array{Float64,1},Array{Float64,1},Array{Float64,1}}, ::Array{Float64,1}, ::Tuple{Float64,Float64,Float64}, ::Float64, ::Array{Int64,1}, ::Array{Int64,1}, ::Int64; rng::Random._GLOBAL_RNG, alphapc::Array{Any,1}, otherargs::Base.Iterators.Pairs{Symbol,Float64,Tuple{Symbol},NamedTuple{(:alphabc,),Tuple{Float64}}}) at /home/aval/.julia/packages/DIVAnd/MV3j9/src/DIVAndjog.jl:1252
 [3] macro expansion at ./timing.jl:174 [inlined]
 [4] top-level scope at /home/aval/Desktop/sawi_secchi/fusion/apply_divand/apply_divand.jl:83
 [5] include(::Function, ::Module, ::String) at ./Base.jl:380
 [6] include(::Module, ::String) at ./Base.jl:368
 [7] exec_options(::Base.JLOptions) at ./client.jl:296
 [8] _start() at ./client.jl:506
in expression starting at /home/aval/Desktop/sawi_secchi/fusion/apply_divand/apply_divand.jl:54

for Julia 1.5.3 and my current DIVAnd version

[deps]
DIVAnd = "efc8151c-67de-5a8f-9a35-d8f54746ae9d"
Distributed = "8ba89e20-285c-5b6f-9357-94700520ee1b"
JLD = "4138dd39-2aa7-5051-a626-17a0bb65d9c8"
NPZ = "15e1cf62-19b3-5cfa-8e77-841668bca605"
PackageCompiler = "9b87118b-4619-50d2-8e1e-99f35a4d4d9d"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
PyPlot = "d330b81b-6aea-500a-939a-2ce795aea3ee"

Focusing on line 1252 of DIVAndjog.jl (https://github.com/gher-uliege/DIVAnd.jl/blob/2f6b5f5f058ab58797979e0dc5b6c2f1343969f3/src/DIVAndjog.jl#L1252), the error seems to come from the size function, which returns (3,), which cannot be indexed with [2]. Maybe there is something wrong in my parameterization.
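The failing pattern can be reproduced in isolation:

a = zeros(3)
size(a)      # (3,), a 1-element tuple
size(a)[2]   # ERROR: BoundsError: attempt to access (3,) at index [2]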

jmbeckers commented 1 year ago

Hi,

Thanks for the feedback

The lmask array allows one to decouple one or more directions for the PRECONDITIONER only. That results in a much less dense covariance matrix (and lower memory demand) for the latter. In your case, since you do an extrapolation, maybe it is indeed better to keep the correlation in the last direction in the preconditioner too. So you might even try [0, 0, 1].
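With the variables from your script, that would be (an untested sketch):

# keep only the time-direction correlation in the preconditioner
@time fi, s = DIVAndjog(mask_custom, (pm, pn, po), (xi, yi, ti), (x, y, t), f_minus_climato, len, epsilon2, [2, 2, 1], [0, 0, 1]; alphabc=2.0)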

For pcmethod 5 I really have no idea; it has been a long time since I used it. As you have realized, finding the right preconditioner is not easy... I will try to have a look at pcmethod 5, but I do not think it will be THE preconditioner.

jmbeckers commented 1 year ago

If I take out the if size() ... test for pcmethod=5 (I do not remember why it is there...), the execution works but does not converge (though the direction is right).

josselin-aval commented 1 year ago

Thank you for your explanation and test.

the execution works but does not converge (though the direction is right).

With the cases pcmethod={2,3,4}, which give results that are overall consistent with the standard inversion, I still had the warning indicating that the conjugate gradient does not converge.

Is it a question of tolerance, i.e. is the tolerance too restrictive, for instance?

jmbeckers commented 1 year ago

Thank you for your explanation and test.

the execution works but does not converge (though the direction is right).

With the cases pcmethod={2,3,4}, which give results that are overall consistent with the standard inversion, I still had the warning indicating that the conjugate gradient does not converge.

Is it a question of tolerance, i.e. is the tolerance too restrictive, for instance?

Yes, probably it is very tight and not necessary to reach the "exact" solution. Once you have found a reasonable preconditioner, you can always try to play with the tolerances.
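For example, assuming the tol keyword is passed through to the solver in the same way as alphabc (the value 1e-3 is only illustrative and should be tuned):

@time fi, s = DIVAndjog(mask_custom, (pm, pn, po), (xi, yi, ti), (x, y, t), f_minus_climato, len, epsilon2, [2, 2, 1], [1, 1, 1]; alphabc=2.0, tol=1e-3)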

jmbeckers commented 1 year ago

I am closing the issue since, with the corrections, the iterative solvers are available.

josselin-aval commented 1 year ago

Yes, all the points of this issue have been resolved; everything is clear on our side. We can now move forward.

Thank you again for your support and quick responses, this was efficient!