gher-uliege / DIVAnd.jl

DIVAnd performs an n-dimensional variational analysis of arbitrarily located observations
GNU General Public License v2.0

Calculated CPME error is unrealistically low! #95

Closed Leon6j closed 1 year ago

Leon6j commented 2 years ago

It seems the calculated cpme errors are unrealistically low.

I have a variable with a global average of around 2000 umol/kg. The mean of its estimated cpme errors from DIVAnd.jl is 7 x 10^-4 umol/kg.

The best measurement error for this variable is ~5 umol/kg. I would assume the gridding error would be significantly higher than the measurement error, but the cpme error seems to be far out of range. I would trust these error estimates if they were 10,000 times higher. Am I missing anything?

Many thanks!

jmbeckers commented 2 years ago

Thanks for the message. It is not quite clear whether you have taken into account the fact that error fields in DIVAnd are relative error fields, scaled by the background variance (so you get values between 0 and 1).

Also note that an error field can be lower than the observational error if you have a lot of independent observations nearby.

That said, cpme is indeed underestimating the error when there are a lot of observations.

Do you work with diva3d or with DIVAndrun? In the latter case you might want to try out our new version DIVAnd_errormap, which tries to use the most adequate error estimate based on your data distribution.

Leon6j commented 2 years ago

Many thanks for the reply!

No, I haven't factored in the fact that the cpme error is a relative value between 0 and 1. So how do I convert the cpme output into actual error values in my reported units of umol/kg? Would I have to go through each grid point and multiply the cpme error by the actual range of this variable, i.e., max(Variable) - min(Variable)? What if a few min and max outliers tip the balance? Is there a reason for not giving us the actual error values?

Below is my current command to calculate the error:

    fi,s = DIVAndrun(mask,(pm,pn),(X,Y),(x,y),f,len,epsilon2;alphabc=2);

If I migrate to DIVAnd_errormap, what would be the new syntax? Is the output of DIVAnd_errormap also a relative value between 0 and 1?

Many thanks!

croachutas commented 2 years ago

You need to compute the background/a priori error variance and use that to scale the output of CPME. I don't know the exact formulation the DIVAnd developer team uses, but it should be something along the lines of (per Wong et al. 2001, Journal of Climate):

var_BG = (1/(2n)) * sum_(i,j) (d_i - d_j)^2, (1)

where d_i, d_j denote pairs of closely spaced observations and n denotes the total number of observations.
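As a minimal sketch of (1) (the pair selection and all names are my own; note I treat n as the number of close pairs rather than the total observation count, which is one common reading of this estimator):

```julia
# Semivariance-style estimate of the background variance from pairs of
# closely spaced observations. Illustrative only, not DIVAnd's formulation.
function background_variance(d::Vector{Float64}, pairs::Vector{Tuple{Int,Int}})
    n = length(pairs)   # n taken as the number of close pairs (assumption)
    return sum((d[i] - d[j])^2 for (i, j) in pairs) / (2n)
end

d = [10.0, 10.4, 20.0, 19.8]   # toy observations
pairs = [(1, 2), (3, 4)]       # indices of "closely spaced" pairs
background_variance(d, pairs)  # ≈ 0.05
```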

Also, CPME, if I recall correctly, only works on the variance terms and ignores the contribution of covariance to the error. To deal with that you actually need to extract the inverse relative covariance matrix from the structure s created by DIVAnd. I will need to dig back through my code to double-check exactly how to do that.

Error propagation to compute the total error for a specific domain (depth level, region, etc.) including covariance can then be done as (e.g. Tellinghuisen 2001, J. Phys. Chem. A):

var = g^T C g,

where g is the data vector (here, data at grid points), C is the covariance matrix, and ^T denotes a transpose. Since we're dealing with relative variance we can treat g as a vector of ones and multiply through by the background variance (obtained per (1) or similar):

var = var_BG * g^T C g. (2)

Of course, the problem is that DIVAnd stores the (sparse) inverse covariance matrix inv(C), not C itself, as C would be a massive, dense matrix and a real pain to store. This equally stops brute-force inversion (computing inv(inv(C))) from being practical at large scale (though breaking the domain down into overlapping windows, computing C within each window, and evaluating (2) can work if correlation scales are short enough).

So, the issue now is how you wrangle the Cg term in (2) to something that can be solved with what we actually have (g, inv(C)).

Let:

Cg=x

If we multiply both sides by inv(C):

inv(C).Cg=inv(C)x,

But inv(C).C is the identity matrix I (by definition) and Ig = g, again, by definition. So, we obtain:

g=inv(C)x

Where we know g and inv(C) and want to find x.

Now we've got an equation we can solve to obtain x (which, remember, is the Cg term in (2)). The "easy" way is to throw something like MATLAB's mldivide (backslash) at it and let the function select and apply appropriate solvers/factorizations developed by smarter mathematicians. The equivalent exists in Julia (the \ operator), though it may not support as wide an array of solvers and factorizations as MATLAB.

So: var = var_BG * g^T * mldivide(inv(C), g) (3)
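A toy sketch of (3) in Julia, with a small random SPD sparse matrix standing in for the stored inv(C) (all names illustrative; a real inverse covariance would come out of the s structure):

```julia
using LinearAlgebra, SparseArrays

# Stand-in for the sparse inverse covariance inv(C):
# A'A + n*I is symmetric positive definite by construction.
n = 200
A = sprandn(n, n, 0.02)
invC = A' * A + n * I

g = ones(n)        # weight vector of ones (relative variance)
var_BG = 4.0       # background variance from (1), assumed known

x = invC \ g       # solves inv(C)*x = g, i.e. x = C*g, without ever forming C
total_var = var_BG * dot(g, x)   # equation (3)
```

The backslash inspects the matrix and picks an appropriate sparse factorization, which is exactly the "let the solver choose" approach described above.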

If you want things as confidence intervals instead of just variance you'll also need the effective degrees of freedom, which come from the trace of the H matrix (per Janson et al., 2015, Biometrika), another thing that's stored in the s structure DIVAnd spits out. If you're lazy and pessimistic you can just resort to Chebyshev's inequality (Tchebichef, 1867), in which case 95% CIs correspond to the ±4.4721 standard deviation range...
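The ±4.4721 figure follows from Chebyshev's bound P(|X − μ| ≥ kσ) ≤ 1/k²: setting 1/k² = 0.05 gives k = √20. As a one-line check:

```julia
k = sqrt(1 / 0.05)   # Chebyshev: 1/k^2 = 0.05  →  k = √20 ≈ 4.4721
```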

P.S. Also note that if you have regions isolated from observational data (e.g. you've got narrow coastal features which are only intermittently resolved at the resolution you're using) then variance estimates for those isolated grid-points can produce nonsense.

Leon6j commented 2 years ago

Many thanks! That seems like a very complex calculation.

I wonder if the DIVAnd development team could build that into the program. As of now, it is very hard to justify using DIVAnd for publications when reviewers are almost guaranteed to ask questions about gridding uncertainties.

jmbeckers commented 2 years ago

The fact that sigma^2 (the background variance) is difficult to calculate is actually the reason why it is very common to present relative errors in publications.

To make things even more complicated, the formulas provided by croachutas rely on a certain number of statistical hypotheses, most of which are not really satisfied in ocean observation systems. Even the error field itself depends on quite some strong hypotheses (one of which is actually the correct specification of epsilon2).

So I see error maps basically as a rough indication of the robustness of the analysis rather than actual error estimates (personal view).

If you nevertheless want to provide absolute values, you can estimate sigma^2 using the data variance V and observation error variance (including representativity errors) e^2:

sigma^2 + e^2 = V

with epsilon2=e^2/sigma^2 the value provided to Diva (note that this is a relative error too)

With these two equations you can estimate sigma^2 from (1+epsilon2) sigma^2 = V.
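A minimal sketch of that scaling, with toy numbers standing in for real observations (obsval, relerr, and all values here are mine, not DIVAnd output):

```julia
using Statistics

obsval   = 2000.0 .+ 5.0 .* randn(10_000)  # toy observations, umol/kg
epsilon2 = 2.0                             # relative noise/signal used in the analysis

V      = var(obsval)         # total data variance
sigma2 = V / (1 + epsilon2)  # background variance from (1+epsilon2)*sigma^2 = V

relerr = 0.0007                  # a relative (0..1) error value, e.g. from cpme
abserr = sqrt(sigma2 * relerr)   # absolute error, same units as the field
```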

Since the error field relies on many hypotheses, we actually provide approximations to it to reduce computing time. If you want to use the "exact" formula (see the explanations by croachutas), you can do so by specifying "exact" as the method for DIVAnd_errormap (it might take some time though, depending on your grid size).

Use of DIVAnd_errormap:

    fi,s = DIVAndrun(mask,(pm,pn),(X,Y),(x,y),f,len,epsilon2;alphabc=2);
    map,method = DIVAnd_errormap(mask,(pm,pn),(X,Y),(x,y),f,len,epsilon2,s; method = "auto", Bscale = false, alphabc=2)

to let it choose the method automatically. To force the exact computation, use method="exact".

Leon6j commented 2 years ago

Many thanks for the reply!

So the map generated by the syntax below will be a grid of exact errors in my units?

map,method=function DIVAnd_errormap((mask,(pm,pn),(X,Y),(x,y),f,len,epsilon2, s; method = "exact", Bscale = false, alphabc=2)

jmbeckers commented 2 years ago

No, it will still be a relative error, but based on the complete analysis covariance.

Leon6j commented 2 years ago

Thanks! What will be its unit then?

Right now, I'm trying to calculate area-averaged mean of the variable, and add an error bar to it.

jmbeckers commented 2 years ago

no units, since it is relative

jmbeckers commented 2 years ago

Calculating the error on the average is still another problem (it's not just the mean of the error).

Leon6j commented 2 years ago

Many thanks!

So we do not know how to calculate the error bar values yet?

jmbeckers commented 2 years ago

From memory, you can calculate it (again as a relative error variance) using the s structure coming out of DIVAndrun, as (h' * s.P * h) / (sum(h)^2), where h is an array containing the surface of each grid point, created from a matrix "gridsurf" of grid surfaces (so the mask times the grid surface if each grid point is the same):

    h = statevector_pack(s.sv, (gridsurf,))

Without having tested ....
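A toy check of this kind of quadratic form (h' * s.P * h scaled by the squared total surface), with a tiny dense stand-in for s.P (values purely illustrative):

```julia
using LinearAlgebra

P = [1.0 0.2;     # tiny dense stand-in for the relative analysis
     0.2 1.0]     # error covariance s.P
h = [2.0, 3.0]    # grid-cell surfaces

relvar_mean = (h' * P * h) / sum(h)^2   # relative error variance of the area mean
# ≈ 0.616
```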

I think we could try to add an example somewhere or add a line in the notebooks on error calculations

Leon6j commented 2 years ago

Thanks! Having an example would be awesome...

Leon6j commented 1 year ago

Dear Jmbeckers,

Many thanks again for your equations to calculate the error bar values. Unfortunately, I still do not have a clue. That said, I do have an example now.

Below are the values of an example: Variable: 4.53 cpme: 0.00056.

In this case, how would I convert my cpme value to the true error value, so that I can plot the errors onto my errorbar plot?

jmbeckers commented 1 year ago


Can you please be more specific? Is the example a gridded analysis for which you show one grid-point value and the corresponding cpme value?

Or is it the spatial average of both your analysed field and cpme?

BTW, in both cases your cpme looks very low, so you probably have a very low epsilon2 value. Make sure that makes sense for your application.

Leon6j commented 1 year ago

Many thanks for the reply. It is the latter, i.e., both are globally area-averaged values of the gridded values at each grid point. Below are the settings for my DIVAnd.jl run:

    LX = 146; LY = 76; len = (LX, LY); epsilon2 = 2;

I appreciate your help very much.

jmbeckers commented 1 year ago


In that case, averaging the error field does NOT provide the error on the mean. For that you need to calculate, using the structure s returned from the analysis:

    # a uniform grid is assumed
    gridsurf = ones(Float64, size(s.P)[1], 1)
    erronmean = diagMtCM(s.P, gridsurf)[1] / (sum(gridsurf)^2)

That is the error variance of the mean, still scaled by the background error variance (no units).

So if you need the standard deviation in the same units as your field (epsilon2 is the noise/signal ratio of the analysis and obsval are the observations you used):

    myabserror = sqrt.(1.0/(1.0+epsilon2)*var(obsval).*erronmean)[1]

Leon6j commented 1 year ago

Thank you!

So if my gridding function is as below, obsval would be fi, i.e., the gridded values on the 360x180 grid? Or should it be my 4.53 value?

    fi,s = DIVAndrun(mask,(pm,pn),(X,Y),(x,y),f,len,epsilon2;alphabc=2);

And all I need to do is add the lines below to my code, and the generated myabserror will be the value I need, instead of the 0.00056 number I currently have?

    gridsurf      = ones(Float64, size(s.P)[1], 1);
    erronmean  = diagMtCM(s.P, gridsurf)[1]/(sum(gridsurf)^2);
    myabserror = sqrt.(1.0/(1.0+epsilon2)*var(obsval).*erronmean)[1];

I won't even need to run the line below anymore?

    cpme = DIVAnd_cpme(mask,(pm,pn),(X,Y),(x,y),f,len,epsilon2);

Sorry for my lack of knowledge in this topic and I really appreciate your big help.

jmbeckers commented 1 year ago

obsval is your vector f, the array containing all observations (they must fall on the domain to make sure the var(obsval) makes sense).

For the rest, yes, there is no need to calculate cpme.

But beware, as already stated, the estimate is only as good as the statistical model used (values of L and epsilon2). If they are not adapted, the error estimate itself makes no sense.

The last line can even be

    myabserror = sqrt(1.0/(1.0+epsilon2)*var(obsval)*erronmean);

Leon6j commented 1 year ago

Thank you!

I got the below error: UndefVarError: var not defined.

Is the function var from a particular Julia package?

jmbeckers commented 1 year ago

var calculates the variance of your data. You can program it yourself or use the one from

using Statistics

Leon6j commented 1 year ago

Got it. Many thanks!

It is working now, although the derived error seems extremely small. I really hope such a real-error function, instead of the current cpme, will be available in the next version of DIVAnd.jl.

Take care.

Leon6j commented 1 year ago

Right now, the error due to DIVAnd gridding is smaller than the measurement error. I think that is most likely impossible.

jmbeckers commented 1 year ago

a) An average of N values can have an error which is smaller than the individual errors. If you have a large number of values, then the error on the average will indeed be very low.
b) Nowhere in DIVA do you specify absolute measurement errors; epsilon2 is again a relative error.
c) The method to calculate the error on the mean does NOT use cpme or any other approximate calculation; it actually uses the "exact" analysis error covariance.
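Point a) is easy to see numerically with toy data (standard error of the mean; illustration only, not DIVAnd output):

```julia
using Statistics

x = randn(10_000)               # N independent "measurements", per-value std ≈ 1
sem = std(x) / sqrt(length(x))  # error on the average shrinks like 1/sqrt(N)
# sem ≈ 0.01, far below the individual error ≈ 1
```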

That said, as I already mentioned, I think your cpme is very small for the analysis itself. With the mentioned value of epsilon2=2, that would mean you indeed have a very high number of data points in your domain.

Leon6j commented 1 year ago

Yes, I do have a very large number of measurements: 30 million data points, but these measurements are spread out over the global ocean. Some regions have terrible coverage, like the Arctic Ocean and the Southern Ocean; I would assume the gridding errors in those regions will be huge. Some regions have dense measurements, like the North Atlantic and North Pacific; in those regions, the gridding errors will be relatively smaller. That's why I thought the cpme would give me an absolute error estimate at each grid point. But either way, it is not as if I had 30 million data points measuring one parameter at one location.

BTW, I put down an epsilon2 of 2 very subjectively. What is a more objective way of defining epsilon2?

Many thanks.

jmbeckers commented 1 year ago

Well, the gridding errors in regions void of data depend on the correlation length. If there are data within reach of this length, the analysis can provide some information and reduce the errors. So I wonder: what units are you working with for the correlation lengths LX (and LY), and what are your coordinates?

Does the actual analysis look natural?

Leon6j commented 1 year ago

The correlation lengths are 84 x 42 (longitude x latitude). The units are degrees.

The actual analysis looks very natural.

jmbeckers commented 1 year ago

But then it is quite clear that you will get a very low gridding error, because you basically tell DIVA that each "point" is representative of a region of 84x42 degrees!

Now imagine how many points you have on average in such a box: 84*42/(360*180) * 30000000 ≈ 1.6E6 points. Then for sure the gridding error will be very low.
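That back-of-the-envelope count, as a one-liner:

```julia
# observations per 84° x 42° correlation box on a 360° x 180° globe
npoints = 84 * 42 / (360 * 180) * 30_000_000   # ≈ 1.63e6
```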

I think you should reconsider the analysis parameters. (There a plenty of examples to see the effect of these parameters in the workshop notebooks)

That said, with such a coverage, you might use other error methods (for the gridding; for the mean values, the method I provided earlier remains OK):

    map,method = DIVAnd_errormap(mask,(pm,pn),(X,Y),(x,y),f,len,epsilon2,s; method = "auto", Bscale = false, alphabc=2)

jmbeckers commented 1 year ago

A notebook with explanations of how to go from relative errors to standard deviations and errors on averages:

https://github.com/gher-uliege/Diva-Workshops/blob/master/notebooks/3-Analysis/14b-errormaps-demo.ipynb

I will try to update the notebook, but I think that closes the original issue, as the problem now is rather related to the determination of analysis parameters (out of scope here).

Leon6j commented 1 year ago

function DIVAnd_errormap

Many thanks! Unfortunately, I got the error below: "LoadError: UndefVarError: DIVAnd_errormap not defined."

I'm unable to find many results by googling this function.

Do I really need the word "function" in the syntax?

jmbeckers commented 1 year ago

No, of course, you are right, it is without the "function" keyword (cut and paste error)

Leon6j commented 1 year ago

Thank you and sorry for the many questions. Unfortunately, I still get the error below: "LoadError: UndefVarError: DIVAnd_errormap not defined".

Here is my exact line of code: "map,method = DIVAnd_errormap(mask,(pm,pn),(X,Y),(x,y),f,len,epsilon2, s, method = "auto", Bscale = false, alphabc=2)"

jmbeckers commented 1 year ago

Have you updated DIVAnd recently? You need one of the latest versions.

To see the current version:

    using Pkg
    Pkg.status("DIVAnd")

To update:

    using Pkg
    Pkg.update("DIVAnd")

jmbeckers commented 1 year ago

But I insist: before playing with error maps, you really should make sure you have a robust analysis setup (in particular, when working at global scale you should consider working with spherical distances instead of degrees and choose reasonable L and epsilon2 values); otherwise the error map is just meaningless even if you are able to calculate one.

Leon6j commented 1 year ago

I'd love to do this properly. You mentioned a robust analysis setup based on spherical distances. How specifically could I do that? Below is my current setup. Many thanks!

    # All points are valid points
    mask = trues(size(X));
    pm = ones(size(X)) / (X[2,1]-X[1,1]);
    pn = ones(size(Y)) / (Y[1,2]-Y[1,1]);

    # Correlation length
    LX = 84; LY = 42;
    len = (LX, LY);
    fi,s = DIVAndrun(mask,(pm,pn),(X,Y),(x,y),f,len,epsilon2;alphabc=2);

jmbeckers commented 1 year ago

Example here

Note how with these metrics you need the correlation length in meters while positions are still in degrees:

https://github.com/gher-uliege/Diva-Workshops/blob/master/notebooks/3-Analysis/Sphericalcoordinates.ipynb
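For intuition, here is a hand-rolled version of such metrics for a regular 1° x 1° lon/lat grid (the grid and all names are illustrative; the linked notebook shows how DIVAnd builds them for you), giving pm and pn in 1/meters so that len can then be specified in meters:

```julia
# Inverse grid spacings (pm, pn) in 1/m for a regular 1-degree grid.
R = 6371e3               # Earth radius in meters (assumed value)
deg2m = pi * R / 180     # meters per degree along a meridian

lon = 0.5:1.0:359.5
lat = -89.5:1.0:89.5
Y = [lat_j for lon_i in lon, lat_j in lat]   # latitude at every grid point

dlon = 1.0; dlat = 1.0                  # grid spacing in degrees
pm = 1 ./ (deg2m .* cosd.(Y) .* dlon)   # inverse zonal spacing: grows poleward
pn = fill(1 / (deg2m * dlat), size(Y))  # inverse meridional spacing: constant
```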

If you use that version, the error calculation on the mean value needs an adaptation too (I will include it in the next release of

https://github.com/gher-uliege/Diva-Workshops/blob/master/notebooks/3-Analysis/14b-errormaps-demo.ipynb

But again, there are plenty of notebooks to look at to learn more about DIVAnd