sgbaird commented 3 years ago

From Dr. Johnson:

You could discuss “the Brandon criterion” in the context of the size of the FZ and GB correlation lengths. For example the 15 degree threshold for low angle grain boundaries represents approximately 2 correlation lengths, and roughly 25% of the diameter of the FZ. I wonder if the correlations over a distance of 15 degrees in any direction are the same as those in misorientation angle (I.e what is special about misorientation angle?)

From Dr. Homer:

In Hunter’s project I was hoping to disprove the low-angle hypothesis, but ended up supporting it. Not necessarily the 15 degrees but the overall concept doi:10.1016/j.scriptamat.2020.03.062

Sterling:

I think the linked dataset would be pretty easy to use in order to get estimates of correlation lengths within/across those subspaces and w.r.t. to correlation length. If someone splits the data into the subspaces of interest and converts it to a format I can work with (misorientation/BP normal pairs or rotation matrices or orientation of each grain + BP normal), I could run the data through and get correlation lengths for each of the dataset splits. Anything past that, I'd probably need to defer to someone else.

The data is hosted on Mendeley

The grain boundaries would need to be converted to octonions. i.e. I'd need someone to write/adapt a conversion function from this format to octonions (even if through one of the intermediates, such as five2oct.m). Perhaps it's not that hard though, since the UVW and tilt angle can be converted using ax2qu.m, and the other grain is just the identity quaternion [1 0 0 0]. Correct? Does the BP normal determine whether it's a tilt or twist? If so, what is it? Really, I just need the orientation of the two grains and a BP normal.

Once someone has decided on the appropriate splits, and the conversion is verified, then I can run it through (but probably just once).

sgbaird commented 3 years ago

Examples of correlation length plots from Dr. Johnson:

sgbaird commented 3 years ago

@oliverkjohnson

Notes from Earlier

covariance matrix with a specified "Xnew" that contains the boundaries we want to look at all of the input data and the special points to look at: list of low-Sigma boundaries Sigma 1 - Sigma 11 corresponding pairwise distance matrix

1 Olmsted model (7.4 degree), 1 Fe model

Coordinates for all the points - 7D & 5D versions Olmsted - Sigma 1 is row # ..., Sigma 3 is row # ... Kim Fe - do an ismembertol()

"grid" - 20000 points

GB Key

1-388: Olmsted GBs (same order as original, 388 Olmsted GBs)
389-2888: VFZ-NBOs (i.e. perfect crystals, 2500 NBOs)
2889-22888: Random VFZ-GBOs (20000 VFZ-GBOs)

Data

gpr388-olmsted-cov.mat
gpr46883-kim-cov.mat (EDIT 2021-05-10 - Ignore the Fe data for now)
eucl22888.mat (pairwise-distance matrix and 8D VFZ coordinates)

:warning: the distance matrix is given in "half-the-Toby-metric" format in radians (i.e. the non-scaled Euclidean distance). This is because the code is set up to work with normalized coordinates. That is, when you probe the GPR model at new coordinates, you do so with unit normals on the hypersphere (as opposed to norm==sqrt(2)). This shouldn't change anything for the covariance matrix because:

(i.e. when you scale the Euclidean distance, you just scale the correlation length with it)

I left it in 8D coordinates because the "new" GBs are the same for the Ni and Fe datasets, but the coordinates will differ in their 7D (and lower) dimensions (the calculated distance-preserving SVD transformation is different for the Ni and Fe data). Also, the degenerate dimension shouldn't cause trouble for any run-of-the-mill dimensionality reduction techniques (PCA, MDS, T-SNE, etc.). If there's some other reason you need the dimension-reduced/rotated coordinates, let me know. Alternatively, a simple way to do it yourself is:

load('eucl22888.mat', 'pts', 'd')
dim = 5;
[coeff, score] = pca(pts, 'NumComponents', dim);
X = score*coeff';

See pca(). In the VFZ repo, I generally use proj_down.m to accomplish this, but realized recently that pca() does most (if not all) of what's in proj_down.m already. Granted I'm not sure what the equivalent of proj_up.m is with pca().

When I output to .csv files, the covariance matrices take up ~10 GB, so I'm leaving it in .mat format for now.

Low-GBE/Low-Sigma Key

Sigma	ID
3	3
5	169
7	32
9	21
11	33

Plots

I'd like it if you would use built-in binscatter(x, y) or equivalent when you make the plots like the ones above, esp. since we're dealing with so many points (~20k), and so many more pairwise elements (~500 million). How does that sound?

Code

If you're ok with it, I'd also like to add the code that will reproduce these plots and analysis to the repo.

sgbaird commented 3 years ago

I'm thinking:

max 1-2 additional figures
1-2 sentences to the abstract
2-4 sentences to the conclusion
A methods subsection with 1-2 paragraphs
1-2 paragraphs added to "Correlation Lengths" Results subsection
1-2 paragraphs in a "Correlation Lengths" subsubsection

I've added some placeholders throughout the paper for these.

The main body is already at ~10 pages (excluding references) and 4 figures, and I'd like to keep it to no more than 12 pages and 6 figures. Anything extra can go into supplementary info instead.

oliverkjohnson commented 3 years ago

Two requests for the Data:

Currently (5/10/2021) eucl22888.mat only contains the 8D coordinates (i.e. there is no pairwise distance matrix like the description says). Did you intend to include the pairwise distance matrix as well?
Can you also post the input GB energies for each of the points (i.e. the Olmsted and Kim energies, not the predicted GPR energies)?

oliverkjohnson commented 3 years ago

Also, the covariance matrix in "gpr388-olmsted-cov.mat" is formatted as a vector that is 1-by-261918828. Is this just the upper triangular portion of the covariance matrix (it has the right number of elements, but I just wanted to double check)?

sgbaird commented 3 years ago

Currently (5/10/2021) eucl22888.mat only contains the 8D coordinates (i.e. there is no pairwise distance matrix like the description says). Did you intend to include the pairwise distance matrix as well?

Yes, I intended it to have the pairwise distance matrix. My bad, I'll try to fix it, but in the meantime you can do pd=pdist(X); where X are the coordinates which should compute pretty quickly. If X has norm sqrt(2), then it should already be the octonion distance (radians). If X has unit norm then multiply by 2 to get octonion distance.

sgbaird commented 3 years ago

Can you also post the input GB energies for each of the points (i.e. the Olmsted and Kim energies, not the predicted GPR energies)?

Olmsted and Kim subset, but I still need to update the Kim data (covariance matrix, etc.)

sgbaird commented 3 years ago

Also, the covariance matrix in "gpr388-olmsted-cov.mat" is formatted as a vector that is 1-by-261918828. Is this just the upper triangular portion of the covariance matrix (it has the right number of elements, but I just wanted to double check)?

See squareform() to convert back to the full matrix. I think it's upper triangular.

oliverkjohnson commented 3 years ago

Can you also post the gpr predicted energies for all 22888 points (for the olmsted data)?

oliverkjohnson commented 3 years ago

There are 3 techniques I'm planning to use to estimate the correlation length:

The semivariogram method from geostatistics (this is a standard technique to estimate spatial correlation lengths for kriging models in geostatistics).
The correlation function $\color{cornflowerblue}\rho(x,x')$
The smoothness length parameter estimated by the gpr fitting algorithm

Here is an initial result for (1) showing the semivariogram with a gaussian fit (it is notable that the Gaussian model fits so well, and other models like exponential or spherical would not fit well---the inflection point is characteristic of the Gaussian model):

We get an estimate for the correlation length from the fit (red line), but to have meaning I need to understand what units that distance is in, and whether it is a toby distance (2 times actual degrees) or actual degrees. Your definition above says "the distance matrix is given in "half-the-Toby-metric" format in radians (i.e. the non-scaled Euclidean distance)." Does that mean that to get meaningful distances in degrees I just need to convert the values on the x-axis above from radians to degrees (since it is already half of the toby distance which is twice the crystallographic angular distance)? @sgbaird Sterling can you comment on this?

We also need to decide how to parameterize the semivariogram model in a way that is consistent with our definitions (i.e. there's lots of different parameterizations of a "Gaussian"). My initial preference would be to use $\color{cornflowerblue}\kappa(h) = c_s \left(1-\exp{\left(\frac{-\omega^2}{2\ell^2}\right)}\right)$

Assuming that my interpretation of the units is correct, this gives a Gaussian correlation length (smoothness length) of 3.77 degrees for the Olmsted data using the semivariogram technique.

oliverkjohnson commented 3 years ago

I am finding there is an issue with the covariance matrix in the format you have posted them in. Since it is a linear representation of the lower-triangular (below the main diagonal) portion of the actual covariance matrix, it is missing the main diagonal. The main diagonal of the covariance matrix contains the marginal variances, and is required to extract the correlation matrix from the covariance matrix. So I need that main diagonal, either separately as a vector, or I need the full covariance matrix.

sgbaird commented 3 years ago

We get an estimate for the correlation length from the fit (red line), but to have meaning I need to understand what units that distance is in, and whether it is a toby distance (2 times actual degrees) or actual degrees. Your definition above says "the distance matrix is given in "half-the-Toby-metric" format in radians (i.e. the non-scaled Euclidean distance)." Does that mean that to get meaningful distances in degrees I just need to convert the values on the x-axis above from radians to degrees (since it is already half of the toby distance which is twice the crystallographic angular distance)? @sgbaird Sterling can you comment on this?

It's in radians, and needs to be multiplied by 2 and converted to degrees to get it in the traditional octonion sense. All the numbers in the paper so far have been in the octonion sense.

Assuming that my interpretation of the units is correct, this gives a Gaussian correlation length (smoothness length) of 3.77 degrees for the Olmsted data using the semivariogram technique.

Or 7.54 degrees in the traditional octonion sense, which is really close to the global correlation length in the paper (7.4 degrees). It may be interesting to try using a distance matrix that doesn't have distance overestimation. I've wondered if the correlation length will change significantly.

I am finding there is an issue with the covariance matrix in the format you have posted them in. Since it is a linear representation of the lower-triangular (below the main diagonal) portion of the actual covariance matrix, it is missing the main diagonal. The main diagonal of the covariance matrix contains the marginal variances, and is required to extract the correlation matrix from the covariance matrix. So I need that main diagonal, either separately as a vector, or I need the full covariance matrix.

Right, that was my bad. This is fixed in the next version. The two files are each ~7 GB. I'm also going to be trying for the one day per week schedule, so I'll try to get these to you by next Monday.

oliverkjohnson commented 3 years ago

Sounds good, enjoy the time with the family. Remind my your new baby's name?

One additional issue we need to resolve is how to define the "correlation length" or what criterion we use to say that GBs are correlated over a distance of X. The working definition you and I have both been using is the "smoothness length" that is in the denominator of the argument to the squared-exponential. This is a fine definition, but may at least need some explanation. What the smoothness length really represents is the distance over which the correlation drops from 1 to about 0.61 (i.e. a distance of a single standard deviation), and a correlation of 0.61 is still a decent correlation, if we consider instead a correlation length equal to 2*smoothness_length then we're saying that's the distance over which the correlation drops from 1 to about 0.13, and if we consider a correlation length equal to 3*smoothness_length we are looking at the distance over which the correlation drops from 1 to about 0.01. I think from a practical perspective GBs become effectively uncorrelated beyond a distance of about 2 or 3 smoothness lengths, so I think this nuance is something we need to address in the paper if we're going to assert what GB correlation lengths are (even if we're just saying what we find them to be for the datasets we investigated).

oliverkjohnson commented 3 years ago

Also, just to clarify what items I still need:

The diagonal elements of the covariance matrix
The GPR predicted energy for every point

sgbaird commented 3 years ago

Sounds good, enjoy the time with the family. Remind my your new baby's name?

\<removed>

One additional issue we need to resolve is how to define the "correlation length" or what criterion we use to say that GBs are correlated over a distance of X. The working definition you and I have both been using is the "smoothness length" that is in the denominator of the argument to the squared-exponential. This is a fine definition, but may at least need some explanation. What the smoothness length really represents is the distance over which the correlation drops from 1 to about 0.61 (i.e. a distance of a single standard deviation), and a correlation of 0.61 is still a decent correlation, if we consider instead a correlation length equal to 2smoothness_length then we're saying that's the distance over which the correlation drops from 1 to about 0.13, and if we consider a correlation length equal to 3smoothness_length we are looking at the distance over which the correlation drops from 1 to about 0.01. I think from a practical perspective GBs become effectively uncorrelated beyond a distance of about 2 or 3 smoothness lengths, so I think this nuance is something we need to address in the paper if we're going to assert what GB correlation lengths are (even if we're just saying what we find them to be for the datasets we investigated).

You bring up a really good point. I agree, it will be good to be more precise in the definition and explanation. Glad you have familiarity with these definitions.

oliverkjohnson commented 3 years ago

... it would also be really helpful to have

the misorientation and GB plane (i.e. the standard 5 parameters) for each point (this will help with the Brandon Criterion discussion)

sgbaird commented 3 years ago

I think I have everything put together. First, pull the most recent changes then open and run posterior_data.m. This should download the GPR models from figshare and save the necessary data. I reduced the number of random VFZ-GBOs (npts) from 20000 to 15000 so the matrices are a bit more manageable in size. If you run out of memory, you can comment out the following lines:

    fivelist{i} = five;
    covmats{i} = covmat;
    ds{i} = d;
    ypredlist{i} = ypred;
    ysdlist{i} = ysd;
    clear five covmat d ypred ysd

or lower npts further.

For the Olmsted GPR model, the data is a stack of Olmsted GBs, no-boundary octonions (NBOs), and random points. For the Kim GPR model, the data is a stack of high-symmetry Kim GBs, then the same NBOs and random points as above.

Keys

Olmsted

1-388: Olmsted GBs (same order as original, 388 Olmsted GBs) 389-2888: VFZ-NBOs (i.e. perfect crystals, 2500 NBOs) 2889-end: Random VFZ-GBOs (15000 VFZ-GBOs)

Kim

1-2254: Kim GBs (same order as original, 2254 high-symmetry Kim GBs) 2255-4754: VFZ-NBOs (i.e. perfect crystals, 2500 NBOs) 4755-end: Random VFZ-GBOs (15000 VFZ-GBOs)

Low-GBE/Low-Sigma

Sigma	Olmsted ID	Kim ID
3	3	7
5	169	162
7	32	259
9	21	315
11	33	406

Files

A = gpr388-olmsted B = gpr46883-kim

1 = -cov.mat == covariance matrix 2 = -dist.mat == pairwise distance matrix 3 = -other.mat == 5DOF coordinates (five), unit-norm octonions (pts), high-symmetry input GBEs (y), predicted values (ypred, ysd), low-GBE/low-Sigma IDs (ids), and the corresponding Sigma values (Sigma)

A1, A2, A3 B1, B2, B3

e.g. A1 == gpr388-olmsted-cov.mat

Distances are half of the "Toby" distance. 5DOF coordinates and octonions are given with active convention. Let me know if you have questions.

sgbaird commented 3 years ago

Now that the data is a more manageable size, I'm adding the 6 output files on figshare.

oliverkjohnson commented 3 years ago

Ok, one last issue is the scaling of the octonion ("Toby") distance. When we last talked about it we decided to use half of the "Toby" distance, but that was based on my memory of a derivation I had done to show the relationship between the toby-distance and the misorientation for symmetric tilt GBs. The problem is, I was working from memory and got things mixed up. The actual relationship is that the original octonion distance metric (the "toby distance") is equal to 1/2 of the difference in disorientation angles between two symmetric tilt GBs: $\color{cornflowerblue}\Omega = \frac{1}{2}\Delta\omega = \frac{1}{2}\left|\omega_1-\omega_2\right|$

Which means that I think the relevant distance we should be using is actually $\color{cornflowerblue}\Delta\omega = 2\Omega$ , or "twice-the-toby-distance". This has implications for a bunch of things in this paper (and perhaps in the Computational Materials Science paper as well, though I'm not as sure about that), in particular, it governs what we report as "correlation lengths." So I think we should have a conversation about this to make sure we're both on the same page and we are using a consistent distance scaling. Send me an email or text me to let me know when you want to chat about this and we can finalize our approach so I can make the figures and tabulate my results.

oliverkjohnson commented 3 years ago

Also, the variable five has only 4 columns, so I'm not sure how to extract the 5DOF parameters from it. Can you explain? Is it perhaps just the disorientation quaternion (i.e. it is missing the GB plane)?

oliverkjohnson commented 3 years ago

Ok, the code is not in a state that I can share it yet, but I think I can now successfully calculate and make the plots we want for the paper. Here are the major talking points:

Table showing the global "smoothness-length" parameter* for both datasets (olmsted and kim) from 3 different methods (variogram, posterior correlation matrix, prior estimate from fitrgp**---though I need to talk to you about this, see the note below).
Plot showing the local "smoothness-length" parameter for each CSL GB type***, for both datasets (olmsted and kim), compared to the threshold defined in the Brandon criterion [1]. The discussion here includes the following:
- The fact that the data shows different "smoothness-length" values (both different from the global and different from each other) suggests that it would be better to use a model based on a non-stationary covariance function. This would also have the benefit of possibly resolving cusps more accurately.
- The "smoothness-length" parameters generally increase with Sigma, in contrast to the threshold defined by the Brandon criterion. There is a paper [2] that suggests that the LAGB threshold should be selected based on the material property of interest rather than a fixed value based on dislocation models for the structure of GBs (like the Brandon criterion), and this result supports that suggestion, but further indicates that the tolerance/threshold should also be selected based on the property and that the tolerance for energy is qualitatively different than what the Brandon criterion would suggest.
Plot of $\color{cornflowerblue}l' = l\sqrt{-2\ln{\rho}}$ (see below for an example with artificial values of l). The definition of a "correlation-length" (l') depends on your threshold for defining correlations. We have an equation that tells you what the "correlation-length" will be for any value of rho that you want to use, given the fixed value of l (the "smoothness-length" parameter)--either a local l or a global l. We can suggest a value of rho that we think is reasonable and summarize what that means for the resulting "correlation-length".

footnotes

*We need to make sure we're both using the same distance scaling so that all of the reported parameters (as well as other results from this and the CMS paper) are consistent. For all of my calculations I am using 4EuclideanDistance. We probably need to chat to make sure everything follows this convention (or a different convention that we agree on). It appears that the default settings for fitrgp set the value of sigma_l using the heuristic mean(std(X)), which seems rather irrelevant to me since it doesn't depend on the properties at all (it only depends on the spread of the predictors over their domain), but it gives a value of about 17 degrees, which is only a slight overestimate compared to what I get using the other methods. Did you use this default setting to estimate the kernel parameters, or did you perform hyperparameter optimization for the models you built? I have code to look at all of the representatives of each Sigma, but I need to know the 5 parameters for each GB, and as I mentioned in a previous comment, right now the variable five is nGB-by-4 so I'm not sure exactly what this variable contains (is it just missing a 5th column?, or is it the misorientation quaternion with the boundary plane missing?, etc.) As soon as I get an answer from you I can run this for all of the representatives.

references

[1] D. Brandon, The structure of high-angle grain boundaries, Acta Metall. 14 (1966) 1479–1484. doi:10.1016/0001-6160(66)90168-4. [2] A.H. King, S. Shekhar, What does it mean to be special? The significance and application of the Brandon criterion, J. Mater. Sci. 41 (2006) 7675–7682. doi:10.1007/s10853-006-0665-8.

sgbaird commented 3 years ago

Hi Sterling,

I have been finalizing my calculations for the Acta paper, and there is one thing that I need your help to sort out. I am trying to verify that the covariance matrix you have me is the posterior covariance and NOT the prior covariance. When I plot the correlations (extracted from the covariance matrix grp-olmsted-cov) as a function of distance, I get a curve that follows perfectly a squared exponential distribution, which is suspicious to me in that it has no noise at all. This made me suspect that the covariance matrix you gave me may in fact be the prior covariance rather than the posterior covariance. I searched the internet to see if I could find any information to determine one way or another and the only things I could find were from MATLAB Answers (and you had either asked the questions or commented on them), and the one alternative I found was by using the undocumented function predictExactWithCov. When I did this (using model.mdl.data.ppts as the “new data”) I got a covariance matrix that was very different from the one you had given me, and which has more noise (as I would expect). The smoothness length parameters estimated from the two different covariance matrices are quite different (2.8 degrees for the predictExactWithCov method, and 7.4 degrees for the gpr-olmsted-cov matrix you gave me). I include a plot of the correlation vs. distance for both covariance matrices below so you can see the difference.

I need your help to sort out 2 things:

How do I get the posterior covariance matrix (is it gpr-olmsted-cov, or what I get from predictExactWithCov, or something else?)?

How were the kernel parameters chosen that are stored in the model you fit? In the matlab help it gives the formula that is used to determine the default values from the data (without any hyperparameter optimization), and while the resulting values are close they do not match the values stored in the model, so I’m not sure where those values are coming from.

One note is that I couldn’t find where the input data for your model fit was stored for the Olmsted data (since it is a compact gpr object), so I used model.mdl.data.ppts (which is 7D as the function requires), but I’m not sure that is the correct input data to use, so that may be one source of the discrepancy if that isn’t the right data to use.

Can you help me answer (1) and (2)? Perhaps it would be good to chat over ZOOM to make sure we both understand one another?

Thanks!

Oliver

sgbaird commented 3 years ago

Hi Dr. Johnson,

Thanks for the patience.

I was using:

kfcn = gprMdl.Impl.Kernel.makeKernelAsFunctionOfXNXM(gprMdl.Impl.ThetaHat);
covmat = kfcn(ppts,ppts);

as produced in posterior_data.m, but I'm realizing that I probably should have been using predictExactWithCov. I think I mistook the "Exact" to mean without input noise, but running some tests suggests that's not the case. For example, I can increase the Sigma parameter, and still use the predictExactWithCov function, with the expected result that sampled values are noisier and the uncertainty at the input data is larger:

Compare with the plots from https://www.mathworks.com/matlabcentral/answers/709828

I think this suggests that anywhere I used makeKernelAsFunctionOfXNXM, I need to replace it with predictExactWithCov. I'm sure it will also affect some comments in the text as well.

As for the kernel parameters, there are 3 types of values: initial, fitted (via e.g. gradient descent, LBFGS, quasi-Newton), and Bayesian hyperparameter optimization (or grid search, etc.). The initial values are the starting values, which then get fitted/optimized by maximizing the likelihood function. By default, Bayesian hyperparameter optimization isn't performed (likely because it can be costly). The relevant equations are here which link to other methods such as FIC approximation, which we use for the larger datasets. I played around with the Bayesian hyperparameter optimization a lot, but it takes a while for the large datasets and doesn't appear in any of the paper results. That is, I only used the fitting/optimization, which is a quasi-Newton optimization by default.

The input data is stored in mdl.mesh.ppts (deprecated naming scheme from using input and mesh interchangeably early on). The output (validation) data is stored in mdl.data.ppts. For the models on figshare though, what was stored in mdl.data.ppts was dummy data (just a single octonion) since I used all the data for training. For example, __gpr58604_gitID-3b7183c_puuID-ce759533_kim-Fe-train-all-data-fic.mat__. So you may be using older versions of the files, though I don't think it will affect the results much. You can either download the updated figshare models or rerun posterior_data.m (make sure to pull the latest GitHub changes first).

Sterling

sgbaird commented 3 years ago

Also, the variable five has only 4 columns, so I'm not sure how to extract the 5DOF parameters from it. Can you explain? Is it perhaps just the disorientation quaternion (i.e. it is missing the GB plane)?

I think I answered this separately (phone call, text?) but this should be resolved now. For completeness, I'm adding a comment here. There was a bug where I was only saving the misorientations and not the BP normals as you suggested.

sgbaird commented 3 years ago

I reran posterior_data.m and am in the process of uploading these onto figshare.

sgbaird commented 3 years ago

When do you think you can rerun the data by? (Obviously after your vacation)

oliverkjohnson commented 3 years ago

When do you think you can rerun the data by? (Obviously after your vacation)

I plan to do this by the end of the week. Will you have any time this week in case I have questions?

sgbaird commented 3 years ago

Ok, sounds great! Yes

sgbaird commented 3 years ago

Do you think the large change for the Olmsted dataset and the smaller change for the Kim dataset from prior to posterior might have to do with the amount of input data available? https://www.researchgate.net/post/effect_of_Prior_on_Posterior_on_Bayesian_Analysis

oliverkjohnson commented 3 years ago

Do you think the large change for the Olmsted dataset and the smaller change for the Kim dataset from prior to posterior might have to do with the amount of input data available? https://www.researchgate.net/post/effect_of_Prior_on_Posterior_on_Bayesian_Analysis

Maybe. Gaussian process regression essentially just solves a generalized non-linear least-squares problem, in which the objective function is the misfit, which is itself a weighted sum of two types of "error": (1) the error between the posterior model and the prior model, and (2) the error between posterior model predictions and the data. In other words it is trying to simultaneously make the posterior model as close as possible to both the prior model and the observed data, where the respective weights are the inverse of the prior model and data covariance matrices. If you have large uncertainty in your data then it will weight that less and the posterior will look more like your prior. On the other hand, if you have large uncertainty in your prior, then the posterior will be closer to the data. So it's not just how much data you have, but also how much you trust that data.

That being said, the amount of data does have an influence. Remember in the 1D case how when the observations were sparse, there were oscillations in between? It could be that because the Olmsted data is more sparse it has a lot more of those oscillations, and perhaps that is what leads to a shorter length-scale of correlation? I may try to do some simple tests with a 1D inference to see if I see similar results.

oliverkjohnson commented 3 years ago

Do you think the large change for the Olmsted dataset and the smaller change for the Kim dataset from prior to posterior might have to do with the amount of input data available? https://www.researchgate.net/post/effect_of_Prior_on_Posterior_on_Bayesian_Analysis

Maybe. Gaussian process regression essentially just solves a generalized non-linear least-squares problem, in which the objective function is the misfit, which is itself a weighted sum of two types of "error": (1) the error between the posterior model and the prior model, and (2) the error between posterior model predictions and the data. In other words it is trying to simultaneously make the posterior model as close as possible to both the prior model and the observed data, where the respective weights are the inverse of the prior model and data covariance matrices. If you have large uncertainty in your data then it will weight that less and the posterior will look more like your prior. On the other hand, if you have large uncertainty in your prior, then the posterior will be closer to the data. So it's not just how much data you have, but also how much you trust that data.

That being said, the amount of data does have an influence. Remember in the 1D case how when the observations were sparse, there were oscillations in between? It could be that because the Olmsted data is more sparse it has a lot more of those oscillations, and perhaps that is what leads to a shorter length-scale of correlation? I may try to do some simple tests with a 1D inference to see if I see similar results.

Ok, I did tests with some 1D model functions and I observed the same phenomenon: the length-scale obtained from the posterior correlation matrix is much smaller than what the data seems to suggest based on the variogram, and the posterior correlations show the same oscillatory behavior. These results appear to be independent of the amount of noise or the sparsity of the data (I considered 1% noise vs. 20% noise, and npts = [20, 100, 1000] in a 1D space). I also found that if I apply a variogram to estimate the length-scale from the posterior model itself (as opposed to the posterior correlations) the length-scale is very close to what the variogram and the fitrgp found.

So, I don't understand why the posterior correlation approach yields the results that it does, but I am inclined to assume that it's telling us a different kind of information than what we are looking for, so I think I am fine just using the other two methods (variogram and the fitrgp estimate of the prior kernel parameters). What do you think?

oliverkjohnson commented 3 years ago

Ok, I think I figured it out. Too long to explain right now, but the result is that I should be getting the posterior smoothness length from the posterior mean model, NOT from the posterior covariance matrix. I'll fix that and then I should be able to get the final calculations done and figures sent to you tomorrow or Saturday.

oliverkjohnson commented 3 years ago

Ok, I have the "final" figures made and I uploaded them to the interp5DOF Overleaf document in the "figures" directory. I don't have time yet to write about them, but I will try to get that done next week. Here's the results in tabular form:

<\tr> <\tr>

	$\color{cornflowerblue}l\ \ [\mathrm{toby\ deg.}]$
Dataset	Variogram (Input Data)	Variogram (Posterior Mean at Input Points)	Variogram (Posterior Mean at Random Points)	fitRGP Prior
Olmsted	7.5491	7.7316	8.8551 +/- 0.1970	7.3995
Kim	6.2665	7.5179	7.7082 +/- 0.1746	8.3379

And here they are in case you're interested: GlobalCorrelationLengthVariograms Fig. 1: Global variogram fits (a) Olmsted dataset, (b) Kim dataset.

LocalCorrelationLengthVariogramsOlmsted Fig. 2: Local variogram fits for Olmsted dataset.

LocalCorrelationLengthVariogramsKim Fig. 3: Local variogram fits for Kim dataset.

BrandonCriterionComparison Fig. 4: Brandon criterion comparison to local smoothness lengths (l).

CorrelationLengthVsRho Fig. 5: Correlation length for Olmsted & Kim datasets as a function of correlation strength (rho).

sgbaird commented 3 years ago

WOW! This is great. Thank you for taking the time to put all this together. I think this is really informative and can give us some really interesting things to discuss in the paper. I'm noticing that the limits of d_Omega are generally about 30-35 degrees. Is this because you set a threshold on the data or is there some other reason I'm missing? Normally, I thought the "Toby" distance in our VFZs had a max principal component of ~60-70 degrees, so just wanted to make sure it was actually "Toby" distance (and if so, I'm definitely OK with it).

oliverkjohnson commented 3 years ago

This is Toby distance, a distance cutoff was used. In variogram methods it is conventional to only consider distances up to about half of the maximum distance (hence about 30 deg instead of the full range of ~60 deg).

oliverkjohnson commented 2 years ago

Ok, I have included the figures in the Overleaf manuscript and finished writing my part in both the methods and results sections. With these additions I think you'll want to review the content of the manuscript as a whole and then write yourself an outline to figure out how best to organize the paper so that it flows and is cohesive (right now it is a bit like a patchwork quilt for obvious reasons--I think the filename "main-frankenstein-2.tex" says it all). Once you've got the outline figured out, I think putting the finishing touches on it should be much easier. I'll wait to hear from you once you've got a "final draft" ready for review. Let me know if you need my input before then. I think it's really coming together to be a nice paper.

sgbaird commented 2 years ago

See my email with subject title: Re: Request for Feedback on interp5DOF / Acta Mat Manuscript (May 10)

sgbaird / interp5DOF-paper

Correlation lengths of GB subspaces #9

Notes from Earlier

GB Key

Data

Low-GBE/Low-Sigma Key

Plots

Code

Keys

Olmsted

Kim

Low-GBE/Low-Sigma

Files

footnotes

references