I added the code to perform numerical integration to the notebook (you can ignore the TODO at the bottom now), and once I did so, correctly normalizing led me to realize that the likelihoods were ever so slightly more confident. But not by a lot.
@yentyu, any thoughts on this general observation? If not, I will close this issue since the code has been added.
The only trick was getting the reshaping to work out correctly as a lambda expression, which required overloading my exponential decay model to handle floats in addition to numpy arrays.
```python
# Setup from earlier in the notebook: `distributions`, `exponential_decay_model`,
# `observed_data`, `sigma`, `num_observations`, and `lam_mesh` are defined there.
import scipy.integrate

# Product of independent normals, each centered at one observed datum.
D = distributions.parametric_dist(num_observations)
for i in range(num_observations):
    D.assign_dist('norm', dim=i,
                  kwds={'loc': observed_data[i], 'scale': sigma})

# "Evidence" term: integrate the likelihood over the prior support [-1, 1].
evidence = scipy.integrate.quad(
    lambda x: D.pdf(exponential_decay_model(x)[:num_observations].reshape(1, -1)),
    -1, 1)[0]

outs = exponential_decay_model(lam_mesh)
likelihood = D.pdf(outs)
```
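For reference, since the prior is uniform on [-1, 1], its constant density cancels in Bayes' rule, so the normalization step reduces to dividing by the evidence integral. A sketch, not taken verbatim from the notebook:

```python
# Sketch (assumption): with a uniform prior, the constant prior density cancels,
# so the normalized posterior on the mesh is the likelihood over the evidence.
posterior = likelihood / evidence
```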
Here are some thoughts:
You could try to see if the "evidence" constant is being under-approximated by the quadrature. Since dividing by a too-small constant inflates the density, under-approximation would correspond to a slightly lower apparent variance (due to numerical rounding).
You could compute the expected value of the variances (just compute the variance for each of the 20 repeats and take the average). Or a 95% prediction interval (since it's more or less normal, just take 2 std. from the mode/mean). This would give you a better numeric measure of "confidence" than an eyeball norm of the relative likelihoods--though my eyeballs suggest something similar.
Research comparison: one thing you could look at, which is less for the poster and more an interesting avenue of research, is to compare the rate of convergence of the variances. You could do this numerically by computing the expected values of the variances as described above and watching how they behave as the sample size increases. It is possible that for this particular model map the Bayesian posterior does have a narrower prediction interval, but doing this analysis would be a way to confirm your suspicions one way or the other.
How would I be able to tell if it's under-approximated? Could the domain over which I'm doing the integration change things?
I am getting the likelihood pdfs in closed form. To take the variances, what would you suggest doing? Rejection sampling? How do I fit a density to pdf values? (Ditto for the third bullet point.)
`scipy.distributions.fit`? One thing I can do with the ability to evaluate the posterior pdf would be to evaluate the densities at some points of interest: 0.4, 0.45, 0.5, 0.55, 0.60. This would make for a "fair" comparison without any need for parametric fits.
What story are we trying to tell?
Main Point: I am not sure any of this analysis regarding the slight differences between the updated and posterior estimates of the true parameter is relevant to the story we are trying to tell in the poster. If we feel confident in the figures, I think the key idea is that both methods result in similar/equivalent estimates. There may be cases where one method is better than the other, but that is a topic for further research (at least w.r.t. the poster).
Responses to Other Questions: Overall, I was mostly giving thoughts on how to troubleshoot the differences you noticed, so if you are short on time and energy, I think we can ignore most of the extraneous details. For completeness:
Not sure with regard to under-approximation. The domain would change things in general, but since your prior is uniform on [-1, 1], integrating over this interval is enough: the density is zero for every value of lambda outside it.
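One concrete check, sketched below (here `integrand` stands in for the lambda expression passed to `quad` above): `quad` already reports its own absolute-error estimate, and re-running with tighter tolerances shows whether the value drifts.

```python
from scipy.integrate import quad

value, abserr = quad(integrand, -1, 1)        # quad returns (value, error estimate)
tight, _ = quad(integrand, -1, 1,
                epsabs=1e-12, epsrel=1e-12)   # retry with tighter tolerances
print(value, abserr, abs(value - tight))      # noticeable drift => suspect quadrature
```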
For the variances, I would do sampling for simplicity. But you could also try integrating... I can't imagine that would be too bad since we're dealing with Gaussians. To be clear:
We want to compute $E(Var(\lambda|d))$. In other words, we want to average a bunch of variances computed from different samples of data. So for each of the blue curves in the last figure, compute the variance $\int_\Lambda \lambda^2 \cdot pdf(\lambda|d)\,d\lambda - \big(\int_\Lambda \lambda \cdot pdf(\lambda|d)\,d\lambda\big)^2$ using quad. Then take the average of the variances computed across the blue curves.
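A minimal sketch of that computation, assuming `pdf` is one normalized blue-curve density supported on [-1, 1] (the names here are illustrative, not from the notebook):

```python
import numpy as np
from scipy.integrate import quad

def variance_from_pdf(pdf, a=-1.0, b=1.0):
    mean = quad(lambda lam: lam * pdf(lam), a, b)[0]
    second_moment = quad(lambda lam: lam**2 * pdf(lam), a, b)[0]
    return second_moment - mean**2  # Var = E[lam^2] - (E[lam])^2

# Average across the repeated datasets (blue curves):
# expected_var = np.mean([variance_from_pdf(p) for p in blue_curve_pdfs])
```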
Alternatively, for each $\lambda$-pdf (blue curve) for a particular dataset, just do rejection sampling, then compute the standard deviation of the sample. Then compute the average of the standard deviations from the different datasets (across blue curves). No need to do a parametric fit--you already have the pdf, so that's unnecessary.
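A sketch of the accept/reject step under the same assumptions (`pdf` is evaluable pointwise on [-1, 1]; `pdf_max` must be an upper bound on the density):

```python
import numpy as np

def rejection_sample(pdf, n, pdf_max, a=-1.0, b=1.0, seed=None):
    rng = np.random.default_rng(seed)
    samples = []
    while len(samples) < n:
        lam = rng.uniform(a, b)
        # Accept lam with probability pdf(lam) / pdf_max.
        if rng.uniform(0.0, pdf_max) < pdf(lam):
            samples.append(lam)
    return np.array(samples)

# Standard deviation of one blue curve, then average across datasets:
# sigma_hat = rejection_sample(pdf, 10000, pdf_max).std()
```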
I didn't understand this comment about evaluating the densities at different points. This doesn't seem quite right to me at all.
Purpose of this analysis: you suggested that maybe a Bayesian inference is generally more confident in an estimate of the truth than a DCI inference. To check this claim, we want to see whether, for different slices of data (i.e., different sets of 100 observations, or different blue curves), the standard deviation of the posterior is smaller (on average) than that of the updated distribution.
Thank you, that was very helpful. So, to summarize:
- use accept/reject to generate samples from the posterior vs. the updated density,
- take the variance of these accepted samples in order to compare confidence levels,
- compute the average over all trials.
The aforementioned addresses precision; let's address accuracy.
Is looking at the MAP/MUD points compared to the truth less interesting?
That sounds like a fair comparison. Much better than what we have at this point in the notebook, which basically amounts to a qualitative comparison.
This may or may not be relevant to the story now, but it is worth mentioning: in some of the discussions I've had recently with experts who use Bayesian inference with large, complicated models, they have trouble getting MCMC to converge, and widening priors actually leads to worse performance. They like the appeal of what Youssef/Ghattas do with Hessians, but that requires specifying a prior mean in the right "neighborhood" of the truth, which may come from using a surrogate or another algorithm. So if the "confident answer" that initializes another algorithm is in some sense "far off" from the truth, there's a chance that the solution will actually land somewhere between the truth and the initial mean. Thus, in some sense it makes sense to look at examples where the prior is confident but wrong as a comparison against our method, which, in theory, would "ignore" the prior more than the standard methods do. Like I said, my theory is that with wide priors we do about as well. The advantages may come in "shifting" an incorrect guess, perhaps one that comes from existing Bayesian methods.
For a numeric estimate of the bias, just do the same thing you did for the numeric estimate of the variance: compute the MUD/MAP over multiple sample sets (of the same size) and take the average of these values computed from different slices of data. Obviously, it should be less biased as the sample size (the number of observations in d) increases.
The estimator is unbiased if $E(\text{estimate}) = \text{true value}$, where the expected value is taken over all data slices.
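A sketch of that numeric check (the function and variable names are assumptions, not from the notebook; `point_estimate` maps one dataset to its MUD or MAP value):

```python
import numpy as np

def estimate_bias(point_estimate, datasets, true_lam):
    # E(estimate) over data slices, minus the true value.
    estimates = np.array([point_estimate(d) for d in datasets])
    return estimates.mean() - true_lam
```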
The Bayesian MAP estimator is known to be biased for nonlinear maps with non-uniform priors. I am sure that our method is biased as well, though how much (in theory) is an open question. Isn't that what you and Troy are working on by computing the closed forms for linear maps?
My feeling is that our methods will currently suffer just as much as Bayesian inference methods when a prior is confidently wrong, but for different reasons.
In the Bayesian context, being confidently wrong means that new data only moves the prior a little bit. Thus, even for large amounts of data, a Bayesian analyst will struggle with convergence.
In our case, while it is true that our method "ignores" the prior more than Bayesian methods do, having a confidently wrong prior means that the data-consistent values are essentially not in the support of the prior. Thus, unless our computational method does a certain amount of effective sampling in the region of the truth, its numeric convergence is likely to suffer just as much.
Again, this is just my gut feeling, but these are things to keep on the back burner.
(I could not get strikethrough as part of the issue title.) This was a fantastic discussion, and I'll close this issue for now. I love that it's archived for future reference. Very helpful.
Add the numerical integration code to compute the integral of the "evidence" term (denominator of the posterior).
Think about how similar the likelihood seems to be to our solutions. Appears almost identical!
Plots:
[x] How do convergence results compare?
[x] How do stability results compare?