jonathf / chaospy

Chaospy - Toolbox for performing uncertainty quantification.
https://chaospy.readthedocs.io/
MIT License

Percentile function #84

Closed marcrocasalonso closed 6 years ago

marcrocasalonso commented 6 years ago

Dr. Jonathf,

First of all, thank you for your time. I am currently trying to reproduce the results of an example from a research paper on uncertainty quantification, with the main goal of comparing results and making sure my code is okay. The paper is called "Efficient Sampling for Non-Intrusive Polynomial Chaos Applications with Multiple Uncertain Input Variables", and the example appears on page 9.

My issue is that all the statistics (mean, variance, standard deviation) agree between the paper and my code, which is good. But when I compare the 95% confidence interval (CI) computed with the cp.Perc function from chaospy, the result does not match the paper. However, I have also implemented the bootstrap method as a second approach to obtain the CI, and with that approach the CI is very close to the paper's.

Do you know how the cp.Perc function works, or do you have any idea why I get different results between Perc, the bootstrap, and the paper?

If you want I can send you the code.

Thank you very much in advance for your time and cooperation.

jonathf commented 6 years ago

Hey!

It is always a good idea to verify the results before trusting new tools.

The cp.Perc method calculates the percentile using Monte Carlo simulation. It shouldn't be too complicated, but it has been a while since I looked at that functionality, so I will not exclude the possibility that you have found a bug.
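Conceptually, a Monte Carlo percentile works roughly like the sketch below (this is an illustration of the idea with a made-up toy surrogate, not chaospy's actual implementation): evaluate the surrogate at random samples from the input distribution, then take empirical percentiles of the outputs.

```python
import numpy as np

def surrogate(x):
    # Hypothetical fitted polynomial surrogate of one uncertain input.
    return 1.0 + 0.5 * x + 0.1 * x**2

rng = np.random.default_rng(42)
x_samples = rng.exponential(scale=1.0, size=100_000)  # draws from the input distribution
y_samples = surrogate(x_samples)                      # propagate through the surrogate

# Empirical percentiles of the model output.
lo, hi = np.percentile(y_samples, [2.5, 97.5])
```

Note that this bounds the spread of the *model output*, which (as discussed further down in this thread) is a prediction interval, not a confidence interval on an estimator.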

Do you mind providing a minimal code that illustrates the behavior?

marcrocasalonso commented 6 years ago

Hey Jonathf! I have attached two documents. The Word document contains information about the example that I'm implementing, and the .zip file contains the Python code.

In the example I use the seaborn and bootstrap libraries; if you want, you can install them to obtain the bootstrap confidence interval with the bootstrap library, and seaborn to see the histogram and density function.

Many thanks.

Marc

example_paper_results.docx

point_collocation_ejemplo_paper_exponential_send_jonathan_fei_2.zip

jonathf commented 6 years ago

Okay, thanks for the material.

When one talks about bootstrapping, it is sometimes a bit ambiguous, for two reasons: 1) there are two versions of bootstrapping, parametric and non-parametric, and 2) the quantity you are targeting with the confidence interval might differ.

Perc is a tool for creating model percentiles using parametric bootstrapping. The bootstrapping you are performing is non-parametric, applied to the expected-value estimator. In other words, they measure different things.
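To illustrate the distinction with a self-contained sketch (made-up data, not from the attached code): non-parametric bootstrapping resamples the observed data with replacement, while parametric bootstrapping fits a distribution first and draws fresh samples from the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=1.0, size=200)  # "observed" samples
n_boot = 2000

# Non-parametric bootstrap: resample the observed data itself with replacement.
nonparam_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(n_boot)
])

# Parametric bootstrap: fit the assumed model first (exponential, with
# scale = sample mean), then draw synthetic datasets from the fitted model.
scale_hat = data.mean()
param_means = np.array([
    rng.exponential(scale=scale_hat, size=data.size).mean()
    for _ in range(n_boot)
])

ci_nonparam = np.percentile(nonparam_means, [2.5, 97.5])
ci_param = np.percentile(param_means, [2.5, 97.5])
```

For well-behaved data the two intervals land close to each other; they diverge when the parametric model assumption is wrong.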

To do non-parametric bootstrapping with a polynomial chaos expansion there is no built-in tool, but it isn't too complicated to implement yourself. For example:

import numpy as np
import chaospy as cp

# `samples`, `solves`, `polynomials` and `distribution` come from your existing
# point-collocation setup; `n` is the number of samples and `reps` the number
# of bootstrap replicates.
solves = np.array(solves)
indices = np.random.choice(len(solves), (n, reps))  # resample indices with replacement
deflexions = [cp.fit_regression(polynomials, samples[:, idx], solves[idx])
              for idx in indices.T]
mb = np.array([cp.E(deflexion, distribution) for deflexion in deflexions])
p4 = np.percentile(mb, [2.5, 97.5])
print("Confidence interval 95% with Chaospy function=", p4)

Note that this is quite slow. It took about 24 minutes on my laptop just now, but it works as intended. For what it is worth, I got the output: ('Confidence interval 95% with Chaospy function=', array([12.35830706, 12.35845556]))

Hope this helps.

marcrocasalonso commented 6 years ago

Hey!:)

I have spent this week studying everything you sent me, and I have a doubt: what exactly am I measuring with these two approaches? The main objective of both techniques is to obtain the confidence interval, right?

Thank you. Hug

jonathf commented 6 years ago

When you talk about a "confidence interval", it usually refers to parameter estimation, typically in the frequentist approach to statistics. And that is exactly what you do when you bootstrap: the parameter is mu, and its (unbiased) estimator is the sample average. It is sample based, so bootstrapping is a hand-in-glove match.

On the other hand, a polynomial chaos expansion is what is called a surrogate model. It doesn't give you a parameter estimate, but a full model approximation. The whole point of this approach is that, if the approximation is good, analysis on the proxy can be done much more cheaply. The mean, for example, is not computed by sampling but analytically in PCE, making it a bad fit for placing a bootstrap step between the PCE and the parameter estimation, because the estimation error there is 0.
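To make the "analytical mean" point concrete, here is a hypothetical toy expansion (the coefficients are made up to echo the 12.35 mentioned later in this thread, not taken from the paper): with an orthogonal basis, the mean of the expansion is exactly the zeroth coefficient, so there is no sampling error left to bootstrap.

```python
import numpy as np

# Toy expansion in Legendre polynomials, orthogonal w.r.t. Uniform(-1, 1).
c = np.array([12.35, 0.8, -0.3])  # hypothetical expansion coefficients

def expansion(x):
    p0 = np.ones_like(x)            # P0(x) = 1
    p1 = x                          # P1(x) = x
    p2 = 0.5 * (3.0 * x**2 - 1.0)   # P2(x)
    return c[0] * p0 + c[1] * p1 + c[2] * p2

# E[P1] = E[P2] = 0 under Uniform(-1, 1), so E[expansion] is exactly c[0].
# The Monte Carlo estimate below only confirms the analytical value.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 1_000_000)
mc_mean = expansion(x).mean()
```

Bootstrapping `mc_mean` around the analytical value c[0] would always return the same number, which is the "estimation error is 0" point above.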

What you were doing in your first attempt, by the way, was to create a CI not for the mean but for the full model. Unless you make the argument that the PCE is itself an approximation of an estimator, you have moved away from a CI and into prediction-interval territory. Which is fine, but you are measuring something different.

My suggestion was somewhat of a hybrid approach. Instead of bootstrapping the proxy model itself, you can bootstrap the whole proxy-model creation process, all the way down to the metric of interest. This makes sense because A) the creation requires a sampling scheme, making it compatible with bootstrapping, and B) in the end the point of using the proxy is still to retrieve parameter estimates.

Hope this helps.

marcrocasalonso commented 6 years ago

But when I use the Perc function with chaospy, the interval that chaospy calculates does not contain the mean...

mean = 12.35 = Fourier_coefficient_0, 95% interval = [10.75, 11.44]

How should I interpret this result?

jonathf commented 6 years ago

Are you sure? Your code specifies the interval [47.5, 52.5] for Perc which is the 5% CI, not 95%.

This just implies that your estimated mean is offset a bit from the median, which is what you converge to as you get close to the 50% percentile.

marcrocasalonso commented 6 years ago

Yes, yes, you are right! But if you write Perc[2.5, 97.5] the interval lies inside the probability density function, and that's okay. However, am I measuring the 95% confidence interval with this Perc function? And is the bootstrap approach then not the 95% confidence interval, given that the results are so far apart? Sorry for all these questions... I think there is something I don't understand well when I compare these two approaches to the CI...

jonathf commented 6 years ago

Let us say you have a random variable X which is exponentially distributed with lambda=1. Let us also say that you have {X_1, ..., X_100} i.i.d. samples drawn from X. Simple analysis gives you that E(X)=1, but let us assume we do not have this available.

Let us ask the question: what is the best approximation of E(X), and what is the error on that approximation? A classical question from the literature.

Starting simple, we know that the sample mean X_bar = 1/n sum(X_n) should work pretty well. How well? The error on this particular estimate is percentile([0.025, 0.975], X_bar). This quantity exists, since each X_n is a random variable, and therefore X_bar is too. Finding this CI is a good fit for bootstrapping. For the sake of argument, let us say it gives you the confidence interval [0.96, 1.12]. It is quite narrow, because there are enough samples to make X_bar much more stable than X. This is what you have done, and should be familiar territory. If you instead take the prediction interval of X directly, you get something like [0.03, 3.69].
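The contrast between those two intervals can be reproduced in a few lines of numpy (a sketch with simulated data; the exact numbers depend on the random draw):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=1.0, size=100)  # {X_1, ..., X_100}, true E(X) = 1

# Bootstrap CI for the estimator X_bar: resample, recompute the mean, repeat.
boot_means = np.array([
    rng.choice(x, size=x.size, replace=True).mean()
    for _ in range(5000)
])
ci = np.percentile(boot_means, [2.5, 97.5])  # narrow: X_bar is stable

# Prediction interval of X itself, for contrast: much wider.
pi = np.percentile(x, [2.5, 97.5])
```

The CI shrinks as the sample size grows, while the prediction interval does not: it reflects the spread of X itself.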

Another approach is to apply surrogate modeling. Here, instead of creating a predictor, we create a new random variable X_hat. Note that X_hat is similar to X, not to X_bar. Each of the three variables has (roughly) the same mean, but their variation is quite different. The reason for using a surrogate variable is that X_hat is analyzable, while X isn't. So finding E(X_hat) can always be done analytically. Let us say it produces the value 0.98.

If you now ask the question "what is the confidence interval percentile([0.025, 0.975], E(X_hat))?", you hit a wall. The interval will always be [0.98, 0.98]. This is because, as mentioned, the last step, E(X_hat), is computed analytically, and therefore it has no error.

Now, computing percentile([0.025, 0.975], X_hat) is obviously possible; you used the Perc function to do it. But that doesn't give you the confidence interval for the error of the estimate. Instead it gives you the prediction interval for X. The values should be close to [0.03, 3.69]. It is no longer a metric of how well you are doing in estimating the mean; instead it estimates how widely the original exponential distribution is spread.

How do you get around this? Move the bootstrapping process to include the creation of X_hat! That is the (very slow) method I proposed for you. It works, but it does not involve the Perc function.

marcrocasalonso commented 6 years ago

Many, many thanks for this explanation. I have understood you! My error was a misinterpretation between percentiles and confidence intervals with bootstrapped samples and the surrogate polynomial; a conceptual error. Thank you for all your time explaining it to me in such an accessible way, and for your scientific contributions. I have valued it a lot. See you soon! Hug

jonathf commented 6 years ago

I am glad it helped you understand.