stan-dev / loo

loo R package for approximate leave-one-out cross-validation (LOO-CV) and Pareto smoothed importance sampling (PSIS)
https://mc-stan.org/loo

Extremely large k? #58

Closed yao-yl closed 6 years ago

yao-yl commented 6 years ago

Currently I am using the PSIS diagnostic to evaluate ADVI performance in all Stan examples. Here is one example in which the log density ratio between the target and the proposal is extremely heavy-tailed (if you are curious, it is from /ARM/Ch.18/weight_censored.stan, though that is unrelated). The input log_ratio is a vector of length 3000 without any NA, but I get an error message when I use the function psis() to estimate k:

> library(loo)
> load("log_ratio.RData")  ## attached file
> psis(log_ratios = log_ratio)  ## log_ratio is a vector of log ratios
Error in if (any(k > too_high)) { : missing value where TRUE/FALSE needed

The log_ratio RData file is attached at https://drive.google.com/open?id=1SdZB36hqkG8ftfh5L_efyehiVAGDnLWF It seems the k-hat estimate is so large that it becomes NA. This error appears in both the released version (2.0) and the development version.

If we check the density of this log ratio, it is indeed heavy in the right tail. If I use a subset of log_ratio, say log_ratio[1:300], I get a finite k-hat = 60. As I increase the sample size, the estimated k-hat also increases and eventually blows up, which I believe is why the error occurs.

I can also make everything run by manually deleting the single largest log ratio.
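A minimal sketch of both checks (assuming log_ratio has been loaded from the attached RData file; psis() is the loo 2.0 function used above):

library(loo)
load("log_ratio.RData")

## k-hat as a function of subset size: it grows with n and eventually
## the estimate becomes NA, triggering the error above
for (n in c(300, 1000, 3000)) {
  k <- tryCatch(psis(log_ratios = log_ratio[1:n])$diagnostics$pareto_k,
                error = function(e) NA)
  cat("n =", n, " k-hat =", k, "\n")
}

## dropping the single largest log ratio is enough to make psis() run
fit <- psis(log_ratios = log_ratio[-which.max(log_ratio)])
fit$diagnostics$pareto_k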

From a practical perspective, I guess it is enough to add something (e.g., a pre-truncation of the log ratios, or an if statement that catches k-hat = NA) to prevent k-hat from blowing up, since k > 60 is already large enough for users to stop immediately. Furthermore, I am curious how we can resolve this theoretically: we claim the generalized Pareto distribution can approximate any one-dimensional tail distribution well, so why do we get this "non-convergence" (\hat{k} \to \infty)?

jgabry commented 6 years ago

Maybe we set k-hat = Inf? And could this just be a case where we need many more tail samples for the theory to hold up?


avehtari commented 6 years ago

Furthermore, I am curious how we can resolve this theoretically: we claim the generalized Pareto distribution can approximate any one-dimensional tail distribution well

We don't claim this for any distribution. We state that asymptotically, if we go far enough into the tail, it holds for many. I think these very large k-hats are due to not yet being far enough in the tail.

If I use a subset of log_ratio, say log_ratio[1:300], I get a finite k-hat = 60

Are the log_ratios in random order? With a small sample size, the prior will also have some influence.

As I increase the sample size, the estimated k-hat also increases and eventually blows up

I assume that this sample size is all of the log_ratios, and that k-hat is estimated using only a certain portion of the largest log_ratios?

From a practical perspective, I guess it is enough to add something (e.g., a pre-truncation of the log ratios,

Pre-truncation of the log ratios is difficult, as without a model we don't know what counts as too large a weight.

since k > 60 is already large enough for users to stop immediately

I think any k > 2 is so large that we don't learn much more by knowing the value more accurately.

Maybe we set k-hat=Inf?

I think that would be a good choice, as it would indicate that the k-hat estimation failed.
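A minimal sketch of such a guard (hypothetical, not the actual loo internals; too_high is the threshold named in the error message above):

## hypothetical guard: map a failed (NA) k-hat estimate to +Inf so the
## diagnostic comparison never sees a missing value
k[is.na(k)] <- Inf
if (any(k > too_high)) {
  warning("Some Pareto k diagnostic values are too high.")
}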

Although it doesn't solve the problem, it would be interesting to know how the Stan generalized Pareto distribution fit (http://mc-stan.org/users/documentation/case-studies/gpareto_functions.html) would work in this extreme case.

yao-yl commented 6 years ago

I tried the Stan generalized Pareto distribution fit:

> S <- round(sqrt(length(log_ratio)) * 3)   ### using the same number of tail samples, S = 149
> y <- sort(log_ratio, decreasing = TRUE)[1:S]   ### pick the largest S samples
> ymin <- min(y)
> ds <- list(ymin = ymin, N = length(y), y = y, Nt = 2, yt = c(14000, 15000))
> fit_gpd <- stan(file = 'gpareto.stan', data = ds, refresh = 0,
+                 chains = 4, seed = 100)   ### gpareto.stan from the Stan case study

It runs smoothly. But to my surprise, I get a k estimate of 0.22, with a misleadingly small posterior variance.

> print(fit_gpd)
Inference for Stan model: gpareto.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

               mean se_mean      sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
sigma        75.12    0.37    9.89    59.90    69.82    75.12    80.69    93.08   722 1.00
k             0.22    0.00    0.09     0.07     0.16     0.21     0.27     0.42  1545 1.00

The result must be wrong because the distribution obviously has a heavy right tail.

[Figure: density plot of log_ratio showing the heavy right tail]

So I checked the sampler diagnostics:

> check_treedepth(fit_gpd)
0 of 4000 iterations saturated the maximum tree depth of 10.
> check_energy(fit_gpd)
E-BFMI indicated possible pathological behavior:
  Chain 1: E-BFMI = 0.167
  Chain 2: E-BFMI = 0.063

MLE using Stan's optimizer gives a similar answer.
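For reference, a sketch of that optimization via rstan (same model and data as above; assuming gpareto.stan declares parameters k and sigma as in the case study):

library(rstan)
## posterior mode / (penalized) MLE with Stan's optimizer
mod <- stan_model(file = 'gpareto.stan')
fit_mle <- optimizing(mod, data = ds, seed = 100)
fit_mle$par[c("k", "sigma")]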

avehtari commented 6 years ago

y <- sort(log_ratio,decreasing=T ) [1:S]

y is now on the log scale, but gpareto.stan assumes it is on the linear scale. Change it to

y <- exp(sort(log_ratio,decreasing=T ) [1:S])
yao-yl commented 6 years ago

Sorry for the confusion. Now I transform the log ratios to the exp scale and it blows up. Since max(log_ratio) - min(log_ratio) = 1400, exponentiation produces an enormous number no matter what additive normalization is applied. This makes the log density negative infinity, and the Stan sampler fails too.
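A quick illustration of the overflow (a property of double precision, not of loo or the model):

log(.Machine$double.xmax)   ## ~709.78, the largest x with finite exp(x)
exp(710)                    ## Inf
## a spread of ~1400 on the log scale therefore cannot be exponentiated
## without some values becoming Inf (or underflowing to 0 after a shift),
## and the GPD log density evaluates to -Inf on such draws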

If I delete the largest log ratio, I get an estimate of k = 87 from the Stan fit.

I think the problem is clear now. An extremely large log ratio makes the log density of the generalized Pareto distribution become negative infinity. For the purpose of diagnostics, we may return k = +Inf in this situation.

jgabry commented 6 years ago

I think we need to change some code to return Inf. The current behavior is an error, right?


yao-yl commented 6 years ago

I am not sure whether it should return Inf or NA with a warning. The true shape parameter might exist but just be too large for the current parametrization to handle.

For example, I can generate samples from a generalized Pareto distribution with k = 100 using the R function rgp(n=1000, shape=100, scale=80). The true k is 100, but all estimation methods fail (the Stan generalized Pareto fit, MLE, and PSIS estimation on the log scale).

Sure, k = 100 is effectively Inf for the purpose of the PSIS diagnostic. Nevertheless, I wonder whether it is possible to find a better parameterization for estimating a large k.
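A sketch of that experiment which avoids depending on a particular package for rgp(): the generalized Pareto quantile function x = sigma * ((1 - u)^(-k) - 1) / k can be applied to uniform draws directly on the log scale, sidestepping the overflow discussed above.

set.seed(1)
k <- 100; sigma <- 80
u <- runif(1000)
## log of the GPD quantile function, computed stably:
## log(x) = log(sigma/k) - k*log(1-u) + log1p(-exp(k*log(1-u)))
log_x <- log(sigma / k) - k * log(1 - u) + log1p(-exp(k * log(1 - u)))

library(loo)
psis(log_ratios = log_x)  ## k-hat estimation fails despite the true k = 100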

jgabry commented 6 years ago

Yeah, I’m not sure. What prior are you using when estimating with Stan? And do you think we could use a ton of tail samples? Impractical sometimes but would be good to know.
