prappleizer / prappleizer.github.io

This repo contains my introductory python textbook for astronomy students, which covers the basics of learning the language with an emphasis on astronomical applications.
http://prappleizer.github.io/textbook.pdf

German tank problem, discrete uniform prior in emcee #3

Open toastmaker opened 3 years ago

toastmaker commented 3 years ago

Hello,

I read your blog and I hope you may be able to help me. I am learning Bayesian stats, discovered emcee, and I want to simulate the famous German tank problem: you capture some tanks with serial numbers captured = [42, 63, 101], and you want to estimate the total size of the army, assuming the serial numbers run from 1 to some Big_Number.

The prior on N, P(N), would be discrete uniform between max(captured) and Big_Number (the upper bound), and 0 otherwise.

The likelihood is also simple: P(x_i | N) = 1/N for 1 <= x_i <= N, and 0 otherwise.

I managed to write down ln_prior and ln_likelihood for the case of observing one tank, and got it to run in emcee by hardcoding the single observed value x_1 in the likelihood, but I am clueless about how to extend it to a vector of captured tanks. I thought it would be simple: just calculate the likelihood as the product of P(x_i | N) over the data, but I don't know how to pass the data.

I always end up with

ValueError: If you start sampling with a given log_prob, you also need to provide the current list of blobs at that position.

Would you show me how to define the parameters and discrete uniform distributions in emcee when the input x is a vector and the likelihood depends on the parameters, please?

import numpy as np
import emcee

big_number = 2000
x_obs = [42]       # serial numbers of the captured tanks
m = max(x_obs)     # highest serial number observed

def log_prior(n):
    if m <= n <= big_number:
        return 0.  # flat prior
    else:
        return -np.inf

def log_sampling(x, n):
    # log P(x | n) for a single observation
    if 0 < x <= n:
        return -np.log(n)
    else:
        return -np.inf

def log_posterior(n, D):
    lp = log_prior(n)
    s = 0.0
    for i in D:  # sum the log-likelihood over the captured serials
        s += log_sampling(i, n)
    return lp + s

nwalkers, ndim = 5, 1
pos = m + (2000 - m) * np.random.randn(nwalkers, ndim)

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[x_obs])
sampler.run_mcmc(pos, 20000, progress=True)
prappleizer commented 3 years ago

Hi Martin,

In general, it is trivial to extend 1-dimensional problems to vectors associated with each sample. However, I don't think that's actually needed in your case. The Bayesian likelihood one writes down for the German tank problem (I hadn't heard of it; I just looked it up) depends only on m and k: the highest serial number in the set and the number of tanks captured. It doesn't depend on the values of the serial numbers of any of the other tanks.

Of course, the "Bayesian" case here reduces to a simple maximum likelihood estimation that can be derived analytically. So there's not really any reason to estimate it using emcee (i.e., we turn to numerical sampling when the integral in question is intractable or, more often, high-dimensional).
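
(For reference, and if I recall correctly, the standard analytic results are: the maximum likelihood estimate is just m itself, and the minimum-variance unbiased estimator is m(1 + 1/k) - 1. For your captured = [42, 63, 101], that's m = 101 and k = 3, so the unbiased estimate is 101 * (4/3) - 1, i.e., about 133.7 tanks.)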

My thought is that if you are trying to "simulate" the problem with emcee, you still want the same setup as the analytic case. The two numbers we have (k, the number of tanks captured, and m, the highest serial captured) don't change from model evaluation to model evaluation. The only thing that changes from model to model is N, the total number of tanks, which is the thing we're trying to solve for.

So I might be misunderstanding the question here, but I'm pretty sure this is a one-parameter fit anyway. The issue is that your likelihood function seems to depend only on N, whereas it should depend (internally) on the input N as well as on m and k. Then you don't need D to be multidimensional, nor to iterate over it to get a joint likelihood (which, side note, you can avoid anyway by writing your functions so that D gets passed around as a vector array rather than looped over; see the sketch below).
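
Something like this (untested sketch) is what I mean by vectorizing over the data:

import numpy as np

def log_sampling(x, n):
    # x is an array of all the captured serial numbers at once
    x = np.asarray(x)
    if np.all((0 < x) & (x <= n)):
        return -x.size * np.log(n)  # sum of log(1/n) over the observations
    return -np.inf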

I think the likelihood you want is the one given as the probability mass function in the Wikipedia article on the German tank problem: L(n = N | m, k), i.e., the likelihood that the total number of tanks is N given some m and k. This likelihood is 0 if n < m, and is some function of n, m, and k otherwise.
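
As another untested sketch (with m = 101 and k = 3 taken from your captured list), that log-likelihood might look like:

import numpy as np
from scipy.special import comb

m, k = 101, 3  # highest serial observed, number of tanks captured

def log_likelihood(theta):
    # PMF of the sample maximum for k draws without replacement from {1, ..., n}:
    # P(M = m | n, k) = C(m-1, k-1) / C(n, k), and zero for n < m
    n = theta[0]
    if n < m:
        return -np.inf
    return np.log(comb(m - 1, k - 1)) - np.log(comb(n, k))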


toastmaker commented 3 years ago

Thank you!

You are right that it's a 1D problem in the unknown n. It reduces to a function of the parameters m and k, but let's imagine I don't know that in advance, and I ask what the likelihood P(x_i | N) is for each x_i that I observed. I can turn it into a 1D problem in the variable n:

import numpy as np
import emcee

big_number = 2000
D = [42, 63, 101]  # serial numbers of the captured tanks
m = max(D)

def log_prior(n):
    if m <= n <= big_number:
        return 0.  # flat prior
    else:
        return -np.inf

def log_sampling(x, n):
    if 0 < x <= n:
        return -np.log(n)
    else:
        return -np.inf

def log_posterior(n):
    lp = log_prior(n)
    s = 0.0
    for i in D:
        if n < i:
            s += -np.inf
        else:
            s += log_sampling(i, n)
    return lp + s

nwalkers, ndim = 5, 1  # defined before building pos
nsteps = 20000
pos = m + (2000 - m) * np.random.randn(nwalkers, ndim)

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior)
sampler.run_mcmc(pos, nsteps, progress=True)

but I get the same error even if the posterior takes no extra arguments:

ValueError: If you start sampling with a given log_prob, you also need to provide the current list of blobs at that position.

toastmaker commented 3 years ago

And here is what is not clear to me with emcee, and what I didn't get from the docs: in ln_posterior(theta, x), should I treat theta and x as numpy arrays and therefore be careful with conditions such as x < 3? Do I need to use np.any, np.all, np.where, etc.? Or is it evaluated as a posterior for a single element? Would you know, please?
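
My current understanding, which may well be wrong, is that emcee calls the function once per walker, something like:

import numpy as np

def log_posterior(theta, x):
    # emcee (without vectorize=True) calls this once per walker:
    # theta is a 1-D array of length ndim; x is whatever was passed via args.
    n, = theta  # unpack so the return value is a plain float; returning a
                # length-1 array apparently makes emcee treat the extra
                # elements as blobs, which may be what triggers the ValueError
    if n < max(x):
        return -np.inf
    return -len(x) * np.log(n)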

toastmaker commented 3 years ago

This is code that does what I want in pymc3; it's very intuitive, but I am unable to replicate it in emcee. Would you help me rewrite it in emcee, please? The whole point is to link num_tanks between the prior and the likelihood at the same time, which is what I fail to do correctly in emcee.

import pymc3 as pm

captured = [42, 63, 101]  # serial numbers of the captured tanks

with pm.Model():
    num_tanks = pm.DiscreteUniform(
        "num_tanks",
        lower=max(captured),
        upper=2000,
    )
    likelihood = pm.DiscreteUniform(
        "observed",
        lower=1,
        upper=num_tanks,
        observed=captured,
    )
    posterior = pm.sample(20000, tune=1000)
toastmaker commented 3 years ago

I can hard-code the likelihood to return comb(n, k) * 1/n**k, which is very explicit, and I was wondering if emcee can do this for me if I define only N as discrete uniform and the likelihood as discrete uniform.
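
For what it's worth, a minimal hand-rolled sketch along those lines runs for me if I use the simple product likelihood (1/n per observation, so n**-k overall), unpack theta to a scalar, and start the walkers inside the prior support. Note that unlike pymc3's DiscreteUniform, emcee treats n as a continuous parameter:

import numpy as np
import emcee

captured = [42, 63, 101]
big_number = 2000
m = max(captured)
k = len(captured)

def log_posterior(theta):
    n = theta[0]  # unpack to a scalar so emcee gets a plain float back
    if not (m <= n <= big_number):
        return -np.inf  # flat prior: zero probability outside [m, big_number]
    return -k * np.log(n)  # sum of log(1/n) over the k observations

nwalkers, ndim = 32, 1
# start walkers uniformly inside the prior support, rather than from randn
pos = m + (big_number - m) * np.random.rand(nwalkers, ndim)

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior)
sampler.run_mcmc(pos, 20000, progress=True)
samples = sampler.get_chain(discard=1000, flat=True)
print(np.median(samples))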