ssmit1986 / BayesianInference

Wolfram Language application for Bayesian inference and Gaussian process regression
MIT License

Uncertainty quantification for a temporal autocorrelation function #12

Open cmoecke opened 2 years ago

cmoecke commented 2 years ago

Dear Sjoerd,

as stated in the title, I want to perform uncertainty quantification for an ACF that I calculate as a function of the lag time dt. The ACF f(dt) is implicitly given by

d(dt) = A(1-f(dt))+B

where A and B are educated estimates of time-independent parameters (i.e. noise and amplitude estimates of a continuous signal), and d(dt) is a quantity calculated for said discrete lag times dt; these are the input parameters. Let's say for now that these parameters are independent of each other and normally distributed. Furthermore, I have some prior knowledge about f(dt), namely that it should be (normally) distributed within the closed interval [0,1] (the same goes for A and B).

As a first step, in order to estimate the associated uncertainty of the ACF, I did Gaussian propagation of uncertainty, obtaining df(dt). Now it might happen that the frequentist uncertainty margins violate this prior knowledge, and I have trouble justifying a hard constraint within that framework. That's why I thought this would be a great use case for Bayesian statistics. However, I am not quite sure how to implement it properly within this excellent paclet. My initial idea was to perform the nested sampling for each dt step as follows (please also see the attached notebook, where I provide example data and the full code):

fs = Table[nestedSampling[
    defineInferenceProblem["Data" -> {d[[i]]}, 
     "GeneratingDistribution" -> 
      NormalDistribution[ai*(1 - fi) + bi, dd[[i]]], 
     "Parameters" -> {{fi, 0, 1}, {ai, 0, 1}, {bi, 0, 1}}, 
     "PriorDistribution" -> {NormalDistribution[f[[i]], df[[i]]], 
       NormalDistribution[A, dA], NormalDistribution[B, dB]}], 
    "SamplePoolSize" -> 1000,
    "MonteCarloSteps" -> 1000,
    "TerminationFraction" -> 0.01,
    "MinIterations" -> 10], {i, Length@dt}];

where I input the values of f(dt) and df(dt) from the previous calculations and the Gaussian uncertainty propagation, respectively. Here I am not sure whether it would be possible to do the analysis with the full list of d instead of employing Table; that would speed up the calculations a bit, since for a full data analysis I would need to run this procedure ~50 times. I then get the new median and central 68% interval by

(* 16th, 50th and 84th percentiles, i.e. the median with a central 68% interval *)
Table[Around[#[[2]], {#[[2]] - #[[1]], #[[3]] - #[[2]]}] &@
  Quantile[EmpiricalDistribution[
    Values@fs[[i]]["Samples", All, "Point"][[All, 1]]],
   {0.16, 0.5, 0.84}], {i, Length@dt}]

The values I get in the end seem plausible to me, though I am unsure whether this approach is actually suited for this problem. Intuitively it makes sense to me, but maybe I am missing something crucial.

Thanks in advance!

UCP_Bayesian.nb.zip

SjoerdSmitWolfram commented 2 years ago

Hello,

Thanks for trying my code! I'd be happy to see if I can help get this to work, but unfortunately I'm fairly busy at the moment. I hope to get around to it sometime next week, so I thought it would be best to drop you a quick message that I've seen your question :)

Best, Sjoerd

cmoecke commented 2 years ago

Hi,

Thank you very much! That’s really nice of you!! :) Looking forward to your answer, Conrad

SjoerdSmitWolfram commented 2 years ago

Hi,

I just had a quick preliminary look, and a few questions came to mind:

  1. "it should be (normally) distributed within the closed interval [0,1]" What does this mean? A normal distribution is, by definition, unbounded. What kind of quantity does f represent? Is it always bounded between 0 and 1? Or between -1 and 1? In either case, it wouldn't make sense to me to model it with Gaussian noise.

  2. This is a regression-type problem, correct? Did you look at the logistic regression example in the notebook from the repo? I feel like the problem is fairly similar to that.

  3. What is the log likelihood function you want to use? If you can describe that clearly to me, it shouldn't be difficult to punch the problem into the code.

cmoecke commented 2 years ago

Hey, thanks for the reply!

1.) By that, I meant that the probability distribution of f for each lag time dt should be constrained within (0,1]. f represents an autocorrelation function obtained from a light scattering experiment (e.g. capturing a bright-field video of beads moving in water). In the end I use it to calculate a mean squared displacement (MSD), which has the following proportionality:

MSD(dt) ~ -Log[f(dt)]

That's why I strictly require the probability distribution of f to be > 0. The constraint of it being <= 1 simply says that the MSD must not be negative, since the un-squared displacement is strictly real-valued. This is always the case. For the other parameters, I only require them to be > 0. So... perhaps it would be better to assume that they are log-normally distributed? (I put a short sketch of how samples of f translate into the MSD after point 3 below.)

2.) Unfortunately, I think it is not. It can be, though, if I know the underlying mechanics of the measurement. For example, in the simplest case, which would be Brownian motion, the ACF decays like an exponential, Exp[-D*dt], where D is the diffusion coefficient. However, for more involved measurements, where I am not completely sure what is going on, I don't want to limit the evaluation at this stage by sticking to some (empirical) model (at least for now). In that case it is a better choice for me to look at the MSD mentioned in 1.) and compare it between different measurements in a model-independent way.

3.) There is, to my knowledge, no theoretical background on what the probability distributions of these parameters should be exactly, so I am not really sure. Maybe I'd stick to a log-likelihood similar to the one in my initial post, something like LogLikelihood[LogNormalDistribution[ai*(1 - fi) + bi, dd[[i]]], {d[[i]]}], and get the posterior distribution of fi for each lag time step. I am sorry that this is not very precise.
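To make the constraint in 1.) concrete, here is a minimal sketch of how posterior samples of f at a single lag time would translate into MSD values; the Beta-distributed samples and the variable names are only stand-ins for the actual sampler output:

(* stand-in for posterior samples of f at one lag time, constrained to (0, 1) *)
fSamples = RandomVariate[BetaDistribution[8, 2], 10^4];

(* MSD(dt) ~ -Log[f(dt)]; with f in (0, 1], the MSD is automatically >= 0 *)
msdSamples = -Log[fSamples];

(* median with a central 68% interval, as in the Around construction from my first post *)
Around[#[[2]], {#[[2]] - #[[1]], #[[3]] - #[[2]]}] &@
 Quantile[msdSamples, {0.16, 0.5, 0.84}]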

Thanks a lot for looking into this!!

SjoerdSmitWolfram commented 2 years ago

If f is definitely constrained within [0, 1], I think you should look for a natural way to phrase a model for f that satisfies this constraint. A log-normal distribution (i.e., the exponential of a normal distribution) would make sense for a strictly positive quantity, but if it's constrained within [0, 1], it would make more sense to model f as, e.g., the LogisticSigmoid of some other quantity. So something along the lines of

TransformedDistribution[LogisticSigmoid[...],  ...]

Or perhaps as a BetaDistribution or something like that. In the end, though, I can't tell you how to model your data, of course. All I can say for sure is that the model should fit naturally with the constraints of the problem.
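Spelled out a bit more concretely, something along these lines; the variable names and parameter values are only placeholders, not a recommendation for your data:

(* prior on f induced by an unconstrained normal prior on a latent variable g;
   LogisticSigmoid maps g into the open interval (0, 1) *)
priorOnF = TransformedDistribution[
   LogisticSigmoid[g],
   g \[Distributed] NormalDistribution[0, 1.5]];

(* or a distribution that is supported on (0, 1) by construction *)
priorOnFAlt = BetaDistribution[2, 2];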

One thing you definitely do want to do is write a single loglikelihood function for the whole dataset instead of a table of 35 independent ones. If the observations are independent, you should be able to just sum the loglikelihoods together. The function defineInferenceProblem can take loglikelihood functions directly, and in this situation I think that would be the most useful course of action.
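Schematically, the joint loglikelihood is just the sum of the per-lag-time Gaussian terms, something like the sketch below (a, b and f[i] are symbolic parameters; exactly how you then hand such a function to defineInferenceProblem is best checked against the example notebook in the repo):

(* sum of the independent Gaussian loglikelihood terms over all lag times;
   d and dd are the measured values and their uncertainties, as in your post *)
jointLogLikelihood = Total@Table[
    LogLikelihood[
     NormalDistribution[a*(1 - f[i]) + b, dd[[i]]],
     {d[[i]]}],
    {i, Length[dt]}];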

cmoecke commented 2 years ago

Okay, so I will try that and close the issue once I succeed. That helps me a lot, thank you :)