ssmit1986 / BayesianInference

Wolfram Language application for Bayesian inference and Gaussian process regression
MIT License
37 stars 4 forks source link

Fitting MixtureDistribution with large data #6

Closed cmock96 closed 3 years ago

cmock96 commented 3 years ago

Hey, I want to perform fitting of a MixtureDistribution with a dataset contarea that is a table of above 1000 entries. the code I am using is as follows

res["ContArea"] = nestedSampling[
  defineInferenceProblem[
   "Data" -> contarea,
   "GeneratingDistribution" -> 
    MixtureDistribution[{a, b}, {LogNormalDistribution[mu1, sig1], 
      GammaDistribution[mu2, sig2]}],
   "Parameters" -> {{a, 0.001, 1}, {b, 0.001, 1}, {mu1, 1, 10}, {sig1,
       0.01, 1}, {mu2, 1, 10}, {sig2, 10, 30}},
   "PriorDistribution" -> {"ScaleParameter", "ScaleParameter", 
     "LocationParameter", "ScaleParameter", "LocationParameter", 
     "ScaleParameter"}
   ],
  "SamplePoolSize" -> 100,
  "MonteCarloSteps" -> 1000,
  "TerminationFraction" -> 0.01,
  "MaxIterations" -> 3000
  ]

The inference process runs smooth, however, when I want to evaluate all viable fit parameters, via

result = takePosteriorFraction[res["ContArea"], 0.99]
points = Values@result["Samples", All, "Point"]

I will only get one set of fit parameters. When I chose to only take a low subsample of my data (e.g. 100 samples) it will give me several sets of fit parameters. Just fitting a NormalDistribution will result in the same problem. I also tried varying the parameters of

 "SamplePoolSize" -> 100,
  "MonteCarloSteps" -> 1000,
  "TerminationFraction" -> 0.01,
  "MaxIterations" -> 3000

to see whether it is connected to that, but that didn't help (at least for the possibilities I checked). Is this problem correlated with the large data size or am I overlooking something obvious here? Thanks for the help!

ssmit1986 commented 3 years ago

Hi,

First of all: glad to see someone giving my code a try :).

Without the data, I can only be of limited use because I'd need to try it out for myself to be really sure what's going on. It sounds like the algorithm trips up somehow and finishes before finding the bulk of the posterior distribution. Normally you'd fix that by increasing the "SamplePoolSize" and decreasing the "TerminationFraction", but in this case you could try increasing the "MinIterations" options to force the algorithm to explore the parameter space for longer. In fact, you could try a dry run with a low "SamplePoolSize" to find the rough location of the distribution and then narrow your prior accordingly. Not by too much, of course, but enough to get the sampler going in the right direction.

You could also try createMCMCChain to get a normal MCMC sample from the posterior with a long burn-in and see if that helps you locate the posterior mass. Or simply try EstimatedDistribution to at least find the maximum for a flat prior. The current version of the tool is not the best at narrowing down the bad part of the parameter space; I'd need to implement a method with a dynamically changing sample pool size to deal with that.

Finally, one thing you can try is to rephrase the problem as:

MixtureDistribution[{LogisticSigmoid[loga], 1 - LogisticSigmoid[loga]},
  {LogNormalDistribution[mu1, Exp[logSig1]],   GammaDistribution[Exp[logMu2], Exp[logSig2]]}
]

I noticed that you fit a and b independently, but this is problematic: the mixture distribution for a = b = 1 is exactly the same as for a = b = 2. Only the fraction a/b is actually important because MixtureDistribution normalizes the vector {a, b} internally. In addition, the positivity constraints on standard deviations etc. is usually best dealt with by using a positive mapping like Exp to make sure the parameters have the freedom to explore all positive numbers without having to impose an arbitrary lower bound.

Hope this helps :)

Edit

Fixed the proposed rephrasing of the problem.

cmock96 commented 3 years ago

Thank you so much for the immediate answer! Sadly, I cannot share the data, as it is yet unpublished. However, I tried replicating the problem by random sampling from a MixtureDistribution (which I obtained by FindDistribution[contarea]) as follows:

RandomVariate[MixtureDistribution[{0.7,0.3},{LogNormalDistribution[3, 0.2], 
 GammaDistribution[4.5, 20]}],1000]

and the same thing happens. I will try out the solutions you mentioned and report back to you. Thanks for the remark concerning the {a, b} vector - I just totally overlooked that.. .

As an off topic remark: I really enjoy the BayesianInference package and the amount of work you put into it. It really helps me to keep coding in Wolfram Language and not switch back and forth between other languages :)

ssmit1986 commented 3 years ago

Glad to hear it :). It's been a while since I worked on it, so it's good to hear that it's holding up.

If you can't get it to work, please post a minimal example using RandomVariate I can easily copy to try it out for myself so we won't be talking at cross odds about what's going on.

cmock96 commented 3 years ago

I added the suggested solutions, however there is something off with the fit now. Here is the code I used

contarea = 
  RandomVariate[
   MixtureDistribution[{0.7, 0.3}, {LogNormalDistribution[3, 0.2], 
     GammaDistribution[4.5, 20]}], 1000];
(*Find estimations for parameters*)
FindDistributionParameters[contarea, 
 MixtureDistribution[{LogisticSigmoid[loga], 
   1 - LogisticSigmoid[loga]}, {LogNormalDistribution[mu1, 
    Exp[logSig1]], GammaDistribution[Exp[logMu2], Exp[logSig2]]}]]
(*Run nested sampling algorithm (it does not take negative values in the "Parameters" option??  I added the - signs in the "GeneratingDistribution" option). I decreased the "TerminationFraction" drastically - it seems to fix the problem somewhat*)
res["ContArea"] = nestedSampling[
  defineInferenceProblem[
   "Data" -> contarea,
   "GeneratingDistribution" -> 
    MixtureDistribution[{LogisticSigmoid[loga], 
      1 - LogisticSigmoid[loga]}, {LogNormalDistribution[mu1, 
       Exp[-logSig1]], 
      GammaDistribution[Exp[logMu2], Exp[-logSig2]]}]
   ,
   "Parameters" -> {{loga, 20*0.8, 20*1.2}, {mu1, 3.4*0.8, 
      3.4*1.2}, {logSig1, 0.3*0.8, 0.3*1.2}, {logMu2, 3.07*0.8, 
      3.07*1.2}, {logSig2, 18.7*0.8, 18.7*1.2}},
   "PriorDistribution" -> {"ScaleParameter", "LocationParameter", 
     "ScaleParameter", "LocationParameter", "ScaleParameter"}
   ],
  "SamplePoolSize" -> 100,
  "MonteCarloSteps" -> 500,
  "TerminationFraction" -> 10^-6,
  "MaxIterations" -> 3000,
  "MinIterations" -> 1000
  ]
(*Plot data and inferences (GammaDistribution part does not seem to fit..)*)
With[{
  result = takePosteriorFraction[res["ContArea"], 0.99]
  },
 With[{
   points = Values@result["Samples", All, "Point"],
   weights = 
    Values@Quiet@
      Exp[# - Max[#] &@
        result["Samples", All, "LogPosteriorWeight", "Mean"]],
   model = 
    expressionToFunction[result["GeneratingDistribution", {1, 2}], 
     result["ParameterSymbols"]]
   },
  Show[
   Show @ MapThread[
     Plot[Evaluate@PDF[model[#1]][x], {x, 0, 300}, 
       PlotStyle -> Directive[Red, Opacity[#2/20]], Axes -> False, 
       GridLines -> Automatic, PlotRange -> {Automatic, All}] &,
     {points, weights}
     ],(*SmoothHistogram[{testdata["Normal"],testdata["Skew"]},
   PlotStyle\[Rule]{Red,Black}],*)
   SmoothHistogram[{contarea}, PlotStyle -> {{Dashed, Red}}, 
    PlotRange -> All],
   Plot[
    Evaluate[{
      PDF[predictiveDistribution[result], x]
      }],
    {x, 0, 300},
    PlotStyle -> {Black, Red},
    PlotRange -> All,
    Filling -> Axis, FillingStyle -> {Opacity[0.05]}
    ], FrameLabel -> {"x", "\[ScriptCapitalP](x)"}
   ]
  ]
 ]

Again - thank you for the help and patience :)

ssmit1986 commented 3 years ago

Hmmm, I haven't figured it out completely yet, but here are some things I noticed:

I'll let you know when I get more.

ssmit1986 commented 3 years ago

Ok, I think this should work reasonably well. I went back to your first try and changed it a bit:

contarea =  RandomVariate[  MixtureDistribution[{0.7, 0.3}, {LogNormalDistribution[3, 0.2],    GammaDistribution[4.5, 20]}], 1000];
res["ContArea"] = parallelNestedSampling[
  defineInferenceProblem["Data" -> contarea, 
   "GeneratingDistribution" -> 
    MixtureDistribution[{a, 1 - a}, {LogNormalDistribution[mu1, sig1],
       GammaDistribution[mu2, sig2]}], 
   "Parameters" -> {{a, 0, 1}, {mu1, 1, 10}, {sig1, 0.01, 1}, {mu2, 1,
       10}, {sig2, 10, 30}}, 
   "PriorDistribution" -> {"LocationParameter", "LocationParameter", 
     "ScaleParameter", "LocationParameter", "ScaleParameter"}], 
  "SamplePoolSize" -> 100, "MonteCarloSteps" -> 1000, 
  "TerminationFraction" -> 0.001, "MaxIterations" -> 3000, 
  "MinIterations" -> 1000
 ];

truncatedResult = takePosteriorFraction[res["ContArea"], 0.99]
Show[
 Quiet@Plot[
   Evaluate[{
     PDF[predictiveDistribution[truncatedResult], x],
     PDF[MixtureDistribution[{0.7, 
        0.3}, {LogNormalDistribution[3, 0.2], 
        GammaDistribution[4.5, 20]}], x]
     }],
   {x, 0, 200},
   PlotRange -> All
   ],
  SmoothHistogram[{contarea}, PlotStyle -> {{Dashed, Red}},  PlotRange -> All]
]

image

Note that you don't have to use the pre-canned "LocationParameter" and "ScaleParameter" priors. You can also use NormalDistribution[...] etc.

Don't forget about calculationReport to check the convergence!

cmock96 commented 3 years ago

Oh, great! Yes, works well on my end as well, thanks a lot!

Note that you don't have to use the pre-canned "LocationParameter" and "ScaleParameter" priors. You can also use NormalDistribution[...] etc.

Don't forget about calculationReport to check the convergence!

I did't know about that, very good to know. I tried to find the definitions of "LocationParameter" and "ScaleParameter" in the documentation, but couldn't find them and kinda forgot about it again. I have one question left: would it be possible to infer the effect size, i.e. Abs[mu1-mu2]/Sqrt[(sig1^2+sig2^2)/2] directly from the data and also obtain a standard deviation of it? I don't know how I should go about it. The thing I do right now is to calculate the effect size and then calculate the according uncertainty via Gaussian propagation of uncertainty, using the standard deviations of the Bayesian estimation of respective parameters. Although I think this is a reasonable approach, I don't think it's really elegant.

ssmit1986 commented 3 years ago

Yeah, the documentation could still do with some updates. There's a lot of ground to cover, so I try to update bits and pieces every time I come back to it. The Basic code example section shows what I mean. I recently updated is so you don't always have to specify the limits. If you specify distributions as priors, then this should also work:

defineInferenceProblem[
 "Data" -> data,
 "GeneratingDistribution" -> NormalDistribution[\[Mu], \[Sigma]],
 "Parameters" -> {\[Mu], \[Sigma]},
 "PriorDistribution" -> {NormalDistribution[0, 100],  ExponentialDistribution[1/100]}
 ]

As for the question about effect size: I think you should be able to just calculate Abs[mu1-mu2]/Sqrt[(sig1^2+sig2^2)/2] for each sample point and then use the MC estimate of that. So you just take the "EmpiricalPosteriorDistribution" property (which is an empirical distribution generated by the samples) and then transform that into the effect size distribution:

effectSizeDist = TransformedDistribution[
 Abs[mu1 - mu2]/ Sqrt[(sig1^2 + sig2^2)/2], 
 {a, mu1, sig1, mu2, sig2} \[Distributed]  truncatedResult["EmpiricalPosteriorDistribution"]
];
Mean[effectSizeDist]
StandardDeviation[effectSizeDist]
InverseCDF[effectSizeDist, {0.01, 0.99}]
cmock96 commented 3 years ago

Cool, that looks very good. I appreciate the help very much.