openturns / openturns

Uncertainty treatment library
http://openturns.github.io/openturns/latest/index.html
Other
237 stars 93 forks source link

The TruncatedNormalFactory sometimes fail #1275

Open mbaudin47 opened 5 years ago

mbaudin47 commented 5 years ago

The following script creates a TruncatedNormalFactory from a sample:

import openturns as ot
ot.RandomGenerator.SetSeed(0)
a = -1
b=2.5
mu = 2.
sigma = 3.
distribution = ot.TruncatedNormal(mu,sigma,a,b)
distFactory = ot.TruncatedNormalFactory()    
sample = distribution.getSample(11)
newDistribution = distFactory.build(sample)
print("Built-in Fitted=",newDistribution)

This prints

Built-in Fitted= TruncatedNormal(mu = -16060.1, sigma = 561.146, a = -0.696535, b = 2.33702)

The bounds are correctly estimated, given that the sample min is -0.5701 and the sample max is 2.211. The mu and sigma parameters are completely wrong, with a very negative mu and a large sigma. In this particular case, the PDF cannot even be drawn:

>>> newDistribution.drawPDF()
Traceback (most recent call last):
  File "<ipython-input-67-bf36ab6df2ab>", line 1, in <module>
    newDistribution.drawPDF()
  File "/my/path/openturns/build/install/lib/python3.7/site-packages/openturns/model_copula.py", line 6265, in drawPDF
    return _model_copula.Distribution_drawPDF(self, *args)
TypeError: InvalidArgumentException : Error: cannot draw a PDF with xMax <= xMin, here xmin=-11500.3 and xmax=-11500.3

In this case the optimizer (TNC) uses a large number of function evaluations and only stops when the difference between two consecutive likelihood function values is close to zero.

The current implementation uses a clever scaling of the data and disables the estimation of the bounds of the TruncatedNormal. This is why I tested the basic MaximumLikelihoodFactory class:

MLfactory = ot.MaximumLikelihoodFactory(ot.TruncatedNormal())
newDistribution = MLfactory.build(sample)
print("ML Fitted      =",newDistribution)

this prints:

ML Fitted      = TruncatedNormal(mu = 0.00949447, sigma = 1.00589, a = -0.934979, b = 0.934979)

This cannot be good: the TruncatedNormal max is 0.9350 while the sample max is 2.211.

I assume that estimating the parameters of a TruncatedNormal is instrinsically difficult. Using the method of moments would be an option, but this is not so easy given that this requires to solve nonlinear equations (there are references on this topic).

This failure explains why the TruncatedNormal distribution performs so badly in #1061: each failed TruncatedNormal distribution fitting has a huge cost, which turns out to make the Kolmogorov-Smirnov test slow.

mbaudin47 commented 4 years ago

In order to improve the performances of the Kolmogorov test, I implemented several new build methods in:

https://github.com/mbaudin47/openturns/commits/FixChaos

with 3 build methods:

Below is a bench of the performance.

import openturns as ot
import time

# Generate a sample
ot.RandomGenerator.SetSeed(0)
N = ot.Normal()
n = 100
sample = N.getSample(n)

factory = ot.TruncatedNormalFactory()

benchSize = 1000

print("buildMethodOfLikelihoodMaximization")
start = time.time()
for i in range(benchSize):
    distribution = factory.buildMethodOfLikelihoodMaximization(sample)
end = time.time()
elapsed1 = end - start
print("Elapsed time = %s (s)" % (elapsed1))

print("buildMethodOfMoments")
start = time.time()
for i in range(benchSize):
    distribution = factory.buildMethodOfMoments(sample)
end = time.time()
elapsed2 = end - start
print("Elapsed time = %s (s)" % (elapsed2))

print("buildMethodOfScaledLikelihoodMaximization")
start = time.time()
for i in range(benchSize):
    distribution = factory.buildMethodOfScaledLikelihoodMaximization(sample)
end = time.time()
elapsed3 = end - start
print("Elapsed time = %s (s)" % (elapsed3))

def CreateBarPlot(origin, elapsed, label, color):
    data = [[1.0, elapsed]]
    myGraph = ot.Graph('Performance of TruncatedNormalFactory', 
                       'Method', 
                       'Time (s)', True, 'topright')
    myBarPlot = ot.BarPlot(data, origin, color, 
                           'solid', 'dashed', label)
    myGraph.add(myBarPlot)
    return myGraph

# Graph
import openturns.viewer as otv
graph = CreateBarPlot(1.0, elapsed1, 
                      "LikelihoodMax.", "red3")
graph2 = CreateBarPlot(2.0, elapsed2, 
                       "Moments", "cornflowerblue")
graph.add(graph2)
graph3 = CreateBarPlot(3.0, elapsed3, 
                       "ScaledLMax.", "orange")
graph.add(graph3)
otv.View(graph)

Screenshot from 2020-03-20 22-26-11

This shows that the 2 new build methods are not faster.