haskell / statistics

A fast, high quality library for computing with statistics in Haskell.
http://hackage.haskell.org/package/statistics
BSD 2-Clause "Simplified" License
300 stars 68 forks source link

Fix fromSample for exponential distribution, take 2. #192

Closed lorinder closed 1 year ago

lorinder commented 1 year ago

Hi,

Unfortunately, my previous fixes for exponential distribution of fromSample were not enough.

Summary of problems

For one thing, the formula is mean in that it used (ED (mean xs)), but ED takes the rate, not the mean, so it should be (ED (1/(mean xs))) instead.

Second, the requirement that all the data point must be non-negative to fit an exponential distribution is counter-productive.

Why allow negative values?

If the data points are a noisy, some of them can end up being negative, even if the underlying distribution is generally exponential. It would be better if fromSample still worked for such noisy samples, rather than gratuitously returning Nothing instead. Remember that in practice, pretty much all data is noisy to some extent.

I think that in general, fromSample should return a non-Nothing fit for any distribution if doing so is possible, i.e., will result in a reasonable fit with valid distribution parameters.

In the case of the exponential distribution, this means that the sample mean must exist (i.e., there must be at least data point, no NaN values), and it must be positive.

Example

As an example, below some sample code that generates exponentially distributed data (rate = 2), and adds some unbiased Gaussian (standard normal) noise on top of it. Our goal is to recover the underlying exponential distribution from the sample (i.e., to replicate fromSample's task). Three methods are considered:

  1. Simply compute 1/mean. This corresponds to fromSample if it allowed negative values.
  2. Discard negative values before doing 1/mean.
  3. Clamp negative values to 0 before doing 1/mean.

The discarding and clamping methods above correspond to workarounds if one were to use fromSample, but needs to first fix up the data so that fromSample does not reject it due to negative values.

For the first method I get the estimated rate 2.03, for second and third0.91 and 1.37 respectively. This illustrates not only that the workarounds are inferior (to the point of uselessness in this case), but also that the negative values contribute information on the true value of the rate.

-- Generate a noisy sample.

import Control.Monad
import qualified Data.Vector.Unboxed as V
import System.Random
import System.Random.Stateful

import Statistics.Distribution
import Statistics.Distribution.Exponential
import Statistics.Distribution.Normal
import Statistics.Sample as S

main = do
    let g = mkStdGen 10
    let l = 100000          -- number of samples

    -- Generate samples on base distribution
    let de = exponential 2
    let (se, g') = runStateGen g $ V.replicateM l . genContVar de

    -- Generate noise samples
    let dn = normalDistr 0 1
    let sn = if True then
                runStateGen_ g' $ V.replicateM l . genContVar dn
             else
                V.replicate l 0

    -- Add the two to get our noisy distribution
    let v = V.map (\ (x, y) -> x + y) $ V.zip se sn

    -- Estimate the exponential with different methods.
    putStrLn $ "The first few samples: " ++ show (V.take 10 v)
    putStrLn $ "Parameter fit: " ++ show (1/(S.mean v))
    putStrLn $ "fromSample: " ++ show ((fromSample v)::Maybe ExponentialDistribution)
    putStrLn $ "\nAlternatives:"
    let v' = V.filter (>0) v
    putStrLn $ "Parameter fit obtained by discarding negative values: " ++ show (1/(S.mean v'))
    let v'' = V.map (\ x -> if x > 0 then x else 0) v
    putStrLn $ "Parameter fit obtained by clamping negative values: " ++ show (1/(S.mean v''))
Shimuuar commented 1 year ago

For one thing, the formula is mean in that it used (ED (mean xs)), but ED takes the rate, not the mean, so it should be (ED (1/(mean xs))) instead.

This is bad. I'll fix it

Second, the requirement that all the data point must be non-negative to fit an exponential distribution is counter-productive.

Actually I think that fromSample is just bad function. I doesn't even attempt to quantify how likely that sample comes from distribution in question. But that will require some serious rework of API

Dealing with negative samples is somewhat delicate question. In that case we're dealing not with exponential convolved with some normal-like distribution presumably with unknown but small width. It's delicate stuff I need to think about it.

P.S. $\lambda = \bar x/ N$ is not ML estimator for exponential distribution. Simple ML estimator will not work for sample with negative values since $p(x<0) = 0$ for any λ so likelihood becomes zero too.

lorinder commented 1 year ago

Second, the requirement that all the data point must be non-negative to fit an exponential distribution is counter-productive.

Actually I think that fromSample is just bad function. I doesn't even attempt to quantify how likely that sample comes from distribution in question. But that will require some serious rework of API

Yes, agreed. Another problem with the API that I found is that sometimes, some parameters are known, and one would like to fix them, rather than have fromSample find them. For example, for the binomial distribution, in some applications the $n$ is known, and only $p$ needs to be found. fromSample does not support that kind of use case.

Dealing with negative samples is somewhat delicate question. In that case we're dealing not with exponential convolved with some normal-like distribution presumably with unknown but small width. It's delicate stuff I need to think about it.

Understood, and agreed.

As a note aside, the noise does not have to be normally distributed; the estimator $1/\bar{x}$ will converge to the correct value for any zero-mean iid noise, not just zero-mean normal noise.

P.S. λ=x¯/N is not ML estimator for exponential distribution. Simple ML estimator will not work for sample with negative values since p(x<0)=0 for any λ so likelihood becomes zero too.

Interesting, I hadn't noticed this! Pedantically, though, since every value of $\lambda$ has likelihood 0 in this case, the choice $\hat{\lambda} = 1 / \bar{x}$ certainly maximizes the likelihood function, and is thus ML. It's just not the unique choice that does that. (Every $\lambda$ does, as you correctly point out.)

The bigger point is that ML is no silver bullet.

lorinder commented 1 year ago

As a note aside, the noise does not have to be normally distributed; the estimator 1/x¯ will converge to the correct value for any zero-mean iid noise, not just zero-mean normal noise.

Actually, zero-mean and finite variance.

Shimuuar commented 1 year ago

I gave it some thought and you're right. This change will work for exponential data smeared with symmetric experimental errors. It will inflate error but we aren't computing confidence intervals anyway.

Another problem with the API that I found is that sometimes, some parameters are known, and one would like to fix them, rather than have fromSample find them.

I think this function was a mistake. Poor single function just can't do all sort of different tasks. But designing proper API (EDSL?) for fitting models is difficult.

Pedantically, though, since every value of λ has likelihood 0 in this case, the choice 1/x certainly maximizes the likelihood function

Estimator doesn't improve as more data is added. And working with something that happens with zero probability is really uncomfortable :)

The bigger point is that ML is no silver bullet. Well, nothing is