superbobry / pareto

GSL powered OCaml statistics library
http://superbobry.github.io/pareto/0.2
MIT License

Multivariate Distributions #26

Open nrlucaroni opened 11 years ago

nrlucaroni commented 11 years ago

Is there a way to abstract the element type in the distribution modules, so that it can be 'float array' (or another data type) rather than only 'float'? And would that be sufficient to extend the distributions to multivariate ones? Something like...

module type Mean = sig
  type t
  type elt
  val mean : t -> elt option
end

nrlucaroni commented 11 years ago

I don't understand why adding the following does not work...

module type MultivariateDistribution = sig
  include BaseDistribution with type elt := float array
  val dimension : t -> int
end

with the error message,

Error: Only type constructors with identical parameters can be substituted.

but the following does,

module type MultivariateDistribution = sig
  type vector = float array
  include BaseDistribution with type elt := vector
  val dimension : t -> int
end

superbobry commented 11 years ago

Abstracting elt type in Mean and similar signatures sounds good. However, this won't be enough to support multivariate distributions.

I'm unsure what the best way to approach this is, but the first thing that comes to mind isn't very elegant:

module type UnivariateDistribution = sig
    type t
    type elt = float

    include BaseDistribution with type t := t and type elt := elt
end

module type MultivariateDistribution = sig
    type t
    type elt

    include BaseDistribution with type t := t and type elt := elt 
end

(* And, the boilerplate for discrete-continuous cases. *)

The reasons we currently keep the discrete and continuous cases separate are:

  1. It's handy to indicate which type of distribution your function operates on;
  2. GSL doesn't provide quantile functions for discrete distributions;
  3. We use labels to indicate the type of the argument for probability and cumulative_probability, so simply abstracting the type of the random variable won't work. Example:

    Normal.(cumulative_probability ~x:0.42 standard)
    Poisson.(cumulative_probability ~n:10 (create ~rate:0.42))
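To make point 3 concrete, here is a minimal sketch with toy, GSL-free stand-ins for the two kinds of distribution. The `Uniform` and `Poisson` modules and the `Cdf_x` signature below are assumptions made up for illustration, not the library's actual definitions.

```ocaml
(* Toy continuous distribution: the argument is a float labelled ~x. *)
module Uniform = struct
  type t = { lower : float; upper : float }
  let create ~lower ~upper = { lower; upper }
  let cumulative_probability d ~x =
    if x <= d.lower then 0.
    else if x >= d.upper then 1.
    else (x -. d.lower) /. (d.upper -. d.lower)
end

(* Toy discrete distribution: the argument is an int labelled ~n. *)
module Poisson = struct
  type t = { rate : float }
  let create ~rate = { rate }
  let cumulative_probability d ~n =
    (* Sum the probability mass for k = 0..n, updating each term in place. *)
    let rec go k term acc =
      if k > n then acc
      else go (k + 1) (term *. d.rate /. float_of_int (k + 1)) (acc +. term)
    in
    if n < 0 then 0. else go 0 (exp (-. d.rate)) 0.
end

(* A signature that abstracts only the element type still fixes the label. *)
module type Cdf_x = sig
  type t
  type elt
  val cumulative_probability : t -> x:elt -> float
end

(* Uniform fits; Poisson cannot fit for any elt, because its label is ~n. *)
module Uniform_cdf : Cdf_x with type t = Uniform.t and type elt := float =
  Uniform
```

The label is part of the function type, so a shared signature would need either a common label or separate signatures per label.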

As for the compiler error, I've never seen this one before. I think we should ask for clarification on the mailing list.

superbobry commented 11 years ago

Update: the compiler error is documented here:

There are a number of restrictions: [...] the definition must be either another type constructor (with identical type parameters).
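Given that restriction, naming the element type first and substituting the named constructor is one way around the error. The sketch below assumes a cut-down `BaseDistribution` with a single `sample` function and a toy `PointMass` module; both are hypothetical and only there to make the example self-contained.

```ocaml
(* Hypothetical stand-in for the library's BaseDistribution signature. *)
module type BaseDistribution = sig
  type t
  type elt
  val sample : t -> elt
end

(* [vector] is a type constructor without parameters, so it can be
   substituted for [elt] under the quoted restriction. *)
module type MultivariateDistribution = sig
  type vector = float array
  include BaseDistribution with type elt := vector
  val dimension : t -> int
end

(* Toy implementation: all mass sits on one fixed vector. *)
module PointMass : sig
  include MultivariateDistribution
  val create : float array -> t
end = struct
  type vector = float array
  type t = { location : vector }
  let create location = { location = Array.copy location }
  let sample d = Array.copy d.location
  let dimension d = Array.length d.location
end
```

After the substitution the signature no longer contains `elt` at all; `sample` simply has type `t -> vector`.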

superbobry commented 11 years ago

I've tried to generalize distribution signatures, so now each distribution also has an elt type. However, I'm unsure what to do with remaining signatures. For instance, Mean:

module type Mean = sig
  type elt
  type t

  val mean : t -> elt
end

Most discrete distributions have real-valued means, so we can't just include Mean with type elt := elt, and including Mean with a different element type seems hackish to me. What do you think?
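A small sketch of the mismatch, assuming a toy `Binomial` module (hypothetical, not the library's implementation): its samples are ints, but its mean is a float, so Mean has to be instantiated at a type other than the distribution's own elt.

```ocaml
module type Mean = sig
  type t
  type elt
  val mean : t -> elt
end

(* Hypothetical discrete distribution with integer-valued samples. *)
module Binomial : sig
  type t
  type elt = int
  val create : trials:int -> p:float -> t
  (* Mean is included at float, not at the module's own [elt]. *)
  include Mean with type t := t and type elt := float
end = struct
  type t = { trials : int; p : float }
  type elt = int
  let create ~trials ~p = { trials; p }
  let mean d = float_of_int d.trials *. d.p
end
```

This compiles, but the signature now mentions two different element types, which is the part that feels hackish.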

nrlucaroni commented 11 years ago

Yeah that's a tough one.

superbobry commented 11 years ago

Actually, what do you think about switching to objects for distributions? That way we can get rid of all the micro-signatures, like Mean, Variance, etc., because objects give us row polymorphism:

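(* An open object type needs an explicit row variable in a type definition. *)
type ('a, 'self) mean = < mean : 'a; .. > as 'self
type ('a, 'self) mean_opt = < mean_opt : 'a option; .. > as 'self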
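As a rough illustration of the object-based idea, the sketch below writes one function against an open object type and applies it to two differently shaped objects. The `normal` and `poisson` constructors and `describe_mean` are hypothetical names made up for this example.

```ocaml
(* Accepts any object with a float-valued [mean] method, whatever else it has. *)
let describe_mean (d : < mean : float; .. >) =
  Printf.sprintf "mean = %.2f" d#mean

(* Two toy distribution objects with different sets of methods. *)
let normal ~mean ~sd = object
  method mean = mean
  method variance = sd *. sd
end

let poisson ~rate = object
  method mean = rate
  method variance = rate
  method rate = rate
end

let () =
  print_endline (describe_mean (normal ~mean:0. ~sd:1.));
  print_endline (describe_mean (poisson ~rate:0.42))
```

No Mean-like signature is needed here: the type checker only requires that the `mean` method is present with the right type.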

superbobry commented 11 years ago

Okay, I've chosen to stick with modules for now. A multivariate normal distribution can be expressed as:

module MultiNormal : sig
  type elt = float array
  include BaseDistribution with type elt := elt
  include Features with type t := t and type elt := elt
  include MLE with type t := t and type elt := elt
end

However, I'm unsure if we should focus on this now: neither SciPy nor R provide multivariate distributions out of the box. So maybe we should delay this until later?

nrlucaroni commented 11 years ago

I prefer modules too. I thought R/SciPy provided a fairly full distribution suite, but looking at [1] and [2] I see that they have only a few basic ones, as you've pointed out. I think it's important to at least allow enough generality to implement multivariate distributions, and to provide a few basic ones.

[1] - http://docs.scipy.org/doc/numpy/reference/routines.random.html
[2] - http://cran.r-project.org/web/views/Distributions.html

superbobry commented 11 years ago

Well, for SciPy the list of supported distributions is a little longer, but still, all of them are univariate.