(multivariate) normal operating on sufficient statistics

syclik commented 9 years ago

From @bob-carpenter on August 2, 2014 16:27

Joshua N. Pritikin mentioned on stan-users that the OpenMx package has a multivariate normal that works on sufficient statistics:

In OpenMx, multi_normal has two implementations. One implementation handles data in the form of 1 case per row. The other implementation handles data as a covariance matrix (and means). These two approaches are equivalent for certain data (no missingness, etc).

I would think the second implementation would be much more efficient because the sufficient statistics could be computed just once.

Does anyone know what the density over the sufficient stats would look like?

Copied from original issue: stan-dev/stan#822

syclik commented 9 years ago

From @jpritikin on August 3, 2014 0:22

See Theorem 6.2.4 (page 78) in http://people.virginia.edu/~jnp3bc/Timo2013.pdf

syclik commented 9 years ago

From @jpritikin on August 5, 2014 2:51

BTW, you can use multi_normal on sufficient statistics to handle data with missingness. The trick is to partition the data by missingness pattern. Once you have the data partitioned then you compute the covariance matrix for each partition and add all the partitions together like a multigroup model.
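The partition-by-missingness idea can be sketched numerically. This is only an illustration using NumPy, not OpenMx's or Stan's implementation; all function names here are hypothetical, and the scatter matrix (sum of outer products of deviations from the group mean) is assumed as the covariance-type sufficient statistic.

```python
import numpy as np

def mvn_loglik_rows(Y, mu, Sigma):
    """Row-wise log-likelihood: one multivariate normal term per complete row."""
    n, k = Y.shape
    diff = Y - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.einsum("ij,ij->", diff @ np.linalg.inv(Sigma), diff)
    return -0.5 * (n * (k * np.log(2 * np.pi) + logdet) + quad)

def mvn_loglik_suff(n, y_bar, S, mu, Sigma):
    """Same quantity from (n, y_bar, S), where S = sum_i (y_i - y_bar)(y_i - y_bar)^T."""
    k = len(mu)
    P = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    d = y_bar - mu
    return -0.5 * (n * (k * np.log(2 * np.pi) + logdet) + np.trace(P @ S) + n * (d @ P @ d))

def loglik_by_missingness_pattern(Y, mu, Sigma):
    """Group rows by which columns are observed, then add one term per group."""
    groups = {}
    for row in Y:
        obs = ~np.isnan(row)
        groups.setdefault(tuple(obs), []).append(row[obs])
    total = 0.0
    for pattern, rows in groups.items():
        idx = np.array(pattern)
        if not idx.any():
            continue  # rows with nothing observed carry no information
        block = np.vstack(rows)
        y_bar = block.mean(axis=0)
        centered = block - y_bar
        S = centered.T @ centered
        # Marginal of a multivariate normal: keep the observed entries of mu and Sigma.
        total += mvn_loglik_suff(len(block), y_bar, S, mu[idx], Sigma[np.ix_(idx, idx)])
    return total

def loglik_row_by_row(Y, mu, Sigma):
    """Reference computation: marginal density of each row's observed entries."""
    total = 0.0
    for row in Y:
        obs = ~np.isnan(row)
        if obs.any():
            total += mvn_loglik_rows(row[obs][None, :], mu[obs], Sigma[np.ix_(obs, obs)])
    return total

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 1.5]])
Y = rng.multivariate_normal(mu, Sigma, size=12)
Y[:4, 2] = np.nan   # first pattern: third column missing
Y[4:7, 0] = np.nan  # second pattern: first column missing

grouped = loglik_by_missingness_pattern(Y, mu, Sigma)
reference = loglik_row_by_row(Y, mu, Sigma)
print(grouped, reference)
```

The grouped version computes one mean and one scatter matrix per missingness pattern, so the per-evaluation cost depends on the number of patterns rather than the number of rows.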

syclik commented 9 years ago

From @randommm on October 23, 2014 2:18

According to what I calculated for the univariate normal distribution, the derivative with respect to the "location parameter sufficient statistic" argument depends on each observed value (and not having to declare each observed value is the whole purpose of a sufficient statistic).

But anyway, I also think it makes sense to force the sufficient statistic arguments to be declared as data/constants only... but how could we achieve this at the Stan language level? Via the parser? (I know that at the C++ level we can do it in a similar way to how we handle integer parameters.)

syclik commented 9 years ago

From @bob-carpenter on October 23, 2014 2:51

There's no way to restrict functions to just data. The ODE solver can do it for some of its arguments because it's a special expression type, not a function.

But I don't understand why you'd want to restrict them here --- are the derivatives hard to calculate?

syclik commented 9 years ago

From @randommm on October 23, 2014 3:36

-- I hope my latex images will "parse" on github --

Hi, because I think the derivative of one of the sufficient statistic parameters depends on all observations.

That's the log-likelihood of a normal distribution using sufficient statistics (BDA page 64 with slightly different notation):

$f(\bar{y}, s, n, \mu, \sigma) = \log(\sigma^{-n}) - \frac{s^2 + n(\bar{y}-\mu)^2}{2\sigma^2} - \frac{n}{2}\log(2\pi)$

Where: $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ and $s^2 = \sum_{i=1}^{n}(y_i-\bar{y})^2$

Then: $\frac{\partial f(\bar{y}, s, n, \mu, \sigma)}{\partial \bar{y}} = \frac{\partial f(\bar{y}, s, n, \mu, \sigma)}{\partial (s^2)} \frac{\partial (s^2)}{\partial \bar{y}} + \frac{\partial f(\bar{y}, s, n, \mu, \sigma)}{\partial (n(\bar{y}-\mu)^2)} \frac{\partial (n(\bar{y}-\mu)^2)}{\partial \bar{y}} = - \frac{1}{2\sigma^2} \left( \frac{\partial (s^2)}{\partial \bar{y}} + \frac{\partial (n(\bar{y}-\mu)^2)}{\partial \bar{y}} \right)$

But: $\frac{\partial (s^2)}{\partial \bar{y}} = 2 \sum_{i=1}^{n} (\bar{y} - y_i)$

Therefore $\frac{\partial f(\bar{y}, s, n, \mu, \sigma)}{\partial \bar{y}}$ depends on all $y_i$.

I'm not sure if my calculations are correct though.
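As a quick numerical sanity check of the formula above, here is a minimal sketch in plain Python. The function names are hypothetical (they are not the stan-math implementation), and the full normalizing constant $\frac{n}{2}\log(2\pi)$ is included so the two forms can be compared exactly.

```python
import math

def normal_loglik(y, mu, sigma):
    """Sum of independent normal log densities, one term per observation."""
    return sum(
        -math.log(sigma) - 0.5 * math.log(2 * math.pi) - (yi - mu) ** 2 / (2 * sigma ** 2)
        for yi in y
    )

def normal_sufficient_loglik(y_bar, s_sq, n, mu, sigma):
    """Same log-likelihood written in terms of (y_bar, s_sq, n).

    s_sq is the sum of squared deviations from y_bar (not divided by n or n - 1).
    """
    return (
        -n * math.log(sigma)
        - (s_sq + n * (y_bar - mu) ** 2) / (2 * sigma ** 2)
        - 0.5 * n * math.log(2 * math.pi)
    )

y = [1.2, -0.7, 3.1, 0.4, 2.2]
n = len(y)
y_bar = sum(y) / n
s_sq = sum((yi - y_bar) ** 2 for yi in y)

full = normal_loglik(y, mu=0.5, sigma=1.3)
suff = normal_sufficient_loglik(y_bar, s_sq, n, mu=0.5, sigma=1.3)
print(full, suff)

# The term that looked problematic: sum of (y_bar - y_i) over all observations.
residual_sum = sum(y_bar - yi for yi in y)
print(residual_sum)
```

The two log-likelihoods agree because $\sum_i (y_i - \mu)^2 = s^2 + n(\bar{y} - \mu)^2$, and the residual sum is zero up to rounding by the definition of the mean.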

syclik commented 9 years ago

From @bob-carpenter on October 23, 2014 5:38

Isn't there just a density that involves the sufficient statistics? So if I have sample mean m and sample variance v, isn't it just a density p(m,v | ...)? In which case, we only need to propagate derivatives down to m and v.

P.S. There's a Preview tab where you can see what comes out. I don't think there's an easy way to include LaTeX.

syclik commented 9 years ago

From @randommm on October 23, 2014 11:2

I uploaded a version of my last email in pdf here: http://docdroid.net/jw3c

Note that the density only involves the sufficient statistics, but s_squared implicitly involves y_bar - y_i, and therefore y_i will show up in the y_bar derivatives.

syclik commented 9 years ago

From @bob-carpenter on October 23, 2014 16:24

On Oct 23, 2014, at 7:02 AM, Marco Inacio wrote:

> I uploaded a version of my last email in pdf here: http://docdroid.net/jw3c

Thanks --- definitely easier to read. You can attach pdfs to mail to the group, too.

> Note that the density only involves the sufficient statistics, but s_squared implicitly involves y_bar - y_i, and therefore y_i will show up in the y_bar derivatives.

But that's true in the usual normal distribution, too.

The point is just that there's a sensible density:

p(y-bar, s, n | mu, sigma)

for

y-bar in (-inf, inf) real

s in (0, inf) real

n in (0, 1, ...) integer

Presumably when the dust settles the density will be the same as if you had n observations with mean of y-bar and sd (MLE form, not dof adjusted) of s.

So it's only going to be a time-saver relative to using the vectorized density y ~ normal(mu,sigma) if y is data.

If you look at how normal_log() is implemented now, you'll see that it basically does the same calculation as is indicated. And indeed has to push the derivatives back down to the components of y if y is a variable.

Bob

syclik commented 9 years ago

From @randommm on October 23, 2014 16:38

Yes, but what I meant is that the derivative with respect to the y_bar parameter (the mean of the y_i) appears to include each y_i. But doing the calculations, it seems it doesn't, since $\sum_{i=1}^{n} (\bar{y} - y_i)$ will always be zero. So this might not be a problem after all.

syclik commented 9 years ago

From @bob-carpenter on October 23, 2014 16:53

If y_bar and s are defined in terms of some underlying vector of values y, then the chain rule will take care of passing the derivatives down to y where necessary. It's not part of the definition of this sufficient stat version of normal.
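To illustrate the chain-rule point, here is a small finite-difference sketch in plain Python (hypothetical helper names, not Stan's autodiff). It computes y_bar and the sum of squared deviations from an underlying vector y, differentiates the sufficient-statistic log density with respect to each y_i, and compares against the gradient of the ordinary normal log density, $-(y_i - \mu)/\sigma^2$.

```python
import math

def suff_loglik_from_y(y, mu, sigma):
    """Sufficient-statistic log density (constant dropped), with stats computed from y."""
    n = len(y)
    y_bar = sum(y) / n
    s_sq = sum((yi - y_bar) ** 2 for yi in y)
    return -n * math.log(sigma) - (s_sq + n * (y_bar - mu) ** 2) / (2 * sigma ** 2)

def finite_diff_grad(f, y, eps=1e-6):
    """Central finite differences with respect to each component of y."""
    grad = []
    for i in range(len(y)):
        up = list(y)
        up[i] += eps
        down = list(y)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

y = [1.2, -0.7, 3.1, 0.4, 2.2]
mu, sigma = 0.5, 1.3

numeric = finite_diff_grad(lambda v: suff_loglik_from_y(v, mu, sigma), y)
analytic = [-(yi - mu) / sigma ** 2 for yi in y]
print(numeric)
print(analytic)
```

When y is data, none of this gradient work is needed, which is where the sufficient-statistic form saves time.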

Noting that some operations cancel may help eliminate some redundant calculations, though.

syclik commented 9 years ago

From @randommm on October 24, 2014 22:35

I forgot to ask, is normal_ss_log a good name for it?

syclik commented 9 years ago

From @bob-carpenter on October 25, 2014 0:27

I would prefer something more verbose, like

normal_sufficient_log()

or even

normal_sufficient_stats_log()

or

normal_sufficient_statistics_log();

But the real problem I see is that what we really want is some kind of list-based sampling notation, as in:

(y_mean, y_sd) ~ normal_sufficient(N,mu,sigma);

So it's the bivariate density of the random variable composed of drawing N normal(mu,sigma) then computing y_mean and y_sd.

I don't think it makes sense to have that be a vector that's required to be size two.

Bob

syclik commented 9 years ago

From @randommm on October 29, 2014 23:45

ok, I think normal_sufficient_log() is good. The C++ part for the univariate case is ready and tested.

About that list based sampling notation, it would be great. But, that's purely parser-related, right? If so, I'm sure I won't be able to do it.

syclik commented 9 years ago

From @bob-carpenter on October 30, 2014 2:41

Mainly it needs a parser upgrade with an associated C++ data type.

Ben and Andrew seem to be moving to wanting to use increment_log_prob everywhere to remove any confusion that ~ is actually doing sampling and assignment.

Bob

syclik commented 9 years ago

From @andrewgelman on October 30, 2014 2:42

Yes, but I don’t think I literally want “increment_log_prob” everywhere, as it would represent a lot of typing!

syclik commented 9 years ago

From @randommm on October 30, 2014 3:26

I like that idea. Using something other than "increment_log_prob" would also allow solving #496 in the process, I think.

But, well, if that's the case, maybe it's better to leave normal_sufficient in the traditional way for now? y_mean ~ normal_sufficient(s_sq, N,mu,sigma); Ugly for now, but ok later.

syclik commented 9 years ago

From @bob-carpenter on November 1, 2014 18:39

You need to find an editor with auto-complete :-) Like say, emacs.

Seriously, though, do you have an alternative syntax? It'd be very easy to add a new syntax if you can come up with one. It'd be something that operated on any Stan expression (it could take scalars, vectors, arrays, whatever, and just adds all entries to the log prob accumulator).

Bob

syclik commented 9 years ago

From @andrewgelman on November 2, 2014 2:7

I dunno, “lp++”? The real difficulty is that people are used to that Bugs syntax. We can talk with Ben at some point, maybe we could come up with something good. I do think there are advantages to not using the “~” framework as it can be confusing (for example, we can have a sequence of assignments of the same variable).

A

syclik commented 9 years ago

From @betanalpha on November 2, 2014 9:23

Not coming from BUGS land I love the ~ notation. I read it as “is distributed as” instead of “is sampled from”. From this perspective we shouldn’t think about it as an assignment and things start to make sense.

syclik commented 9 years ago

From @andrewgelman on November 2, 2014 9:58

Mike: Here’s a difficulty with reading ~ as “distributed as”: Consider the following model:

theta ~ normal(0,1);

theta ~ normal(0,1);

This pair of statements is equivalent to

theta ~ normal(0,1/sqrt(2));

and this is clear from the increment_log_prob perspective, but it’s not so clear if you think of it as two “theta is distributed as” statements.

A
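The equivalence can be checked numerically. The sketch below (plain Python, with a hand-written `normal_lpdf` helper that is only a stand-in for Stan's function of the same name) adds the normal(0,1) log density twice and compares it with a single normal(0,1/sqrt(2)) log density. The two differ only by a constant that does not depend on theta, which is all that matters for sampling.

```python
import math

def normal_lpdf(x, mu, sigma):
    """Log density of a normal distribution, including the normalizing constant."""
    return -math.log(sigma) - 0.5 * math.log(2 * math.pi) - (x - mu) ** 2 / (2 * sigma ** 2)

def twice(theta):
    """Effect of writing theta ~ normal(0,1) two times: the log density is added twice."""
    return normal_lpdf(theta, 0.0, 1.0) + normal_lpdf(theta, 0.0, 1.0)

def once_narrow(theta):
    """Effect of a single theta ~ normal(0, 1/sqrt(2)) statement."""
    return normal_lpdf(theta, 0.0, 1.0 / math.sqrt(2.0))

thetas = [-2.0, -0.5, 0.0, 0.3, 1.7]
offsets = [twice(t) - once_narrow(t) for t in thetas]
print(offsets)  # the same value for every theta
```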

syclik commented 9 years ago

From @betanalpha on November 2, 2014 9:14

I guess I don’t think of “distributed as” as being as exclusive as an explicit assignment. Perhaps

theta +~ normal(0, 1)

is more in line with increment_log_prob? If only it weren’t super annoying to type.

syclik commented 9 years ago

From @andrewgelman on November 2, 2014 9:18

I don’t think theta +~ normal(0, 1) is any better, because we have the same problem with the pair of expressions:

theta +~ normal(0, 1);

theta +~ normal(0, 1);

The problem, I think, is that fundamentally these are not descriptions of theta or assignments to theta, but that’s what the ~ and <- notations look like.

To put it another way, the following 2 statements are equivalent in Stan:

(1) theta ~ normal(0,1);

or

(2) 0 ~ normal(theta,1);

But they look a lot different!

bob-carpenter commented 8 years ago

More discussion (perhaps dupe?) in #39

randommm commented 7 years ago

This one was merged on stan-math, but not exposed to stan-dev due to that conceptual problem of whether it should be ok to have y_mean ~ normal_sufficient(y_squared, n, mu, sigma) and normal_sufficient_lpdf(y_mean | y_squared, n, mu, sigma) despite the fact that y_squared is also data.

I saw some time ago that there were some discussions around changing the Stan language, but I lost track of them due to the amount of traffic on the list. Will it happen, and if so, will it automatically solve this dilemma?

bob-carpenter commented 7 years ago

What will solve the problem is having tuples. So what we'd have is something like replacing:

y ~ normal(mu, sigma);

with

(mean(y), sd(y)) ~ normal_sufficient(size(y), mu, sigma);

Is sum_of_squares(y) better than sd or variance? We wouldn't have to make it the typical sufficient statistics from the exponential family.

Bob

randommm commented 7 years ago

I prefer sum_of_squares because sd might be ambiguous as to whether one should divide by n or by (n-1). If I remember correctly, it also avoids one floating-point operation. But I don't mind changing it to sd.

The tuple solution is very interesting; it could be reused in the future if someone adds minimal sufficient statistics for other distributions. It's not available in the parser right now, is it? I'll create an issue for this on stan-dev/stan.
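For illustration, the sd ambiguity can be seen with Python's standard `statistics` module. This sketch assumes sum_of_squares means the sum of squared deviations from the mean, matching the $s^2$ defined earlier in the thread.

```python
import math
import statistics

y = [1.2, -0.7, 3.1, 0.4, 2.2]
n = len(y)
y_bar = sum(y) / n

# Unambiguous: no divisor is involved at all.
sum_of_squares = sum((yi - y_bar) ** 2 for yi in y)

# Two different quantities that are both commonly called "sd".
sd_population = statistics.pstdev(y)  # divides by n (the MLE form)
sd_sample = statistics.stdev(y)       # divides by n - 1

print(sum_of_squares, sd_population, sd_sample)
```

Either sd can be recovered from the sum of squares by dividing by n or n - 1 and taking a square root, which is the extra arithmetic that passing sum_of_squares directly avoids.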

bob-carpenter commented 7 years ago

Good point about sd() being ambiguous. Sum of squares sounds good then.

No, tuples aren't available yet. But @mitzimorris is working on it.

stan-dev / math

(multivariate) normal operating on sufficient statistics #38