make_crp's logDensityOfCounts looks suspicious

lenaqr commented 8 years ago

  def logDensityOfCounts(self,aux):
    term1 = scipy.special.gammaln(self.alpha) - scipy.special.gammaln(self.alpha + aux.numCustomers)
    term2 = aux.numTables + math.log(self.alpha + (aux.numTables * self.d))
    term3 = sum([scipy.special.gammaln(aux.tableCounts[index] - self.d) for index in aux.tableCounts])
    return term1 + term2 + term3

In particular term2 has a bare aux.numTables, which seems very strange. Some cursory testing in the shell seems to confirm my suspicions that something weird is going on:

venture[script] > assume a = uniform_continuous(0, 1)
0.939933700303
venture[script] > force(a, 0.5)
[]
venture[script] > assume c = make_crp(a)
[]
venture[script] > observe c() = atom<1>
[0.0]
venture[script] > observe c() = atom<2>
[-1.0986122886681096]
venture[script] > observe c() = atom<3>
[-1.6094379124341005]
venture[script] > observe c() = atom<4>
[-1.9459101490553135]
venture[script] > global_log_likelihood
[1.4254811915223122]
venture[script] > clear
venture[script] > assume a = uniform_continuous(0, 1)
0.954309928007
venture[script] > force(a, 0.1)
[]
venture[script] > assume c = make_crp(a)
[]
venture[script] > observe c() = atom<1>
[0.0]
venture[script] > observe c() = atom<2>
[-2.3978952727983702]
venture[script] > observe c() = atom<3>
[-3.0445224377234226]
venture[script] > observe c() = atom<4>
[-3.4339872044851463]
venture[script] > global_log_likelihood
[2.0313503639751982]

One would expect that global_log_likelihood be equal to the sum of the weights returned by the observes (up to an additive constant), but that doesn't seem to be the case.

lenaqr commented 8 years ago

It looks like actually the + in term2 should be a *, and if I make that change, the global_log_likelihood matches up with the observe weights in this example.

riastradh-probcomp commented 8 years ago

While you're here, please add a citation for this calculation!

riastradh-probcomp commented 8 years ago

It should be really easy to match up the code with a formula found in the literature, so that we can just eyeball discrepancies like this. Otherwise we spend hours or days agonizing over how some fragment of code got there when it was just a transcription error.

axch commented 8 years ago

Meta-comment: This is one of the problems with not having exact specifications of what logDensityOfCounts is supposed to return. Your change seems to suggest that it is possible to correctly implement p(x_i|theta) for crp, but that was not obvious a priori, causing me not think of trusting your validation procedure over the code (also, I didn't smell a rat in term2 being a sum).

axch commented 8 years ago

To close this issue thoroughly, Someone^TM should

[ ] Add a citation to a document describing the appropriate CRP formula
[ ] Add a unit test of CRP demonstrating that the version with the + fails some invariant (e.g., if the log density of counts is supposed to be for p(x_i|theta), then it should agree with the sum of the log densities of the individual data points)
[ ] Fix the formula

fsaad commented 8 years ago

@axch that someone is invariably¹ going to be me, since I am relying on the logDensityOfCounts for the Kepler demo.

Moreover what is the story with the commented out stub for gradientOfLogDensity? I am assuming the writer discovered that crp is discrete, although it seems this commit is the offender which makes no sense to me.

¹Not sure if this usage of "invariably" is correct; basically I mean it is going to be me and I cannot escape.

lenaqr commented 8 years ago

It would still make sense to have a gradient of log density for discrete procedures if some of the inputs are continuous. In this case, the CRP itself has no inputs, so there's nothing to take a gradient with respect to.

However it does make sense to take the gradient of the likelihood of all the counts, with respect to the parameters of the maker procedure, which is what gradientOfLogDensityOfCounts (currently not implemented there) should be.

fsaad commented 8 years ago

Proposal Rewrite the CRP class specializing to the case of a simple one-parameter family in terms of concentration.

Currently the default invocation takes two parameters, one optional (make_crp alpha d=0). The reference to the "two-parameter" CRP comes from http://link.springer.com/article/10.1007/BF01213386 Page 150, Proposition 9 (and a sloppy Wikipedia entry). Pitman's claim is that setting d=0 recovers the CRP and I believe him but I do not think we should bother with d at all:

A "two parameter" CRP is in fact not a crp in the standard sense, it is another model. So a user writing (make_crp 1 1) is obtaining some generalization of crp that rarely appears in the literature.
The formulas carrying around d and alpha are confusing and unverified. On Wikipedia they do not have a reference, or show the equivalence to one-param CRP when d=0, although our code copies them directly (and incorrectly -- see section In the one-parameter case, where \alpha is zero, this simplifies to...). We can simplify and test easier by focusing on one parameter.
It is not clear what probabilistic models we are interested in where CRP takes two parameters anyway. My understanding is Pitman's paper was motivated by the theoretical question, " What is the most general possible distribution of the sequence (Pi) of limiting relative frequencies of classes in order of appearance for an exchangeable random partition \Pi of N?" , rather than actual use cases in probabilistic machine learning.
Clients of CRP in Venturecxx typically ignore the second parameter anyway, and it has caused confusion where it has appeared before i.e. see @axch inline comment here
The two parameter generalization can be a separate class called pep (partially exchangeable partition according to Pitman) if we ever need it, and can serve as a cross-test for the crp implementation.

Ideas?

fsaad commented 8 years ago

Moreover, it occurs to me that the new_table emitted by simulate must always be a fresh atom without reusing previous tables that reached zero counts -- i.e. if the counts of table 1 goes to zero, table 1 should not appear agains.

This behavior is for a user who is storing the table numbers externally, for instance as a parameter for a memoized procedure where the actual values of the atoms matter. If a user

Used atom<1> for the memproc
Did some operations so that the count of table 1 went zero
User resimulated the CRP and received atom<1> again (a reused table) --> the memproc will lookup the now outdated atom<1> rather than simulating again.

I believe this behavior is incorrect since a new CRP table must be in correspondence with a fresh draw from the base measure of the underlying Dirichlet Process.

axch commented 8 years ago

The memproc reuse scenario is somewhat less likely than @fsaad suggests, because unreferenced mem table entries are garbage collected. Basically the only way I can think of to make trouble with only one crp would be to defeat Venture's dependency tracking:

(do (foo <- (sample (cluster_of id)))
      (predict (value_at ,foo))
      <infer a bunch: crp might reuse memmed results because the prediction is holding the mem>)

A more serious problem is that the atoms generated by different crps are not distinct, so can easily cross streams and cause all kinds of problems. If we implement the unique opaque objects needed for #65, we could make all crps emit globally unique such objects, avoiding cross-crp reuse.

Incidentally, my recollection is that trying to avoid these issues was the whole reason why atoms were distinguished from numbers in the first place, but I think only @vkmvkmvkmvkm remembers that history now.

axch commented 8 years ago

Re: proposal to eliminate the second parameter to make_crp:

No objection to giving a more accurate name to the two-parameter version.
No objection to also providing a distinct name for the one parameter version, e.g. as assume make_crp (lambda (al) (make_pep al 0)), if you think that will be clearer to users.
My memory of the motivation for the dispersion parameter was to permit a prior with fatter tails on the number of components than crp produces. My recollection is that with d=0, the prior on cluster count is logarithmic in the number of data points, whereas with d>0 it becomes inverse-polynomial. If this memory is correct, that seems like an important degree of freedom to provide.
Generally, one of our advantages over "the literature" is that if we debug and package an implementation of a model or inference technique, we get to reuse it more times than a typical practitioner, so I would expect us to want one grade of things that are sophisticated enough to "appear rarely".
However, there is the counter point that if it appears rarely then our users are less likely to know what to do with it even if it is there, and it is that much harder for us to implement correctly and debug, because the object is less well studied. On this point of principle I am therefore ambivalent.
That said, I am loath to abandon an object that I was mentally modeling as us already having. Do you actually think there are more bugs than just the one that spawned this thread? [Which we already found, and should just fix.]
For the scenario in which we choose to maintain the two-parameter crp, do you think it would be easier to maintain both the one-parameter crp and the two-parameter one as separate pieces of code with cross-testing than as one piece of code with fixing d=0 for the one-parameter case?

fsaad commented 8 years ago

No objection to giving a more accurate name to the two-parameter version.

No objection to also providing a distinct name for the one parameter version, e.g. as assume make_crp (lambda (al) (make_pep al 0)), if you think that will be clearer to users.

If we keep the current version, renaming the two-param version and having make_crp be a thin wrapper would definitely be the right choice.

Generally, one of our advantages over "the literature" is that if we debug and package an implementation of a model or inference technique, we get to reuse it more times than a typical practitioner, so I would expect us to want one grade of things that are sophisticated enough to "appear rarely".

However, there is the counter point that if it appears rarely then our users are less likely to know what to do with it even if it is there, and it is that much harder for us to implement correctly and debug, because the object is less well studied. On this point of principle I am therefore ambivalent.

I believe that the generalized CRP has no demonstrated use as of now, hinders the simple and more common case where we just need a simple CRP. The difficultly of debugging and maintenance cost seems unjustified, and I would rather our package contain solidly tested and understood samplers rather than having advanced ones for the sake of having them.

-That said, I am loath to abandon an object that I was mentally modeling as us already having. Do you actually think there are more bugs than just the one that spawned this thread? [Which we already found, and should just fix.]

I have manually verified that the fix from @luac second comment yields the logDensityOfCounts for the special case d=0 (see page 4 equation 8, Blei and Gershman for the density). I have not yet found a reliable reference for the equation when d \ne 0 that we have.

There is definitely another bug in the crp.py implementation. To satisfy the rules of probability it is necessary the two-parameter model with (alpha, theta) satisfy page 62, Pitman

either alpha = -k < 0 and theta = m * k for integer m,
or alpha \in [0,1] and theta > -alpha.

(our code uses d for alpha, and alpha for theta).

Neither of these conditions is checked by our code. I imagine that encoding them and ensuring they are maintained in the trace during inference, etc, is an additional headache and the final straw for me in having a standalone simple CRP.

For the scenario in which we choose to maintain the two-parameter crp, do you think it would be easier to maintain both the one-parameter crp and the two-parameter one as separate pieces of code with cross-testing than as one piece of code with fixing d=0 for the one-parameter case?

Yes, if we have a simple CRP which we understand and test properly then we will have a reference which we can confidently use to ensure the generalize CRP recovers the simple CRP.

fsaad commented 8 years ago

RE: The memproc reuse scenario is somewhat less likely...

Cool. It seems that you had a concern about memoizing on the returned atoms, see inline comment from crp.cxx. Did you answer your own question above, or is your comment in the code addressing something else?

axch commented 8 years ago

I think that comment was worrying about what happens within one transition, whereas my thoughts above were across transitions.

fsaad commented 8 years ago

I am in the process of finding a more reliable reference for the distribution over partitions for the generalized CRP (using the discount and strength parameterization), after which I'll do the fix.

Looking at the class test_pymem convinces me that the two parameter CRP is very useful -- it is a solid example where we use the two parameter CRP alongside an extended stick breaking construction (where we use Beta(1-d, alpha+kd) as oppose to Beta(1, alpha) for the stick weights) to construct a Pitman-Yor Process (as oppose to Dirichlet), which is pretty cool and useful.

The implementation remains tricky in that the relationship between alpha and d is quite strict -- currently we have d \in [0,1] and alpha > 0 which is not the most general since alpha > -d is valid, as well as the second case above. How can we ensure these invariants are maintained?

fsaad commented 8 years ago

Finally got the expressions worked out -- writing a documenting and fixing crp.py now.

probcomp / Venturecxx

make_crp's logDensityOfCounts looks suspicious #316