MTKnife opened this issue 6 years ago
The normal in the sampling follows from the math and the generative model. I am not sure exactly what is meant by interval, but I do think changing the model may lead to better results depending on the task. Are you saying that the implementation of the generative model is off? Or that a new generative model would be better?
OK, let me walk through it from the beginning so that you can correct any mistaken assumptions on my part.
The lambda parameter modifies the priors drawn from the delta matrix: a high lambda (close to 1) gives strong priors for the known topics, while a smaller lambda gives weaker priors.
We assume that some of the known topics will diverge more from the corresponding knowledge source documents than others (some should have weaker priors than others). Therefore, the model expresses lambda as an (approximated) normal distribution rather than as a scalar, with mean mu and standard deviation sigma. The (continuous) normal distribution is approximated by specifying a number (A) of evenly spaced discrete points along the distribution (by default, 5 points spanning mu - 2 sigma to mu + 2 sigma); each point is used to calculate a discrete interval along the distribution, with each interval approximated by the value of the probability density function at the associated point (this is your "norm" vector in the code).
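For concreteness, here's roughly how I picture that approximation (a NumPy sketch; the names are mine, and the real code may or may not renormalize the weights):

```python
import numpy as np

def lambda_grid(mu, sigma, A=5):
    # A evenly spaced points covering mu - 2*sigma .. mu + 2*sigma
    points = np.linspace(mu - 2 * sigma, mu + 2 * sigma, A)
    # each point's interval is approximated by the normal pdf evaluated at it
    # (my stand-in for the "norm" vector)
    weights = np.exp(-0.5 * ((points - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return points, weights
```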
My understanding (based on my reading of your article) is that this approximation provides a range of potential values of lambda for each topic; weighting the sampling of each word by the probability density of lambda's distribution makes values of lambda closer to mu more likely to have a substantial effect on the sampling. The intended effect, I think, is to (implicitly) estimate the appropriate value of lambda for each topic-word combination; the intention of the weighting is to center those estimates around mu. Obviously, in reality there are dependencies between the different potential values of lambda (if one lambda is the right value, all others are wrong), but with enough iterations, that shouldn't matter, and in the end we should get a set of phi estimates that suggests that the true value of each lambda is somewhere between two of the A discrete points along the distribution.
However, given the way the code calculates the probability matrix ("pr"), I don't think that's what's actually happening. The code correctly generates a range of lambda values and raises each delta to those values ("delta_pows"), creating an array of A hyperparameters for the topic-word combination corresponding to each delta. The code also calculates the norm vector and multiplies each probability calculation (whether in "Populate_prob" or "Calculate_phi") by that weight.
So far, so good. The problem, I think, is that the code makes a single count for each word for each known topic ("n_w[t]"). It then multiplies this count by each of the delta^lambda values for each topic-word combination (by default, 5 of them), weights the products by the "norm" vector, and sums them all up (I'm leaving out the other terms, for the sake of simplicity, but those, like the topic-word count, are all constant for a given topic-word combination).
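Transcribing that description literally into NumPy (leaving out the other, constant factors as above; the function and argument names are mine, not the repo's):

```python
import numpy as np

def pr_known_topic_current(n_wt, delta_wt, lambdas, norm_weights):
    # delta raised to each lambda value on the grid ("delta_pows")
    delta_pows = delta_wt ** lambdas            # shape (A,)
    # the single count n_w[t] for this word enters every interval's term;
    # each term is weighted by its pdf value, and everything is summed
    per_interval = norm_weights * (n_wt * delta_pows)
    return per_interval.sum()                   # one number per topic-word pair
```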
Mathematically, the result of this is not all that different from using a single value of lambda (equal to mu): because the deltas are raised to the powers of the (5) different lambda values before we take a weighted average, and all of the lambda values are typically less than 1, the values of phi (and the estimates in the "pr" vector) will all be a little larger than they would be with a single value of lambda; for larger deltas, the effect is more pronounced, but it's still not big. It's not quite the same as using a single lambda but it's pretty close.
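A quick numerical check of that claim, with made-up values for mu, sigma, and delta:

```python
import numpy as np

mu, sigma, A = 0.7, 0.1, 5                       # made-up values
lams = np.linspace(mu - 2 * sigma, mu + 2 * sigma, A)
w = np.exp(-0.5 * ((lams - mu) / sigma) ** 2)
w /= w.sum()                                     # normalized "norm" weights

for delta in (2.0, 50.0, 1000.0):
    single = delta ** mu                         # a single lambda equal to mu
    averaged = np.dot(w, delta ** lams)          # weighted average over the grid
    # averaged is only modestly larger than single, more so for big delta
    print(delta, single, averaged)
```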
If you're actually trying to accomplish what I understand you to be aiming for, the change you'd need to make in the code is to skip the summation step in "Populate_prob" and to pass each of the A individual probability estimates back to "Pop_sample". In effect, you'd treat each topic-lambda combination as a separate topic and preserve a separate count for each in "n_w" (which means adding a third dimension to that matrix for the known topics). If everything goes correctly, the word counts for each topic-word combination should increase for the two values of lambda that bracket the true value (or for just one value, if the true value is more than two standard deviations from the mean), while the counts for the other values of lambda should shrink toward 0.
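As a sketch of that change (same caveats as the sketch above, and still leaving out the other terms):

```python
import numpy as np

def pr_known_topic_split(n_wt_by_interval, delta_wt, lambdas, norm_weights):
    # n_wt_by_interval has shape (A,): a separate count per (topic, interval, word),
    # i.e. the extra dimension added to n_w for the known topics
    delta_pows = delta_wt ** lambdas
    # no summation here: each (topic, interval) is sampled as if it were its own topic,
    # so counts can concentrate on the intervals that bracket the true lambda
    return norm_weights * (n_wt_by_interval * delta_pows)   # A candidate probabilities
```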
I do have one other question: given Bayesian assumptions, the larger the corpus, the less the effect of the priors from the knowledge source should be; for an infinitely large corpus, the knowledge source would have no effect at all. Is that intentional? That is, does your model assume that the data will eventually reveal the proper topics, given a large enough corpus, but simply needs a little boost in the beginning, or are you trying to impose a type of order that couldn't be drawn from the data alone? Either objective might be correct for different applications, but which one you pick could influence how you deal with the issue I originally raised (for example, if you want to let the data speak, I'm not sure you need an algorithm to find the right value of lambda, since the effects of the deltas are decreasing even as you're estimating the lambdas they modify).
I think that is a good approach; keeping a count for each interval of the normal is good, as it implies a Dirichlet prior as opposed to a normal one. One of the co-authors initially suggested this method, but we ultimately went with a normal prior to simplify the model. The Dirichlet has a lot of computational advantages and is more Bayesian, so eventually we will go in this direction. Keeping the counts with the model as is and adding another dimension to the known topics is fine as well, though if I understand it correctly, I think it would work out to be the same thing.
A large corpus, or a small corpus paired with a large knowledge source, is a weakness of the model. I think this can be mitigated by normalizing the counts, but I've been lucky in that the data sets and knowledge sources I work with are pretty similar and do not skew the results one way or the other.
As for the original issue...I'm pretty sure that if you add that dimension, it won't work out to the same thing. I had a hard time with the explanation above, but another way to think about it is in terms of information:
In the current implementation, the code calculates a separate value of "pr" (probability of a given word) for each interval for each known topic. Thus, we have a separate bit of information for each interval along the probability density function. When all of those values of "pr" are summed up for a given topic, we lose all of the information that distinguishes one interval from another--all we have left is information for the topic as a whole, and that's what gets passed to the next iteration. However, if we treat each interval as a topic, we keep that information and pass it through to the next iteration, and so the phi estimates for the different intervals start to diverge.
In my particular application, I need to be able to impose topics on the data (the users want to group things by certain topics, regardless of what the data say), and since I'm writing an application designed for a continuous stream of data, I can't just rely on priors. What I've been thinking about is something like the below for computing phi for the known topics (using the notation from your article; also, note the norming of the deltas by use of a proportion rather than a raw count):
lambda * (delta_{wt} / delta_{t}) + (1 - lambda) * ((n_{wt} + beta) / (n_{t} + V * beta))
That's not quite right, though, since the second term (the standard formula for LDA) shouldn't include the portions of the count that would be accounted for by the deltas. One approach is to get rid of the second term entirely and just let that variance be handled by the unknown topics (admittedly, we'd get some non-source topics that tended to turn up in the same documents as the known topics, but that may be OK--and we no longer have to specify a lambda). To me, though, it seems more practical to follow your original approach of allowing the known topics to vary somewhat from the knowledge source--we want to be able to identify documents that don't use all of the same words, but are about the same thing as the original source document for a given topic. For Gibbs sampling, this would involve storing a separate count for the words assigned to the second (variable) term for each topic, which we'll label "n_{wtv}". The second term then becomes:
(n_{wtv} + beta) / (n_{t} + V * beta)
In other words, we've got two generative functions for a given word in each topic. The difference between doing it this way and simply treating the fixed term as a topic on its own is that the use of a single n_{t} total creates a dependency between the two terms. Note that we don't need the (1 - lambda) anymore, since using n_{wtv} instead of n_{wt} in the second term accomplishes the same effect. We might, however, need to use a smaller beta for the known topics than for the unknown topics, probably (1 - lambda) * beta.
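To make the bookkeeping concrete, here's a rough NumPy sketch of that two-term phi for a single known topic (the names just mirror the notation above; this is an illustration of the proposal, not code from the repo):

```python
import numpy as np

def phi_known_topic(delta_t, n_tv, n_t, lam, beta):
    # delta_t : length-V vector of knowledge-source counts for topic t (delta_{wt})
    # n_tv    : length-V counts assigned to the variable term only (n_{wtv})
    # n_t     : total count for topic t, shared by both terms
    # lam     : lambda for topic t; beta : symmetric Dirichlet hyperparameter
    V = len(delta_t)
    fixed = delta_t / delta_t.sum()                  # delta_{wt} / delta_{t}
    beta_t = (1.0 - lam) * beta                      # the smaller beta mentioned above
    variable = (n_tv + beta_t) / (n_t + V * beta_t)  # (n_{wtv} + beta) / (n_{t} + V * beta)
    return lam * fixed + variable                    # no (1 - lambda) on the second term
```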
I've still got to work out if there's a reasonably simple equivalent method for doing this with variational Bayesian inference.
For an online/streaming application, the above would be slightly different--for each mini-batch, you'd use the posterior of the last mini-batch (the estimated phi) as the prior for the current mini-batch (replacing beta).
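Roughly, with a hypothetical fit_batch routine and scale factor standing in for the details:

```python
def streaming_fit(mini_batches, phi_init, fit_batch, scale=100.0):
    # fit_batch(batch, prior) is a stand-in for whatever inference routine runs on one
    # mini-batch with a per-topic, per-word prior; scale turns the previous (K, V) phi
    # estimate into pseudo-counts that replace the flat beta
    phi = phi_init
    for batch in mini_batches:
        prior = scale * phi           # posterior of the last mini-batch becomes the prior
        phi = fit_batch(batch, prior)
    return phi
```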
Whoops, it's not just the single n_{t} that binds the fixed and variable terms together--it's also the fact that they share the same column in the theta matrix.
After a couple of months of working with the algorithm (including both modifying these scripts and writing another application that implements it with online variational Bayes), I think I see a problem with the math in the "Pop_sample" and "Populate_prob" functions.
The use of the probability density function to allow variation in the extent to which each topic matches the knowledge source is very clever; however, I don't think the code accomplishes what was intended, because, when calculating the "pr" vector, it aggregates the values for all A intervals for each known topic before sampling each word.
The effect of performing the sampling in this fashion is that no information is extracted that would allow the algorithm to weight the estimation towards some intervals and away from others (which, I believe, is the point of using the probability density function in the first place). It looks like what should happen instead is to treat each interval as a separate topic (though the "calculate_phi" function can probably stay the way it is, since it's not used in the estimation).
Mind you, breaking up the known topics in this way would reduce the statistical power of the algorithm for the known topics--and with small corpora, that could weight the results toward the (undivided, hence more accurately estimable) unknown/arbitrary topics.