karuiwu / reparse


Iterate on features for the classifier #6

Open karuiwu opened 11 years ago

karuiwu commented 11 years ago

Quick summary of current features; will add on.

  1. (Implemented) Core features:
    • an n-gram of POS tags consisting of the top n-1 words on the stack and the first word on the buffer
  2. Calculating an uneasiness factor
    • multiply the probabilities of each of the actions taken, and if the total probability at a certain state dips below a certain threshold, label the current state as being uneasy
    • maintain a beam of size k and calculate the entropy of their predicted actions or states (POS tag, action, which word is on the stack, etc.), equal to -(x_1*log(x_1) + x_2*log(x_2) + ... + x_n*log(x_n)), where x_i is the probability of a certain action.
    • maintain a beam of size 1 and r random paths such that at every step, each of the random paths chooses an action at random. The idea is that the random paths symbolize the likely states of the parser, which we can compare with that of the single beam
    • break up into separate uneasiness factors for each action (tagging with POS, parser actions, etc.)
  3. Sub-dependency tree of k words on stack and first word on buffer
    • features consist of arcs between k+2 nodes, including k words, a node representing all words before those k words, and a node representing all words after those k words
  4. Context-Free Grammar constraints on the arcs
    • upon REDUCing, label the parent node with a new POS tag of the variety PP, VP, NP, etc.; impose a loss or some other penalty for ill-formed grammars
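As a sketch of how the core feature in item 1 might look in code (a hypothetical helper, not MaltParser's actual API; it assumes the parser state exposes POS-tag lists for the stack and buffer):

```python
def pos_ngram_feature(stack_pos, buffer_pos, n=3):
    """n-gram of POS tags: the top n-1 stack tags plus the first buffer tag.

    stack_pos: POS tags of words on the stack (top of stack is the last element).
    buffer_pos: POS tags of words remaining on the buffer.
    """
    top = stack_pos[-(n - 1):] if n > 1 else []
    # Pad with a boundary symbol when the stack holds fewer than n-1 words.
    padded = ["<S>"] * ((n - 1) - len(top)) + top
    first_buffer = buffer_pos[0] if buffer_pos else "<B>"
    return tuple(padded + [first_buffer])
```

The returned tuple can be used directly as a key in a feature map; the boundary symbols `<S>` and `<B>` are assumptions for the start-of-parse and empty-buffer cases.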
jeisner commented 11 years ago

Note that these are features of the state, used to classify whether that state has made a recent error. ...

[This long reply was hard to read on GitHub because it wasn't parsed with Markdown for some reason, and still isn't no matter how much I edit it. So I have reposted it below.]

junekihong commented 11 years ago

As a small update: I have implemented extraction of n-gram features on MaltParser. I am also keeping track of all arcs on every item on the parsing stack, both incoming and outgoing (as well as the indexes of the words the arcs point to), so those are ready to use as features as well.

junekihong commented 11 years ago

We are going through the POS tags within the CoNLL datasets that we have, and we want to know what the POS tags mean. I found a page listing the Penn Treebank POS tags.

http://bulba.sdsu.edu/jeanette/thesis/PennTags.html

Some selected examples:

  • CC - Coordinating Conjunction
  • IN - Preposition or Subordinating Conjunction
  • JJ - Adjective
  • NN - Noun
  • VBD - Verb, past tense

jeisner commented 11 years ago

Yes, those are the Penn Treebank tags. Are they used for the English data? (Probably so.) But I think each language will have its own tagset. Do you want to know about the other languages too? We should probably stick to English for now, so that we can use the other languages as test data.


junekihong commented 11 years ago

Penn Treebank tags: We are working with English data. The tag sets used in other languages sound interesting, but I think we should work with English out of familiarity and try our parser on other languages afterwards.

The uncertainty feature: I was thinking about the uncertainty feature that we would need to detect when to use the revision operators. We had talked about it in one of our meetings some months ago, and I have tried to flesh out some of the notes I wrote.

I think that the uncertainty feature will be a number further describing the state of the parse, and there could be a threshold at which revision operators could be used. The threshold could be set by us or derived during training. Other than this, there are a few different approaches that we could take.

  1. One way I think we could encode an uncertainty feature is to attach an "uncertainty score" to each possible action that the parser could take at each step; by adding up all (or the previous K) of these scores, we could find out how uncertain a parse is at a particular point. The score for each action would come from the state of the parser:
    • The score of Arc Left decisions could come from how likely the queue node's POS tag is to take the stack node's POS tag as a child.
    • The score of Arc Right decisions could come from how likely the stack node's POS tag is to take the queue node's POS tag as a child.
    • The score of Shift could come from how likely it is to have the resulting POS n-gram sequence on the stack.
    • The score of Reduce could come from how likely it is for the stack node's POS tag to have no children.
  2. Another approach to deriving the uncertainty of a parse is to take a random sample R from all possible parses on the beam. We could compute how similar this sample is to the current parse. If all of the sampled parses were consistently similar to the state of our current parse, then the parses "agree" and we can assign a lower uncertainty value. On the other hand, if the sampled parses are all very different from (or consistently different from) our current parse, then we can assign a much higher uncertainty value. We could determine "similarity" by comparing features of the parsing state. For example, across all the different parses, we could sum up how many times the word on top of the stack matched the current parse's stack word, how many times the queue word matched our current parse's queue word, and how many times the decision made agreed as well. If we had R parses and F features, and rf features across all the parses matched, then we would have a score of (rf) / (R * F).
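The sampling approach in item 2 could be sketched like this (hypothetical attribute functions; it assumes each parse state can be queried for its top stack word, front queue word, last decision, and so on):

```python
def agreement_score(current, samples, features):
    """Fraction of feature matches between `current` and R sampled parses.

    current:  the parse state under consideration.
    samples:  list of R randomly sampled parse states.
    features: list of F functions mapping a state to a comparable value
              (e.g. top-of-stack word, front-of-queue word, last action).
    Returns rf / (R * F); higher agreement suggests lower uncertainty.
    """
    R, F = len(samples), len(features)
    matches = sum(
        1
        for sample in samples
        for f in features
        if f(sample) == f(current)
    )
    return matches / (R * F)
```

The uncertainty value itself could then be defined as, say, `1 - agreement_score(...)`, or compared against a threshold as described above.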
jeisner commented 11 years ago

Juneki, I wonder whether you read my long original reply to this issue 2 months ago, because it gave a more detailed treatment of the issues that you're raising.

The formatting of my reply seems to have been screwed up (maybe email replies are not parsed with Markdown??) so I'll repost it below and hope that works better.


Note that these are features of the state, used to classify whether that state has made a recent error. If you are using a beam of size k, then each of the k states in the beam will have its own features and should be classified separately. This will be useful when we get to backtracking, since each of the k states may make its separate decision about whether to backtrack and revise.

All of the features below tend to have more positive values if we are more worried that the current state is wrong. So they will tend to have positive weights in a classifier that predicts that we have made a recent error.

  1. - multiply the probabilities of each of the actions taken, and if the total probability at a certain state dips below a certain threshold, label the current state as being uneasy

There is no reason to use a threshold and make it a binary feature -- you can make it a real-valued feature. Let's call this the uneasiness feature.

For real-valued features, it is best to define the feature value in units of log-probability (for example, it could be a log-probability or an entropy), because you are going to exponentiate it again in the log-linear model to get a probability, like this:

probability = (1/Z) exp( ... + (feature weight)(feature value) + ... )
            = (1/Z) exp( ... + (feature weight)(log p) + ... )
            = (1/Z) * ... * p^(feature weight) * ...

so you're basically multiplying p by some other probabilities, although you learn to emphasize or de-emphasize p by first taking it to a learned power that is greater or less than 1.
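Numerically, a feature value in log-probability units behaves exactly this way: a learned weight w turns p into p^w inside the model's product. A quick check:

```python
import math

p = 0.25              # some action probability
value = math.log(p)   # feature value in log-probability units
weight = 0.5          # a learned feature weight (illustrative)

# exp(w * log p) == p ** w: the model multiplies in p raised to a learned power.
contribution = math.exp(weight * value)
assert abs(contribution - p ** weight) < 1e-9
```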

Also, in this case, multiplying all of the action probabilities is unlikely to work because it means that the later you are in the sentence, the more uneasy you are. I would instead use a time-decaying average of the log-probabilities of the most recent actions.

That is, if the actions that led up to the state at time t were a_1, a_2, ..., a_t, then define the uneasiness feature at time t as

uneasiness_t = sum_{i=1}^t (-log p(a_i)) * gamma^{t-i}

where 0 < gamma <= 1, so that less recent actions are less influential. Now the uneasiness will no longer increase linearly as a function of t, but will level off quickly, particularly for small gamma. Of course, it will still fluctuate according to the probability of the most recent actions, and these fluctuations are what the feature is designed to measure.

The uneasiness is easy to update at each action by a single multiplication and addition: uneasiness_t = -log p(a_t) + gamma * uneasiness_{t-1}, with base case uneasiness_0 = 0.

Instead of picking a single gamma, you can try several at once, and just define a different uneasiness feature for each gamma. Throw them all into the classifier and the classifier can figure out which ones work best.
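A minimal sketch of the recurrence with several gammas tracked at once (the action probabilities are illustrative, not real parser output):

```python
import math

def update_uneasiness(prev, action_prob, gamma):
    """One step of the recurrence: uneasiness_t = -log p(a_t) + gamma * uneasiness_{t-1}."""
    return -math.log(action_prob) + gamma * prev

# One uneasiness feature per decay rate; the classifier weights each copy.
gammas = [0.5, 0.9, 1.0]
uneasiness = {g: 0.0 for g in gammas}   # base case: uneasiness_0 = 0
for p in [0.9, 0.8, 0.1]:               # probabilities of successive actions
    for g in gammas:
        uneasiness[g] = update_uneasiness(uneasiness[g], p, g)
# The surprising third action (p = 0.1) dominates all three features,
# but earlier actions contribute less for smaller gamma.
```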

  - maintain a beam of size k and calculate the entropy of their
  predicted actions or states (POS tag, action, which word is on the stack,
  etc.) equal to -(x_1*log(x_1) + x_2*log(x_2) + ... + x_n*log(x_n)) where
  x_i symbolizes the probability of a certain action.

Let's call this one the entropy feature. It has the unusual property that it has the same value for all of the k states in the beam.

But let's be more precise to make sure we are in agreement. Let p_1, ..., p_k be the relative probabilities of the different states on the beam. So sum_{i=1}^k p_i = 1.

Now select some attribute F of the state, such as the identity or POS tag of the top word on the stack. In your notation, F has possible values 1 ... n, and the x values are found as follows:

initialize all x_j to 0
for i = 1 to k:
    j = F(state i)
    x_j += p_i

Note that sum_{j=1}^n x_j = 1. Now you can compute the entropy as above.
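Putting the two steps together, a sketch (F is any attribute function; `beam` pairs each state with its relative probability p_i, and the p_i sum to 1):

```python
import math
from collections import defaultdict

def entropy_feature(beam, F):
    """Entropy of the beam's distribution over values of attribute F.

    beam: list of (state, p_i) pairs with the p_i summing to 1.
    F:    function mapping a state to one of its attribute values.
    Returns -sum_j x_j * log(x_j) over the induced distribution x.
    """
    x = defaultdict(float)
    for state, p in beam:
        x[F(state)] += p
    return -sum(p * math.log(p) for p in x.values() if p > 0)
```

The feature is 0 when all beam states agree on F, and grows as the probability mass spreads over more values.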

You have one such entropy feature for each attribute F. What attributes F do you want to consider? Make sure to include attributes that look at the arcs.

BTW, presumably MaltParser already keeps track of the top k states (the beam) but you may have to add code to keep the p_i values up to date. You can work all along with relative probabilities, so you won't have to worry about underflow. It's easy: If you take action a from state 7, then the probability of the resulting state is p_7 * p(action a | state 7). If you try all actions from all of the k states, then the larger set of resulting states still has probability that sums to 1. The beam method throws away all but the top k, and then you should normalize the probabilities of the top k so that they sum to 1 again.

(Note: At the start of the parse there will only be 1 state in the beam (the initial state), and it may take a few steps before the beam reaches its maximum size. My notation above doesn't properly distinguish between the current number of states and the maximum number of states; it calls them both k.)
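A sketch of that bookkeeping (the `actions` function, which yields (successor state, p(action | state)) pairs, is a hypothetical stand-in for the parser's transition system, not MaltParser's actual interface):

```python
def step_beam(beam, actions, k):
    """One step of beam search with relative probabilities.

    beam:    list of (state, p) pairs with the p summing to 1.
    actions: function mapping a state to a list of
             (successor_state, p(action | state)) pairs.
    Expands every state, keeps the top k successors, and renormalizes
    so the surviving probabilities sum to 1 again.
    """
    successors = [
        (nxt, p * q)
        for state, p in beam
        for nxt, q in actions(state)
    ]
    successors.sort(key=lambda sp: sp[1], reverse=True)
    top = successors[:k]
    z = sum(p for _, p in top)
    return [(s, p / z) for s, p in top]
```

Because everything is relative and renormalized each step, there is no underflow to worry about.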

  - maintain a beam of size 1 and r random paths such that at every
  step, each of the random paths choose an action at random. The idea is that
  the random paths symbolize the most likely states of the parser, which we
  can compare with that of the single beam

Actually, for this method, the beam doesn't have to have size k=1. It's just that this feature continues to make sense even when k=1, unlike the entropy feature.

Let's call this the survey feature, because it samples the population of all possible states at the present time to vote on whether a given state has the right value of attribute F. Again, there is a separate survey feature for each attribute F. The value of this feature on a given state is -log (fraction of the r random states such that F(the random state) = F(the given state))

Except that this could lead to log 0. So instead of considering only the r random states, throw in the given state as well. Thus, you are going to take the fraction of r+1 states that agree with the given state. Since the given state certainly agrees with itself, this fraction is at least 1/(r+1).

The r random states are chosen by a simpler method than beam search. There are always r states in this set and it may contain duplicates -- at the start of the parse, all r states are the initial state. At each time step, each state picks a random action from its probability distribution over actions, and is updated by taking that action.

For k>1, you can also do a *council feature* that asks only the k leaders to vote, instead of asking a random sample of the population: -log (sum of p_i for all states i in the beam such that F(state i) = F(the given state))
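Both voting features might be sketched as follows (hypothetical attribute function F; `random_states` are the r sampled paths, `beam` pairs each state with its relative probability p_i):

```python
import math

def survey_feature(given, random_states, F):
    """-log of the smoothed fraction of r+1 states agreeing with `given` on F.

    The given state votes too, so the fraction is at least 1/(r+1)
    and the log is always defined.
    """
    agree = 1 + sum(1 for s in random_states if F(s) == F(given))
    return -math.log(agree / (len(random_states) + 1))

def council_feature(given, beam, F):
    """-log of the total beam probability mass agreeing with `given` on F.

    beam: list of (state, p_i) pairs; the given state should itself be
    on the beam, so the mass is positive.
    """
    mass = sum(p for s, p in beam if F(s) == F(given))
    return -math.log(mass)
```

Both features are 0 when everyone agrees with the given state, and grow as the given state becomes an outlier.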

  - break up into separate uneasiness factors for each action
 (tagging with POS, parser actions, etc.)

Here I think you mean different attributes F. These are not necessarily actions, but just properties of the parser state. (Yes, one attribute is the last action that was taken.)

3. Sub-dependency tree of k words on stack and first word on buffer

  • features consist of arcs between k+2 nodes, including k words, a node representing all words before those k words, and a node representing all words after those k words

Ok. So you have as many features as you have trees on that many nodes? What is k and how many features will you therefore have? If two trees have the same topology but different POS tags, do they yield the same feature?

4. Context-Free Grammar constraints on the arcs

  • upon REDUCing, label the parent node with a new POS tag of the variety PP, VP, NP, etc.; impose a loss or some sort of consequence for not-well-formed grammars

I'm unclear on how this would be defined, and where you would get data on how to label the parent node and whether the tree is well-formed.