openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Features used to train the average perceptron in the parser #164

Closed: ishaan007 closed this issue 7 years ago

ishaan007 commented 7 years ago

Could you please tell me the features used to train the averaged perceptron model that parses the OSM addresses?

albarrentine commented 7 years ago

From a previous answer:

As is common in NLP (maximum-entropy Markov models, conditional random fields, etc. - see the work of Adwait Ratnaparkhi, Michael Collins, John Lafferty, Andrew McCallum, etc.), the tokens in the string are a sequence x of length T with labels y (labels are also a sequence of length T, so each token has a label). Predictions are made left-to-right over the sequence using a feature function Φ that is called at each timestep i with the following arguments: Φ(x, i, y_i-2, y_i-1), where y_i-2 and y_i-1 are the model's own predictions for the two previous tags (this is what makes it a sequence model rather than general multi-class classification). That function returns any number of string features (treated as binary indicator features in this case, but they can also be real-valued) which might include things like "word i=franklin" or more complex features like "prev tag=house_number and word i=franklin and word i+1=avenue". We additionally pre-compute some indices like known place names, etc. so phrases like "New York" can be grouped together.
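To make that concrete, here is a rough Python sketch of what a feature function like Φ might return (illustrative only; libpostal's real feature extraction is in C in address_parser.c, and the template names here are simplified):

def feature_function(x, i, prev2_tag, prev_tag):
    # x is the token sequence; each returned string is a binary indicator feature.
    word = x[i]
    features = [
        'bias',
        'word|{}'.format(word),
        'prev tag|{}'.format(prev_tag),
        'prev tag+word|{}|{}'.format(prev_tag, word),
        'prev 2 tags+word|{}|{}|{}'.format(prev2_tag, prev_tag, word),
    ]
    if i > 0:
        features.append('prev word+word|{}|{}'.format(x[i - 1], word))
    if i + 1 < len(x):
        features.append('word+next word|{}|{}'.format(word, x[i + 1]))
    return features

# e.g. feature_function(['DDD', 'franklin', 'avenue'], 1, 'START', 'house_number')
# includes 'word|franklin' and 'prev tag+word|house_number|franklin'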

Each string feature is mapped to a row index in the weight matrix, which is N x M (where N is the # of features and M is the # of labels) as in any other multi-class classification model. Since word usage is Zipfian-distributed (frequent words are very frequent, long tail of less common words), there are typically millions or tens of millions of distinct features in a model like this so this matrix is quite large. However, one of the nice things about perceptron learning is that the weights start at zero and are only updated when the model makes a mistake. As the model iterates over the randomly-shuffled training examples and starts to converge, it makes fewer mistakes, meaning the weight matrix can be kept incredibly sparse for rare features. It's also possible to make the weights even sparser by only considering weights which have participated in >= n updates and treating the rest as 0s.
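For intuition, here is a toy Python sketch of how those string features index into the weight matrix at prediction time (numpy and a dense matrix are used here purely for clarity; libpostal's real weights are sparse and implemented in C):

import numpy as np

# Toy dimensions: N string features x M labels; all weights start at zero.
feature_index = {'bias': 0, 'word|franklin': 1, 'word+next word|franklin|avenue': 2}
labels = ['house_number', 'road', 'city']
weights = np.zeros((len(feature_index), len(labels)))

def predict(active_features):
    # Sum the weight rows of the active (binary) features, take the argmax label.
    scores = np.zeros(len(labels))
    for f in active_features:
        row = feature_index.get(f)
        if row is not None:  # unseen features simply contribute nothing
            scores += weights[row]
    return labels[int(np.argmax(scores))]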

The feature extraction code is defined in address_parser.c, and there are some new features being worked on in the parser-data branch.

Features used in master are:

  • bias term (added to almost every example, acts as an intercept)
  • current normalized word, where normalized means (a rough sketch follows these lists):

    • lowercased
    • Latin-ASCII transliterated (no longer done in parser-data as this can remove accents from the input in a way that's impossible to recover)
    • HTML entities converted
    • final/internal periods removed
    • digits normalized to "D" so there's not a separate feature for "123" and "124"; they both just normalize to "DDD"
    • for ideographic languages, a "word" is a single ideogram (though something like mecab could be added for better word segmentation)
    • for known phrases, the "word" may be a complete phrase like "New York City"
    • words that occur fewer than n times in the training data are replaced with "UNKNOWN" (we use n=5), which helps the model be more robust to unknown words/misspellings at test time
  • whether the word is part of a known phrase

    • the phrases from libpostal's dictionaries for street types, etc.
    • phrases compiled from the training corpus for place names (cities, states, countries, etc.)
  • previous word
  • next word
  • bigrams (prev word + current word, current word + next word)
  • a separate feature if the phrase is unambiguous ("Montréal" only ever refers to a city)
  • whether the word starts with a known prefix or ends with a known street suffix (useful in Germanic languages where many words may be concatenated together to form a street name, but it will most likely end in "straße")
  • previous tag + current word
  • previous tag
  • previous 2 tags + current word
  • previous 2 tags
  • previous tag + previous word

And for the upcoming release some of the new features are:

  • n-grams for the "unknown" words (occurred fewer than n times in the training set)
  • for unknown words that are hyphenated, each of the individual subwords if frequent enough, and their n-grams otherwise
  • an index of postcodes and their admin contexts built from the training data (the intuition is that something like "10001" could be a postcode or a house number, but if words like "New York", "NY", "United States", etc. are to its right or left, it's more likely to be a postcode)
  • for first words that are unknown (could be part of a venue/business name, could be a rare/misspelled street), a feature which finds the relative position of the next number and the next address phrase if present. Usually if the parser gets the first word in the string correct, it will get the entire string correct.
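A rough Python sketch of the word normalization described above (illustrative only; transliteration and HTML-entity handling are omitted, and n=5 is just the example threshold mentioned):

def normalize_word(word, vocab_counts, min_count=5):
    # Lowercase and map every digit to 'D', so '123' and '124' both become 'DDD'
    # and share a single feature.
    word = ''.join('D' if c.isdigit() else c for c in word.lower())
    word = word.replace('.', '')  # drop final/internal periods
    # Words seen fewer than min_count times in the training data become 'UNKNOWN'.
    if vocab_counts.get(word, 0) < min_count:
        return 'UNKNOWN'
    return word

# e.g. normalize_word('123', {'DDD': 100, 'franklin': 42}) -> 'DDD'
#      normalize_word('Fraknlin', {'DDD': 100, 'franklin': 42}) -> 'UNKNOWN'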

ishaan007 commented 7 years ago

Thanks for such a detailed answer. I have a few other small queries:

  1. The last feature you are using, "previous tag + previous word": can you tell me the reason for using this combined feature when you are also using both of them separately?
  2. Are you combining the separate features with a space and using that combined string, or something else?
  3. Could you give a small example of how you are encoding the feature?

I am working on a college project and will be grateful if you could answer these.

albarrentine commented 7 years ago
  1. All of the features which do not include the current word are designed to let known and unknown words/phrases share statistical strength. Intuitively, prev tag + prev word can help in cases where a previous word might take on a non-standard meaning (the phrase "New York" used as part of a venue name like "New York Health and Racquet Club" instead of the city - could help with "New York" as a venue followed by any random noun whether it was seen in the training data or not). Also empirically it improves held-out performance.
  2. Each feature uses a template prefix like those above, followed by the dynamically-generated contents, delimited by pipes ("|").
  3. An example feature that would be generated at t=4 in the "New York Health and Racquet Club" example would be "word+next word|racquet|club". The feature function is called once for each timestep, and may generate different features based on where it is in the string. At t=3, that same feature would instead generate "word+next word|and|racquet" (a small sketch follows this list).
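For instance, a tiny sketch of how that one template varies per timestep over the example string (hypothetical code, not libpostal's):

tokens = ['new', 'york', 'health', 'and', 'racquet', 'club']

for t, word in enumerate(tokens):
    if t + 1 < len(tokens):
        print(t, 'word+next word|{}|{}'.format(word, tokens[t + 1]))
# t=3 prints "word+next word|and|racquet", t=4 prints "word+next word|racquet|club"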

In the new release I'm adding an option in the parser's command-line client to print features at each timestep for debugging.

ishaan007 commented 7 years ago

Hi, my reason for asking the third part ("Please can you give a small example of how you are encoding the feature") was that when I use such features the model size blows up exponentially. Are you using some optimized encoding tactic to keep the model size in check? If yes, can you throw some light on that technique? Are you doing some feature hashing?

albarrentine commented 7 years ago

No feature hashing or anything like that. Bigram features will increase the model size dramatically, though not exponentially if the perceptron learning is properly implemented. Again, the averaged perceptron only ever has non-zero feature weights after it's made a mistake for a given (feature, class) pair, which means that only a fraction of the bigrams will ever be used, and the rest never need to take up any space.

For libpostal, there are two different kinds of data structures used: read-write structures for training time (memory is not as much of a concern here) and read-only structures for runtime (we want to minimize the model's footprint for users' applications).

At training time said data structures look like the following (writing this in Python for succinctness but in libpostal this is all in C, so there's some additional memory savings there):

# {feature: id}
features = {'feature1': 0, 'feature2': 1, 'feature3': 2}
# {label: id}
classes = {'label1': 0, 'label2': 1, 'label3': 2}
# {feature_id: {label_id: weight}}
weights = {
    0: {0: 12.345, 1: -1.034},
    1: {0: -0.32, 1: 1.923},
    2: {1: -4.343, 2: 23.904}
}

Remember that a feature is only assigned an ID/space in the dictionary the first time it gets updated (i.e. the first time the model gets an example wrong where that feature was observed). In the first thousand or so examples, the model is wrong virtually every time, but that drops off quickly thereafter (at least for the address parsing task). After a million examples or more, the model will be getting almost everything right, so the only features that get updated and thus added to the dictionary are those that are helping the model improve. The weights also benefit from sparsity because in most cases there should only be a few correct classes per feature ("New York" can be a city, state, venue name, road name, etc. but not a house number or a country) and a small number of classes with negative weights corresponding to incorrect guesses that the parser made for examples where that feature was observed.
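A toy sketch of that mistake-driven update on the structures above (illustrative Python, not libpostal's training code; the running totals needed for weight averaging are omitted):

def update(features, classes, weights, active_features, truth, guess):
    # Weights are only touched when the guess is wrong, so features seen only in
    # correctly-predicted examples never get an ID or any storage at all.
    if guess == truth:
        return
    for f in active_features:
        # Lazily assign the feature an ID the first time it participates in an update.
        feat_id = features.setdefault(f, len(features))
        row = weights.setdefault(feat_id, {})
        row[classes[truth]] = row.get(classes[truth], 0.0) + 1.0
        row[classes[guess]] = row.get(classes[guess], 0.0) - 1.0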

After training is finished, the model's data structures can be significantly compressed into more efficient read-only versions. The features hashtable is converted to a trie (allows features with the same prefix to share storage), and the weights are converted to a sparse matrix in compressed sparse row (CSR) format. Overall the runtime model is still hundreds of megabytes, which is not uncommon on data sets of comparable size, but still relatively small for how much it's learned.
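For illustration, here's a minimal Python sketch of the weights-to-CSR conversion (scipy is just a stand-in for the format; libpostal implements its own trie and CSR structures in C, and the features dictionary is similarly converted to a trie so features sharing a prefix share storage):

from scipy.sparse import csr_matrix

def compress_weights(weights, num_features, num_labels):
    # Flatten the {feature_id: {label_id: weight}} training structure into a
    # read-only CSR matrix; entries that were never touched cost nothing.
    rows, cols, data = [], [], []
    for feat_id, label_weights in weights.items():
        for label_id, w in label_weights.items():
            rows.append(feat_id)
            cols.append(label_id)
            data.append(w)
    return csr_matrix((data, (rows, cols)), shape=(num_features, num_labels))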

piyush2021 commented 5 years ago

(quotes albarrentine's earlier answer, including the feature lists, in full)
Actually, I am using a library (pycrfsuite) for the CRF implementation. Would you please explain how I could include the model's predicted tag of the previous word in the features for the current word? Please help ... Thanks in advance :)