jayantpande opened this issue 8 years ago
This has come up a few times in previous issues, but I really don't recommend training your own models. If you're an NLP researcher, similar techniques could be adapted to training on your data set, but the models in this repo are designed for training on every address in OpenStreetMap. There are different assumptions that can be made with millions of addresses than with smaller data sets.
If libpostal is not working well in certain places, it might be easier to just post the specific issues you're seeing. Also of note, there's a new model being trained in the parser-data branch that handles more countries, recognizes more place names around the world, and contains new labels for things like PO boxes, apartment numbers in many languages, etc.
I would like to know if it can parse a string like this:
"india is consisting of many states for eg tamil nadu, andra prsdesh, utarnchal'"
States parsed - uttarchal, tamil nadu, andra pradesh
P.S. - I deliberately put in incorrect names of the states.
*uttaranchal
@rahulroxx not really. That's more like general named entity recognition. Technically I guess it's possible to extract place names from arbitrary text using our gazetteers, but it's not on the road map for libpostal.
As far as automatic spelling correction goes, it's again technically possible but not implemented. It has been a requested feature, since many people use libpostal in geocoders dealing with user input, which is often misspelled. Spelling is only really an issue when it comes to place names. If one word in a road name is misspelled, libpostal will just treat it as an unknown word and, most of the time, classify it correctly from the surrounding words.
What are the features used for training the model with the perceptron? More specifically, which encoding did you use to convert "string addresses" into the floats/ints a perceptron can handle?
@shikharcic as is common in NLP (maximum-entropy Markov models, conditional random fields, etc. - see the work of Adwait Ratnaparkhi, Michael Collins, John Lafferty, Andrew McCallum, etc.), the tokens in the string are a sequence x of length T with labels y (also a sequence of length T, so each token has a label). Predictions are made left-to-right over the sequence using a feature function Φ that is called at each timestep i with the arguments Φ(x, i, y_i-2, y_i-1), where y_i-2 and y_i-1 are the model's own predictions for the two previous tags (this is what makes it a sequence model rather than a general multi-label classification task). That function returns an array of string features, which might include things like "word i=franklin" or more complex features like "y_i-1=house_number and word i=franklin and word i+1=avenue". We additionally pre-compute some indices of known place names, etc., so phrases like "New York" can be grouped together. This is probably more important to understanding a sequence model than how the weights are optimized.
To answer your question: each string feature is mapped to a row index in the weight matrix, which is N x M (where N is the # of features and M is the # of labels), as in any other multi-class classification model. Since word usage is Zipfian-distributed (frequent words are very frequent, long tail of less common words), there are typically millions or tens of millions of distinct features in a model like this, so this matrix is quite large. However, one of the nice things about perceptron learning is that the weights start at zero and are only updated when the model makes a mistake. As the model iterates over the randomly-shuffled training examples and starts to converge, it makes fewer mistakes, meaning the weight matrix can be kept incredibly sparse for rare features.
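To make the sparsity point concrete, here is a rough sketch of perceptron learning over string features, where the weight "matrix" is a dict-of-dicts so a row only materializes when a feature is involved in a mistake. The toy label set, the simplified feature function, and all names here are assumptions for illustration, not libpostal's actual code.

```python
# Sketch of sparse perceptron learning over string features.

# Toy label set and a simplified feature function -- illustrative only.
LABELS = ["house_number", "road", "city"]

def extract_features(tokens, i, y_prev1):
    feats = ["bias", "word i=" + tokens[i].lower(), "y_i-1=" + y_prev1]
    if i + 1 < len(tokens):
        feats.append("word i+1=" + tokens[i + 1].lower())
    return feats

# feature -> {label: weight}; a row exists only after an update touches it,
# which keeps the model sparse for rare features.
weights = {}

def predict(features):
    """Score each label as the sum of its weights over the active features."""
    scores = {label: 0.0 for label in LABELS}
    for f in features:
        for label, w in weights.get(f, {}).items():
            scores[label] += w
    return max(scores, key=scores.get)

def train_example(tokens, gold_tags):
    """One left-to-right pass; weights change only when a prediction is wrong."""
    y_prev1 = "<START>"
    for i, gold in enumerate(gold_tags):
        feats = extract_features(tokens, i, y_prev1)
        guess = predict(feats)
        if guess != gold:
            for f in feats:
                row = weights.setdefault(f, {})
                row[gold] = row.get(gold, 0.0) + 1.0    # promote the correct label
                row[guess] = row.get(guess, 0.0) - 1.0  # demote the mistaken one
        y_prev1 = guess

# A single toy update:
train_example(["123", "franklin", "avenue"], ["house_number", "road", "road"])
```

In practice an averaged perceptron and multiple shuffled passes over the data would be used, but the key property shown here is the same: untouched features never get a weight row at all.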
Thanks for the concise explanation.