openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.

Probability of correct parse? #63

Open Rickasaurus opened 8 years ago

Rickasaurus commented 8 years ago

I was wondering if it might be possible to get some indicator that would help in telling whether a parse went well or not. This would be very helpful for thresholding out examples that might be good to submit as training data.

albarrentine commented 8 years ago

Not a meaningful probability, unfortunately. Some learning models (e.g. logistic regression, used in libpostal for language classification) can return well-calibrated probability distributions over the labels. The averaged perceptron, used for the parser, is more similar to reinforcement learning. When learning the weights, we go through each address in the training data, run the same inference used in parse_address, compare the predicted tag to the true tag, and adjust the weights only when the prediction is wrong. The learned weights are more like scores, used for ranking.
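To make that update rule concrete, here is a minimal sketch of an error-driven, perceptron-style learner. It is illustrative only; the feature names, tag set, and loop structure are assumptions, not libpostal's actual code:

```python
from collections import defaultdict

# (feature, tag) -> weight; weights start at zero and are unbounded
weights = defaultdict(float)

def score(features, tag):
    # The score is a dot product of the weights with a binary feature
    # vector: sum the weights of the features that fired for this token.
    return sum(weights[(f, tag)] for f in features)

def predict(features, tags):
    # Inference is just an argmax over the candidate tags.
    return max(tags, key=lambda t: score(features, t))

def update(features, true_tag, tags):
    guess = predict(features, tags)
    if guess != true_tag:
        # Weights move only when the prediction is wrong; nothing keeps
        # them positive or on a common scale.
        for f in features:
            weights[(f, true_tag)] += 1.0
            weights[(f, guess)] -= 1.0
```

(The "averaged" part, omitted here, keeps a running average of each weight over all updates to reduce variance.)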

If you know you have addresses (i.e. not free-text input from a geocoder, etc.), any parse with duplicate fields is likely an error. There can still be other types of errors, but that's the most useful thresholding heuristic I can think of at the moment.
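For example, with the Python bindings (pypostal), where parse_address returns a list of (value, label) pairs, the duplicate-field check could look roughly like this sketch:

```python
from collections import Counter
from postal.parser import parse_address

def has_duplicate_fields(address):
    labels = [label for _, label in parse_address(address)]
    # More than one house_number, road, postcode, etc. in a known-address
    # input is usually a sign the parse went wrong.
    return any(count > 1 for count in Counter(labels).values())
```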

Rickasaurus commented 8 years ago

Hrm, what about the aggregate score maybe normalized by the number of non-zero components? It doesn't need to be super meaningful, just enough to histogram and toss some thresholds on top of.

albarrentine commented 8 years ago

In the absence of a probability distribution, for that sort of thresholding the weights would have to be a) positive and b) on roughly the same scale (e.g. regularized so they tend toward 0). Unfortunately neither is true for averaged perceptron.

The weights are real values between negative infinity and positive infinity, and thus the scores (the dot product of the weights with a binary feature vector) are in the same range. It wouldn't really be meaningful to compare scores across different predictions; they're only useful for taking the argmax over classes for a particular token.

I want to avoid returning things that look like probabilities, confidence scores, etc. because in my experience, developers inevitably treat numbers between 0 and 1 like they're probabilities, and usually end up disappointed when they're not. At some point it may be worth using Conditional Random Fields for the parser, which in addition to having slightly better accuracy, can return probabilities for entire parses. In that case I'd be willing to expose the n-best results and their scores.
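As a hypothetical illustration (made-up numbers) of why the raw scores don't threshold well across predictions:

```python
# Per-token scores from the parser are only used for an argmax; the
# magnitudes below are invented and not on any common scale.
scores_token_a = {"house_number": 12.7, "road": -3.1, "city": 0.4}
scores_token_b = {"house_number": -0.2, "road": 1.1, "city": 0.9}

best_a = max(scores_token_a, key=scores_token_a.get)  # 'house_number'
best_b = max(scores_token_b, key=scores_token_b.get)  # 'road'

# Comparing 12.7 with 1.1 says nothing about which prediction is more
# reliable, so a single histogram-and-threshold over scores isn't meaningful.
```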

I'd recommend looking for parses with duplicate fields first. That should catch the majority of the errors and beyond that some light random spot checking is usually sufficient for spotting problematic patterns.

As mentioned in other issues, there are certain types of addresses (PO boxes, unit numbers, corporate addresses with recipient/title/department) that the model doesn't really know what to do with because they're not present in the OpenStreetMap training set, where house numbers are usually as granular as it gets. The next release generates training addresses that should better resemble what we see in the wild. In the meantime I'm happy to look at failure patterns as they come up.