openvenues / gopostal

Go (cgo) interface to libpostal for fast international address parsing/normalization
MIT License

Are Parsed Address Labels Unique? #2

Closed. theory closed this issue 8 years ago

theory commented 8 years ago

Are the labels in all of the parsed address components unique for a single address? If so, it might be useful to create a type on []ParsedComponent that can convert to and from a map, and marshal and unmarshal JSON.

albarrentine commented 8 years ago

Not necessarily. This article has some examples of where duped components can occur: https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/. Not saying the libpostal parser handles all of them correctly, just that it's possible.
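Because labels can repeat, converting a parse to a plain map of label to string would drop values. A minimal sketch of one alternative, assuming a local `ParsedComponent` stand-in (not imported from gopostal) and a hypothetical helper `GroupByLabel`, is to collect a slice of values per label:

```go
package main

import "fmt"

// ParsedComponent is a local stand-in for the label/component pair
// discussed in this thread; it is not the gopostal type itself.
type ParsedComponent struct {
	Label     string
	Component string
}

// GroupByLabel (hypothetical helper) collects every component under its
// label, in input order, so repeated labels are kept rather than overwritten.
func GroupByLabel(components []ParsedComponent) map[string][]string {
	grouped := make(map[string][]string)
	for _, c := range components {
		grouped[c.Label] = append(grouped[c.Label], c.Component)
	}
	return grouped
}

func main() {
	// Made-up parse with a repeated "road" label, e.g. an intersection.
	components := []ParsedComponent{
		{Label: "road", Component: "main street"},
		{Label: "road", Component: "first avenue"},
		{Label: "house_number", Component: "123"},
	}
	grouped := GroupByLabel(components)
	fmt.Println(grouped["road"]) // both road values are still present
}
```

The trade-off is that the relative order of components with different labels is lost, which is one reason an ordered array remains the safer primary representation.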

As JSON, a parse could be represented as:

{
    "components": [
        {"label": "house_number", "component": "123"},
        {"label": "road", "component": "main street"}
    ]
}

We could wrap the ParsedComponent array in a response type that can be easily serialized. Thoughts?
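A rough sketch of what such a wrapper might look like, using `encoding/json` struct tags. The names `ParseResponse` and `EncodeParse` are hypothetical, and `ParsedComponent` is redefined locally here rather than taken from gopostal:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ParsedComponent is a local stand-in using the field names from the
// JSON example above.
type ParsedComponent struct {
	Label     string `json:"label"`
	Component string `json:"component"`
}

// ParseResponse is a hypothetical wrapper that keeps the components as an
// ordered array, so duplicate labels survive serialization.
type ParseResponse struct {
	Components []ParsedComponent `json:"components"`
}

// EncodeParse (hypothetical helper) serializes a parse in the wrapped form.
func EncodeParse(components []ParsedComponent) (string, error) {
	b, err := json.Marshal(ParseResponse{Components: components})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := EncodeParse([]ParsedComponent{
		{Label: "house_number", Component: "123"},
		{Label: "road", Component: "main street"},
	})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(out)
}
```

Marshaling the `[]ParsedComponent` slice directly, without the wrapper, would yield a bare JSON array instead of the object form.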

theory commented 8 years ago

Keys: Okay.

Format: Seems to me it should just be an array:

[
    {"label": "house_number", "component": "123"},
    {"label": "road", "component": "main street"}
]

Unless you think a parsed address object could ultimately have other attributes, like the original string, or an array of expanded addresses, or a single canonical normalized representation or some such. But then I expect that would be represented by some other object that would reference the array, so the simple array would still apply.

Are "label" and "component" formal geocoding terms?

albarrentine commented 8 years ago

Array works too - that's what we do in most of the other bindings. Agreed that a new representation would need to reference the array anyway.

Ha, no, "label" and "component" aren't the official lingo (if there is such a thing), just the names I use in the underlying C structs. "Label" is commonly used in the NLP/machine learning world in the context of supervised learning tasks, e.g. a classifier predicts a label for a particular word in a sentence. In most similar tasks like part-of-speech tagging, chunking, named entity recognition, etc. we'd be labelling each token (so the keys would be "label" and "token"). With the libpostal address parser, when we return a result, we roll up adjacent tokens with the same class into a single string, which I'm calling "component." Happy to entertain other ideas though, especially while people are still trying it out.

theory commented 8 years ago

"Label" is good, though so would "name" or "part". "Component" feels a little odd, though, since from the POV of someone using this, it's just the value for the label. Which is why I tend to like "value" for this sort of thing.

albarrentine commented 8 years ago

Cool, "Value" it is: https://github.com/openvenues/gopostal/commit/2893ce1e67534fd8738ded1ba0ac86fb7580aa3c

theory commented 8 years ago

Ah, I like the symmetry with the length of "Label". Hadn't noticed that before. :-)

albarrentine commented 8 years ago

Ha yes, seems very Go-like.

13k commented 7 years ago

Sorry to comment on a closed issue, but this seems related. Shouldn't the go package expose the address components labels as constants?

I'm asking that for two reasons:

1) Currently there's no way to know what libpostal is returning as component labels, unless you read libpostal's code (which I had to do). This is true for libpostal, pypostal, ruby_postal and gopostal (the ones I checked). I couldn't find any mention of address component labels in either the API or the prose documentation.
2) As we all know, using raw strings that are actually enum-like constants is pretty bad coding practice.

albarrentine commented 7 years ago

They're not quite constants. The reason they're defined as such in C is for building certain indices during training (phrase gazetteers for cities and admin boundary names, etc.). There are more tags coming in the next release, and they will change over time, so while I agree that they do need to be documented better, I'd rather that people not rely on a fixed set of tags in client applications.

What you're calling "bad coding practice" others might call flexibility. It's common in the NLP world to use string labels since many times the same tagging model can be repurposed for multiple tasks with different labels. I could have very easily made libpostal return enums from C and then redefine them in every binding and then update every binding (or petition its author, since some bindings are maintained by people other than me) when a new label is added. Indeed I could do that, and then web APIs could return results as a CSV instead of JSON, but there are times when flexibility is important. If an enterprising developer wanted to train the model on her own data set with completely different fields, that's entirely possible. Using string field names is one of the reasons I was able to publish a backward-compatible preview of the much-improved model in libpostal's parser-data branch, which includes new fields like "unit", "level", etc. That was trained using the same code in master on different data, and nothing broke. It means the bindings can stay relatively lightweight and decouples them from changes to the C library, which is in turn decoupled from most changes to the training data. IMHO that's a good coding practice if anything.

13k commented 7 years ago

I'm sorry, I wasn't clear enough and I apologize if I sounded offensive.

What I meant is that the bad practice will be present in client code, my code, by not having exported constants from the bindings. The API between C and bindings is fine (and I shouldn't care about it).

My reasoning for this is that I have to write code that decides upon the component label (via a switch or series of conditionals) and take action based on this. Hard-coding strings in these conditionals, in client code, is the bad practice. Like you said, if the labels are changed, my code is broken and depending on how I did it (without defining constants myself), hard to change also. Even if I define constants and upgrade the binding, there's no guarantee that those constants are still using the correct strings. There's a scenario where, for example, I make some components optional and ignore those missing components from addresses and my code doesn't break. If strings change in libpostal, I would be silently ignoring actual existing information, based on the fact that I'm actually testing against incorrect strings.
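One way to guard against the silent-ignore scenario is to treat any label the client does not recognize as something to report rather than drop. This is only a sketch: `Address` and `BuildAddress` are hypothetical client-side names, `ParsedComponent` is a local stand-in for the binding's type, and the label strings are ones that appear in this thread:

```go
package main

import "fmt"

// ParsedComponent is a local stand-in for the binding's label/value pair.
type ParsedComponent struct {
	Label string
	Value string
}

// Address holds the fields this hypothetical client cares about.
type Address struct {
	HouseNumber string
	Road        string
	Postcode    string
}

// BuildAddress fills an Address and also returns every label it did not
// recognize, so a renamed or newly added label becomes visible instead of
// being silently skipped. If a label repeats, the last value wins.
func BuildAddress(components []ParsedComponent) (Address, []string) {
	var addr Address
	var unknown []string
	for _, c := range components {
		switch c.Label {
		case "house_number":
			addr.HouseNumber = c.Value
		case "road":
			addr.Road = c.Value
		case "postcode":
			addr.Postcode = c.Value
		default:
			unknown = append(unknown, c.Label)
		}
	}
	return addr, unknown
}

func main() {
	// Made-up parse; "unit" stands in for a label the client predates.
	addr, unknown := BuildAddress([]ParsedComponent{
		{Label: "house_number", Value: "123"},
		{Label: "road", Value: "main street"},
		{Label: "unit", Value: "apt 4"},
	})
	fmt.Printf("%+v\n", addr)
	fmt.Println("unrecognized labels:", unknown)
}
```

Logging or counting the unrecognized labels in production would make a label change after an upgrade show up quickly, even when no field is strictly required.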

IMO, the distance between the client code using a binding and libpostal's internal structures is far enough that such upgrades on binding libraries would break things more often than not in cases of string changes. It's hard enough to check a direct dependency's code when upgrading a version; it's twice as difficult to also check a dependency's dependency.

That was my reasoning. All that said, though, I didn't know that training libpostal with different data sets could result in different data structures being returned. That's indeed data-specific, so component labels aren't what I thought initially. I thought the results in address parsing would use a set of pre-defined, normalized, fixed keys for the component labels and the parser would translate whatever keys the data set gives into the normalized key set (that would be a logical conclusion if you think that the goal of address parsing is to decompose an address into a set of previously known components).

In this case, I'm not sure how to handle it then. How can one reliably predict the label names so as to avoid being surprised by changes?

Thanks for taking the time to give a pretty explanatory answer.

13k commented 7 years ago

As a complementary note, I see that what you said about labels being data-specific and unpredictable is true. The C constants I mentioned before are different than what I'm getting in results. Postal codes aren't being returned in a postal_code component, instead they are coming in a postcode component:

$ go run postal/parser/main.go "718 Hermosa Ave, Hermosa Beach, CA, 90254"
[{Label:house_number Value:718} {Label:road Value:hermosa ave} {Label:suburb Value:hermosa beach} {Label:state Value:california} {Label:postcode Value:90254}]

So exporting as constants, even if they are "best guesses", is not exactly a solution.

albarrentine commented 7 years ago

Ok, have published some documentation in the C repo for the labels that are used by our current parser in master, so it should be ok to define constants off of that. Will update that when the new release is out.
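For client code, defining constants might look roughly like the sketch below. It assumes only the labels that appear in the sample output earlier in this thread; the published documentation in the C repo is the authoritative list and is not reproduced here. The constant names and the `FirstValue` helper are hypothetical, and `ParsedComponent` is a local stand-in for the binding's type:

```go
package main

import "fmt"

// Labels taken from the sample parser output in this thread. This is not a
// complete list, and it may change between libpostal releases.
const (
	LabelHouseNumber = "house_number"
	LabelRoad        = "road"
	LabelSuburb      = "suburb"
	LabelState       = "state"
	LabelPostcode    = "postcode"
)

// ParsedComponent is a local stand-in for the binding's label/value pair.
type ParsedComponent struct {
	Label string
	Value string
}

// FirstValue (hypothetical helper) returns the first value carrying the
// given label and reports whether one was found.
func FirstValue(components []ParsedComponent, label string) (string, bool) {
	for _, c := range components {
		if c.Label == label {
			return c.Value, true
		}
	}
	return "", false
}

func main() {
	// Components copied from the sample output above.
	components := []ParsedComponent{
		{Label: LabelHouseNumber, Value: "718"},
		{Label: LabelRoad, Value: "hermosa ave"},
		{Label: LabelSuburb, Value: "hermosa beach"},
		{Label: LabelState, Value: "california"},
		{Label: LabelPostcode, Value: "90254"},
	}
	if postcode, ok := FirstValue(components, LabelPostcode); ok {
		fmt.Println("postcode:", postcode)
	}
}
```

Keeping the constants in one place means a label change after a libpostal upgrade needs a single edit, though it does not by itself detect that a change happened.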