openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

[German] single uppercase character after streetnames #259

Closed tobwen closed 7 years ago

tobwen commented 7 years ago

In the German street system, a street name can be followed by an additional single uppercase character. These are rare, but they exist, e.g. in the Ruhrgebiet (the metropolitan Ruhr area).

The scheme looks like this:

  1. Meyerstraße A 12
  2. Peterstraße A, B, C ... F 13

Right now, this parses correctly. Is it luck? Could it break in the future, or has this rule already been implemented correctly?

albarrentine commented 7 years ago

So there might be a slight misconception about how libpostal works. The parser is not rule-based. It uses machine learning (check out the two detailed blog posts in the README for more information), so ensuring that it can parse a particular example is less about implementing any particular parser rules and more about creating training examples that encourage the behavior we'd like to see. With machine learning we're essentially optimizing a function of the errors the parser makes, so if we have many examples of a particular pattern in the training data, it becomes costly for the parser to make that kind of mistake. It's a slightly different way of thinking than many software developers are accustomed to, and tends to give the developer a bit less control over the outcomes than, say, writing a regex that will deterministically do the same thing every time. However, the advantage is that a learning algorithm can literally examine every address on the planet and try to find the best set of weights so that it makes as few errors as possible on the examples it has seen.

Most of our data for Germany comes from OpenStreetMap, and Germany has better OSM coverage than almost anywhere else in the world, so it's very likely that most of the street names/addresses you're concerned about are in OSM (the examples above are generic, but if you know which city they're in, you can try looking them up). If those street names are not in OSM, feel free to add them. OSM is collaboratively editable by anyone (I'm personally a contributor), and adding examples there is probably the best way to ensure that they're handled by libpostal. Almost every improvement to OSM ends up having a positive effect on the parser.

For patterns that are not common in OSM, such as apartment numbers or some edge-case formats, we often generate the necessary patterns randomly and add them to some portion of the addresses.
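As a rough illustration of that last idea (a sketch only, not libpostal's actual training pipeline), the snippet below appends a randomly generated unit to some portion of the input addresses. The function name `add_random_units`, the unit phrases and the `fraction` parameter are all hypothetical.

```python
import random


def add_random_units(addresses, fraction=0.3, seed=0):
    """Append a random unit (e.g. "Apt 4") to roughly `fraction` of the addresses."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    unit_words = ["Apt", "Unit", "Suite"]  # hypothetical phrases, for illustration only
    augmented = []
    for address in addresses:
        if rng.random() < fraction:
            unit = f"{rng.choice(unit_words)} {rng.randint(1, 99)}"
            augmented.append(f"{address} {unit}")
        else:
            augmented.append(address)  # most addresses are left untouched
    return augmented
```

Because only a fraction of the examples is modified, a model trained on the result would see both the plain and the augmented form of the pattern.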

tobwen commented 7 years ago

I'm sad you're closing this issue without a discussion. This makes people think: hey, why should I support this project and why not spend time on another one? But okay - it's your project.

I'm an active developer and I use machine learning (= ML) for big-data analysis of geodata, text and imagery. Many programmers have the attitude that ML should only be used as a "last resort". After reading the code and the closed issues, libpostal actually does seem to have rules (regex etc.) for (additional) handling of special cases.

The case I've informed you about is a legitimate case of German street naming. That it doesn't appear on OpenStreetMap (= OSM) isn't a very good argument. The input of a generic library like libpostal (you always state that it's a generic library) shouldn't be limited to the data that OSM contains.

FYI: I've been involved in OpenStreetMap for about 10 years now and have lectured on this topic at Intergeo, FOSSGIS and universities.

albarrentine commented 7 years ago

Sorry, that seemed to me to be phrased as a question rather than an issue that needs to be resolved. Can reopen it if you prefer. I get many questions on here and, since Github does not have a built-in mechanism for asking questions, people usually use issues, and for the most part I simply close questions after answering them to keep open issues to a minimum. Having many open issues can weigh significantly on a maintainer's mind (and certainly on mine since I try to respond to everyone within a day when possible). Closing them is not meant to shut down discussion, and indeed there's still a thread. It simply means there were no action items required, and since issues serve as a high-level TODO list for this project, I've found that makes sense.

Glad you're involved in machine learning and OSM. Please also note that I get many users from different countries and industries with varying skillsets and language levels, so I usually assume little or no familiarity with ML or OSM for people new to the project and try to offer detailed explanations of what is very often an unfamiliar technology (even for people who've worked with ML in some capacity before, I cannot always assume familiarity with Conditional Random Fields, etc.).

It didn't seem from the above like libpostal is getting anything wrong. It was simply asking "is there some rule that guarantees it?" The answer is: no, there's no rule guaranteeing it, but if it were getting these cases wrong, it's always possible to add some street names to OSM to encourage it to learn to produce the right answer. It's a sort of roundabout way of "fixing" something, so it's usually helpful to provide some explanation of why it has to be done that way (and why it usually takes a bit longer to show up in master than fixing something deterministic like a regex).

Yes, there are a significant number of rules and cleanup operations we apply to the training data for libpostal (I do not think of ML as a last resort, but it is frequently the last step in a much larger pipeline), which I mentioned briefly above. For context, almost all of those cases are written in response to various issues that have come up (either from Github or my own experiments), and apart from cases that are systematically wrong because of some convention in OSM, etc. it's better in my opinion to let the source data stand on its own.

Libpostal is definitely not limited to what's contained in OSM (that would be simple memorization, and it would be nearly impossible to include e.g. every venue/business name since they change so frequently), or even the many other data sets from which we extract the training examples. We specifically model/simulate unknown words by thresholding rare words for inclusion in the vocabulary and using individual 4-gram and/or ideogram features to help with street/venue names that might be misspelled in the input or missing from the source data, especially in countries that are not as well-represented in OSM as Germany.
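To make the rare-word thresholding and 4-gram idea concrete, here is a minimal sketch. It is not libpostal's feature code: the `<UNK>` placeholder, the `min_count` threshold and the feature-name strings are assumptions chosen for illustration.

```python
from collections import Counter

UNKNOWN = "<UNK>"  # hypothetical placeholder for out-of-vocabulary words


def build_vocab(tokenized_addresses, min_count=2):
    """Keep only words seen at least `min_count` times; rarer words count as unknown."""
    counts = Counter(tok for tokens in tokenized_addresses for tok in tokens)
    return {tok for tok, n in counts.items() if n >= min_count}


def char_ngrams(word, n=4):
    """Return all contiguous character n-grams of `word` (the whole word if it is short)."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def token_features(token, vocab):
    """Known words get a word feature; unknown words fall back to their 4-grams."""
    token = token.lower()
    if token in vocab:
        return [f"word={token}"]
    return [f"word={UNKNOWN}"] + [f"ngram={g}" for g in char_ngrams(token)]
```

With features like these, an unseen or misspelled street name still shares sub-word features such as `ngram=raße` with street names that did occur in training.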

Our 99.45% accuracy is on held-out addresses that the model has not seen before, so clearly I'm not making any argument that "if it doesn't appear in OSM, it's not a legit case." This might be a miscommunication on my part. What I said was: if the parser were getting this type of case (street names) consistently wrong in the future, it's always possible to add them to OSM to encourage the right behavior. The street names submitted above did not seem to occur in OSM, nor did they come up in a Google search, possibly because there's no particular city attached, only a relatively wide geographic area. In general, it's much easier to work with specific examples that do exist (even if not necessarily in OSM) than examples of patterns that are not directly/easily searchable.

In the contributing guide for libpostal, I've tried to lay out a standard format for submitting parser issues which includes a specific input address, the expected output, some proactive steps like checking OSM (otherwise it's effectively asking me to do it) and some guidance for diagnosing different parser issues.

If this is indeed not a question, and there are action items, what were you hoping to see?

tobwen commented 7 years ago

I hope you don't misunderstand me. I love what you guys have built up here. Sorry if my comment was too harsh.

I have always designed my deep learning approaches in such a way that, for example, a regex collection comes out of it at the end. So I was able to "cache" the results or adjust the set of rules according to my needs.

With deep learning, you can get different results in a second run. The procedure is often a black box. My question has therefore focused on whether it is just coincidence that the above-mentioned special case has worked. The result was correct, but I can't estimate how "stable" it will be in future updates of the model.

I would therefore like to see rare special cases (which are allowed in the notation) included in the test cases. What do you think of that?

albarrentine commented 7 years ago

So yes, in the case of deep learning, models are often something of a black box, and things like random initialization of the weights in the network, random dropout, etc. make it difficult to reproduce results exactly across runs (for reproducibility it's often necessary to use a fixed, known seed for the random number generator). Libpostal does not use any neural networks at present, deep or otherwise. Our parser model is a linear, feature-based, zero-initialized Conditional Random Field which is interpretable through human-readable features (in the parser client there's a special command .print_features which will list the computed features for each token). If things go wrong, it's usually because of some ambiguity or omission that exists in the training data, can be diagnosed easily, and either a feature can be designed to disambiguate or the training data can be modified to handle that case.
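The general point about zero initialization and readable features can be shown with a much simpler model. The sketch below swaps in a binary perceptron for the CRF, so it is not libpostal's model, and the feature names are invented for illustration (they are not the output of .print_features). It only shows that a linear model starting from all-zero weights and trained on the same data in the same order produces the same weights on every run, and that each weight is attached to a named feature.

```python
from collections import defaultdict


def train_perceptron(examples, epochs=5):
    """examples: list of (feature_names, label) pairs with label +1 or -1."""
    weights = defaultdict(float)  # every weight implicitly starts at zero
    for _ in range(epochs):
        for features, label in examples:
            score = sum(weights[f] for f in features)
            predicted = 1 if score > 0 else -1
            if predicted != label:
                # Mistake-driven update: move each active feature toward the label.
                for f in features:
                    weights[f] += label
    return dict(weights)


def explain(weights, features):
    """List the active features with their weights, largest magnitude first."""
    pairs = [(f, weights.get(f, 0.0)) for f in features]
    return sorted(pairs, key=lambda pair: -abs(pair[1]))


# Hypothetical toy task: is a token part of the road (+1) or not (-1)?
examples = [
    (["is_single_uppercase_letter", "prev_word_ends_with=straße"], 1),
    (["is_numeric", "prev_is_single_uppercase_letter"], -1),
    (["is_numeric", "prev_word_ends_with=straße"], -1),
]
weights = train_perceptron(examples)
```

Calling `explain(weights, examples[0][0])` shows which named features pushed a decision one way or the other, which is the kind of inspection a neural network does not readily allow.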

Just realized that we have a government data set for Nordrhein-Westfalen in OpenAddresses, which is also part of the training data for libpostal, and that contains 4.2M addresses in the state, including over 200 examples where the street name is like the above:

street name             count
Egerländer Straße A     28
Waltemaths Feld C       20
Bartels Feld B          19
Europark Fichtenhain A  17
Bartels Feld D          16
Waltemaths Feld A       14
Waltemaths Feld B       13
Bartels Feld C          10
Bartels Feld A          9
Zementstraße A          8
E.-Schweitzer-Str. D    8
E.-Schweitzer-Str. C    8
Europark Fichtenhain B  6
E.-Schweitzer-Str. B    6
E.-Schweitzer-Str. A    6
Am Fort C               5
E.-Schweitzer-Str. F    4
E.-Schweitzer-Str. E    4
Bäckerkamp A            1

So having that many examples should pretty much function as a guarantee that a German street name followed by a single letter will be parsed as road.
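A count like the one in the table could be produced along these lines. This is a sketch under assumptions: the thread does not say how the numbers were actually computed, and the regular expression and the function name `count_letter_suffixed_streets` are hypothetical.

```python
import re
from collections import Counter

# A street name that ends in a space followed by exactly one uppercase ASCII letter.
LETTER_SUFFIX = re.compile(r".+ [A-Z]")


def count_letter_suffixed_streets(street_names):
    """Count occurrences of each street name that ends in a single uppercase letter."""
    return Counter(name for name in street_names if LETTER_SUFFIX.fullmatch(name))
```

Given the street field of every address record in a data set, `count_letter_suffixed_streets(streets).most_common()` would list the matching names from most to least frequent.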

As far as test cases go, it's probably fine to add one in this case, though in general I don't want to add too many obscure examples that will make the build fail if libpostal gets them wrong on a given run. This is not as much about the learning algorithm producing wildly different results as it is about new OSM runs, etc. The tests are there more to ensure that everything is functionally working (like not producing random labels because of a corrupt file or something), and that the examples we use in our documentation work as advertised.

Long-term functional testing of a periodically-retrained learning model is definitely not a solved problem, but it seems to me like it might make more sense to have a set of non-breaking test cases that can include the difficult, idiosyncratic cases around the world (everyone who works in geo has their 5-10 pet cases), then publish/track those results over time. That way everyone can track the cases they care about without the model needing to be 100% accurate on all of those cases in order to be published.
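One possible shape for such a non-breaking suite is sketched below. Nothing here exists in libpostal: `run_tracked_cases`, `summarize` and the case format are hypothetical. The `parse` argument is assumed to be any callable that returns (value, label) pairs for an address string, and `stub_parse` is only a stand-in so the sketch runs without a trained model.

```python
def run_tracked_cases(parse, cases):
    """Evaluate every case and record the outcome; a wrong parse never raises."""
    results = []
    for case in cases:
        parsed = {label: value for value, label in parse(case["input"])}
        passed = all(parsed.get(label) == value
                     for label, value in case["expected"].items())
        results.append({"input": case["input"], "passed": passed, "parsed": parsed})
    return results


def summarize(results):
    """Aggregate counts that could be published and compared across model versions."""
    passed = sum(1 for r in results if r["passed"])
    return {"total": len(results), "passed": passed, "failed": len(results) - passed}


def stub_parse(address):
    # Stand-in for a real parser: treats the last token as the house number.
    *road, number = address.lower().split()
    return [(" ".join(road), "road"), (number, "house_number")]


cases = [
    {"input": "Meyerstraße A 12",
     "expected": {"road": "meyerstraße a", "house_number": "12"}},
]
report = summarize(run_tracked_cases(stub_parse, cases))
```

Storing one such report per released model would let everyone follow their own cases over time without any single failure blocking a release.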