google / open-location-code

Open Location Code is a library to generate short codes, called "plus codes", that can be used as digital addresses where street addresses don't exist.
https://plus.codes
Apache License 2.0

Open Location Codes need an open database of anchors #461

Closed BruceFast closed 1 year ago

BruceFast commented 3 years ago

I really like OLC. It makes sense, and it is brief and printable. I am concerned that the what3words community will gain significant traction, even though their product is inferior in most ways.

I am currently developing an OLC app in Flutter. What is stymying me is the anchor system that Google Maps uses. It seems to draw relevant location names for OLC points in a manner that I, as a developer, cannot access. Further, as I read critiques of OLC, the proprietary nature of anchoring is recognized as a concern.

I believe that what is necessary is an open database of anchor points. First, a set of rules must be developed to determine the precise location of the anchor point. (Yes, I believe that it should be a "point", with at least a 10-digit OLC.) For a city, should the anchor point be city hall or the geographic center of the city? What of points of interest? And how do you control the entrepreneur who puts his insignificant shop forward as an anchor point?

Additionally, others have suggested a "+ -" model, where the anchor is encoded after the "-" sign. This makes a lot of sense to me. Every anchor could have a long or short form. (It may be desirable to have the first character or two represent the country. Other rules may apply so that primary points have fewer letters than less-used points.)

It appears to me that there must be a certain discipline when editing. For instance, everyone who can edit should have an account. All edits should be marked with who made them. A quick decision should be able to erase any "this guy is a quack" activity.

bocops commented 3 years ago

Hi @BruceFast.

As you've already noticed, this is a topic that pops up regularly. You already touched on #398 (the "PLUS-MINUS" suggestion); probably the longest discussion about it was in #343, but there's also adjacent stuff like #352.

Overall, I'd agree that an open database would be nice to have - I'm just still a bit concerned about its feasibility. For example, I believe that doing it the way you outlined in your third paragraph would not work. We can't just choose one exact location per city, because that doesn't help us with shortening an arbitrary code. If we have city centers A and B, and a full code that represents a location exactly at the midpoint of the line from A to B, which one do we use as the reference location when shortening the code? Even if we're trying to shorten some other full code along that line but not at the midpoint, we can't be sure that the closer city center is the best fit.

Basically, for shortening to always work without any error, we always need access to some form of reverse geocoding. If we don't want that, or don't have it, and can accept some minor errors in the form of reference locations that technically work but are not strictly correct, then the following might be appropriate:

  1. We typically want to lose four code digits when shortening: "9F4MG98H+G3" --> "G98H+G3, $REFERENCE_LOCATION"
  2. To pick the proper reference location, have a database containing the best fit for all six-digit prefixes: "9F4MG9:Berlin" (see the sketch after this list)
  3. If we want to get fancy, each of these entries could optionally contain multiple reference locations with criteria such as "highest population", "largest area", "alternate jurisdiction", ...
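
To make that lookup concrete, here's a minimal sketch in plain Python (the one-entry table and the function name are made up for illustration; none of this is part of any OLC library):

```python
# Hypothetical prefix table: "best fit" reference location per six-digit prefix.
PREFIX_TABLE = {
    "9F4MG9": "Berlin",
}

def shorten_with_prefix_table(full_code: str) -> str:
    """Drop the first four digits and append the reference location, if one is known."""
    name = PREFIX_TABLE.get(full_code[:6])   # e.g. "9F4MG9" from "9F4MG98H+G3"
    if name is None:
        return full_code                     # no entry: keep the full code
    return f"{full_code[4:]}, {name}"        # "G98H+G3, Berlin"

print(shorten_with_prefix_table("9F4MG98H+G3"))  # G98H+G3, Berlin
```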

This database would need to have 9 × 18 × 20⁴ ≈ 26 million entries, so getting it off the ground would be a huge, although not completely impossible, undertaking. Last but not least:

> It appears to me that there must be a certain discipline when editing. For instance, everyone who can edit should have an account. All edits should be marked with who made them. A quick decision should be able to erase any "this guy is a quack" activity.

This seems to describe a GitHub repository exactly. Only people with accounts can edit or send pull requests, and "blame" allows identifying who added some line. The question is, do we have enough people willing to participate? I'd be up for it, but with no one else contributing, this will become yet another stale repo soon. :)

BruceFast commented 3 years ago

Thanks for your thoughtful response. I haven't yet had time to read the threads you pointed to. I have had time to digest some of your other thoughts.

First, let me address the point-vs-rectangle issue. Every point would define a rectangle, or in fact multiple rectangles. Each rectangle is ±1/2 of the relevant range. If a + code is provided with 6 digits and a given anchor, it would mean the one unique 6-digit + code that lies within ±0.025 degrees of the anchor in both latitude and longitude. (There is only one.) For a 4-digit + code, it would be ±0.00125 degrees, and for 8-digit codes, it would be ±0.5 degrees.

Next, let me address the need for a huge database of "correct" anchors for every location. I think this is unnecessary. If two anchors have overlapping regions at a given resolution (4-, 6-, or 8-digit), then a plus code that inhabits either region could correctly be encoded with either. So instead of "9F4M+G9:Berlin", one could encode "9F4M+G9:Schonefeld". Both would get you to the same location. Location + anchor simply means the code that is nearest to the provided anchor's center.
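
To illustrate with the Python implementation from this repository (the coordinates below are rough stand-ins for central Berlin and Schönefeld, chosen only for the example):

```python
# pip install openlocationcode
from openlocationcode import openlocationcode as olc

short_code = "G98H+G3"           # "9F4MG98H+G3" with the first four digits dropped

berlin_centre = (52.52, 13.41)   # approximate
schoenefeld   = (52.39, 13.51)   # approximate

# Both anchors are well within half a degree of the encoded location,
# so both recover the same full code.
print(olc.recoverNearest(short_code, *berlin_centre))  # 9F4MG98H+G3
print(olc.recoverNearest(short_code, *schoenefeld))    # 9F4MG98H+G3
```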

Last is the parallel between the called-for database security and GitHub's security. I admit that I am green as grass when it comes to GitHub, but it appears to me that GitHub itself could hold the repository. If GitHub could provide the physical database as well as the security, that would be fantastic. Further, GitHub's openness and lack of possessiveness are exactly what is called for in such a database.

BruceFast commented 3 years ago

Bocops, I perused thread #343. The biggest concern raised in that thread seems to be offline access to the data. The second seems to be the sense of ownership, or "magic", on Google's part regarding anchor addresses.

I propose a database with the following structure (see the sketch below):

  - Unique Id
  - Anchor Name
  - Parent Id: this would naturally generate a tree structure
  - Short Name: ideally short, upper case, no spaces; short names should have structure and discipline
  - Prominence: initially I propose prominence to be primary, secondary or tertiary
  - Date of definition
  - Date of most recent change
  - Date of most recent downstream addition: with this date, an offline database can quickly determine its validity
  - Date of most recent downstream edit: as above, but edits can produce invalid anchors, where additions would not

(A separate database, which would not normally need to be accessed, would contain the change logs.)
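
As a rough sketch of one such record (the field names, and the addition of a plus code field for the anchor point itself, are mine and only illustrative, not a settled schema):

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Anchor:
    """One row of the proposed anchor database (hypothetical field names)."""
    unique_id: int
    anchor_name: str
    parent_id: Optional[int]   # None for the root entry; this gives the tree structure
    short_name: str            # ideally short, upper case, no spaces
    prominence: str            # "primary", "secondary" or "tertiary"
    plus_code: str = ""        # the anchor point itself, ideally a 10-digit code
    defined_on: date = field(default_factory=date.today)
    last_changed_on: date = field(default_factory=date.today)
    # With the two dates below, an offline copy can quickly determine its validity;
    # edits, unlike additions, can invalidate anchors.
    last_downstream_addition_on: date = field(default_factory=date.today)
    last_downstream_edit_on: date = field(default_factory=date.today)
```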

BruceFast commented 3 years ago

Let me examine what this would mean from my vantage point: Whitehorse, YT, Canada (94G6QV67+HF). Yukon is a wilderness tourism mecca. We have a shockingly low population base and a gazillion points of interest. There are 3 primary communities in Yukon (communities big enough to have their own hospital). I would see the relevant database looking something like this:

World: (id = 0), Short name: "", Prominence: primary

Canada: (id = 1, parent id = 0), Short name: "C", Prominence: primary. (Every other country in the world would also have a parent id of 0. Countries should get one- or two-character short names. One might propose a threshold where small countries are dubbed secondary; this would allow for a shortening of the list of countries.)

Yukon: (id = 2, parent id = 1), Short name: "CY", Prominence: primary

Whitehorse: (id = 3, parent id = 2), Short name: "CYWH", Prominence: primary

A database like this could be quickly pruned and held offline. If, for instance, I were planning to work in the Yukon, it would be quite reasonable to download all of the Yukon's data. I might choose not to download "tertiary" anchors.
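
As a self-contained sketch of both the records above and that pruning step (the structure is a stripped-down version of the proposed table; ids and names follow the example):

```python
# The chain above as plain records (a stripped-down version of the proposed table).
anchors = {
    0: {"name": "World",      "parent": None, "short": "",     "prominence": "primary"},
    1: {"name": "Canada",     "parent": 0,    "short": "C",    "prominence": "primary"},
    2: {"name": "Yukon",      "parent": 1,    "short": "CY",   "prominence": "primary"},
    3: {"name": "Whitehorse", "parent": 2,    "short": "CYWH", "prominence": "primary",
        "plus_code": "94G6QV67+HF"},
}

def is_under(anchor_id, root_id):
    """True if an anchor is the given root or sits anywhere beneath it."""
    while anchor_id is not None:
        if anchor_id == root_id:
            return True
        anchor_id = anchors[anchor_id]["parent"]
    return False

# Offline pruning: keep only Yukon (id 2) and everything treed under it.
yukon_subset = [a["name"] for i, a in anchors.items() if is_under(i, 2)]
print(yukon_subset)  # ['Yukon', 'Whitehorse']
```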

Now let's look at a "tertiary" anchor as presented in Google Maps. Ask Google Maps for "2V47+W9 Morley River, Yukon" and it will come up with a "rest stop". Go to earth view and zoom in: nothing but forest. There is a place to pull your vehicle off the road and use an outhouse. Some decades ago there was also a small pit stop and motel just up the road, but it's long abandoned. "Morley River" would surely be classified as tertiary. But who knows that? Some guy at Google Maps? No. Locals know that. For Yukon, I would be reasonably qualified to determine which communities are primary (there are 3, maybe 5), secondary (there are about 20), or tertiary (one could probably find 200 without significant overlap).

Further, as I think of + codes for use by delivery services (think FedEx, pizza and Chinese food), I want to work with as few digits as I can get away with. As is fairly common, Whitehorse is broken up into about 8 communities. Everybody running a delivery service in Whitehorse knows where Porter Creek or Granger is. So for such a service, I should be able to report my + code with, usually, 4 digits plus my community name as an anchor. I would see such communities treed under Whitehorse and given secondary status.
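
Assuming the Python library from this repository, a sketch of that flow with an illustrative downtown Whitehorse code and a made-up community anchor a few hundred metres away (neither coordinate is a real community anchor):

```python
from openlocationcode import openlocationcode as olc

full_code = "94G6PWCW+8X"                 # illustrative 10-digit code in Whitehorse
community_anchor = (60.7218, -135.0548)   # hypothetical nearby community anchor

# With the anchor within roughly 0.015 degrees of the code, only four digits remain.
short_code = olc.shorten(full_code, *community_anchor)
print(short_code)                                         # CW+8X
print(olc.recoverNearest(short_code, *community_anchor))  # 94G6PWCW+8X
```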

I think that with this treed model, with dates of most recent changes, and with prioritization, a public database of anchor points could be easily developed. It could be maintained and populated by local people, placing a major burden on nobody.

I think that to start, the database could be populated with every airport code. Such databases, with lat/long, are surely readily available. It would be a jumping-off point for the public database.

bocops commented 3 years ago

Regarding the matter of point vs. rectangle, the problem I see is not the exact data structure, but rather the size and shape of the areas in comparison to "ground truth", whether we store them implicitly (e.g. as strings that are plus code prefixes) or explicitly (as lat/long/distance).

Take Berlin, Germany as an example (it works quite well, because the borders of German city states are shown on the map as dotted lines). If you look at the "9F4M rectangle", you will find that all of Berlin is located inside that rectangle. This means that all locations in Berlin have a plus code starting with "9F4M", but also that the string "Berlin" should technically[1] work to shorten all locations inside that rectangle. However, if we try, we quickly run into problems:

  1. "VVVV+VV Berlin" is not found at all
  2. "WWWW+WW Berlin" is recovered to a completely different location in Berlin
  3. "XXXX+XX Berlin" (just like "2222+22 Berlin") is recovered to a location near Berlin Lake in the US.

While 2 might be an unrelated bug, 1 and 3 clearly fail because just dropping letters from a full code and replacing them with a random city name in that area isn't how shortening actually works. So, we might want to store bounding boxes instead - center coordinates plus offset, or two corner coordinates. The problem then is that cities are typically not rectangular, so a bounding box will contain locations that are not in the city.
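
For what it's worth, the reference implementations behave accordingly. A sketch using the Python library and approximate central-Berlin coordinates: shorten() simply refuses to drop the prefix of a code near the far corner of the "9F4M" cell, and recovering a hand-made short form snaps to the nearest match rather than back to that corner:

```python
from openlocationcode import openlocationcode as olc

berlin_centre = (52.52, 13.41)   # approximate
corner_code = "9F4MXXXX+XX"      # near the north-east corner of the 9F4M cell

# The code is roughly 0.6 degrees of longitude from central Berlin, outside the
# safety margin, so no digits are removed.
print(olc.shorten(corner_code, *berlin_centre))       # 9F4MXXXX+XX

# Recovering the hand-made short form returns the nearest match, which lies in
# the neighbouring one-degree cell rather than at the 9F4M corner.
print(olc.recoverNearest("XXXX+XX", *berlin_centre))  # 9F4JXXXX+XX (approx.)
```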

While this might not be a big problem in cases like "Berlin" as the largest city in its area, it might be problematic for cities along state or even national borders. Hopping over to your side of the pond, large cities like Toronto or Detroit are very close to the US/Canadian border, so their bounding box will likely contain parts of the other country. While "73G4+G7 Detroit" resolves to the correct location, it is at least weird to describe an international airport in Canada by using the name of a city in the US.

There's a whole bunch of heuristics we could throw at that problem - like, for example, favoring nearby over more distant anchor points, or weighting them by their bounding box size, by the population they represent, or by some other TBD metric - but in the end, we would always end up with a line connecting neighboring anchor points A and B, and another line perpendicular to it indicating whether we want to use A or B as the reference location when shortening plus codes along that line. What this boils down to is a Voronoi diagram of Earth's surface, which might still be wrong along the edges unless we're packing it really densely along all important borders.

In the end, we would probably end up with about equally many entries in a "bounding box database" like this as in an "up to 6-digit plus code prefix" database, and with similar problems as well.
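
A degenerate version of that heuristic, "pick the nearest anchor and shorten against it", is easy to sketch; the anchor table below is made up, and the distance is a crude planar one, which is exactly the kind of shortcut that goes wrong near the edges:

```python
from openlocationcode import openlocationcode as olc

# Made-up anchor points; a real table would have to be much denser near borders.
ANCHORS = {
    "Berlin":   (52.52, 13.41),
    "Potsdam":  (52.40, 13.05),
    "Schwerin": (53.63, 11.41),
}

def nearest_anchor(lat, lng):
    """Voronoi-style pick: the anchor with the smallest (planar) distance."""
    return min(ANCHORS, key=lambda n: (ANCHORS[n][0] - lat) ** 2 + (ANCHORS[n][1] - lng) ** 2)

def shorten_with_nearest(full_code):
    area = olc.decode(full_code)
    name = nearest_anchor(area.latitudeCenter, area.longitudeCenter)
    short = olc.shorten(full_code, *ANCHORS[name])
    return f"{short}, {name}" if short != full_code else full_code

print(shorten_with_nearest("9F4MG98H+G3"))  # G98H+G3, Berlin
```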

Regarding some of the other things you bring up, namely "short names" and "local delivery", I wonder how necessary the suggested database would be for either case in the first place.

"PWCW+8X, CYWH" has the problem of not being recognized by Google Maps or probably any other maps or geolocation service. It is also not shorter than just using the full plus code to begin with.

Local delivery services, on the other hand, don't even need a globally unique address to service you. If you order pizza and tell them to deliver to "PWCW+8X", they would need to have a delivery radius of half a degree or more before that location is no longer unique for them. Even then, that radius could easily be extended to a full degree or more by just clarifying via shared knowledge like "PWCW+8X in Whitehorse".


[1] I say "technically", because there's a Wiki page suggesting that feature names should be used within a rectangle of 0.8° × 0.8° around its center location: https://github.com/google/open-location-code/wiki/Guidance-for-shortening-codes

BruceFast commented 3 years ago

Interesting discussion.

You said, "1 and 3 clearly fail because ... [that] isn't how shortening actually works." The problem I see is that how anchoring is currently implemented (in google maps) doesn't work! More accurately, anchoring is not clearly defined at all. As I develop a plus code app, that becomes blatantly obvious. The fact that "Berlin" could mean two different places, on two different continents is simply ludicrous.

OLC, at its core, is simple, structured, and reasonable (it can be reasoned about). Anchor points should be the same. Latitude and longitude have never acknowledged international boundaries; I fail to see why + codes should care.

Of course, there is another concern with the concept of defining the minimum rectangle that completely contains a city: city boundaries are, well, arbitrary. They are not particularly respected by residents and businesses alike. I don't know about Berlin, but I do know about Whitehorse. There is a rather large catchment area around Whitehorse, and the people who live in that catchment area still see themselves as living in Whitehorse. Why should they be prohibited from using Whitehorse as an anchor for their + code?

Now for your point that "PWCW+8X, CYWH" is also not shorter than just using the full plus code. Firstly, CYWH is certainly shorter than "Whitehorse, YT, Canada". My current focus is on the tourism industry. Consider a local publication that connects tourists with pretty much every business in town. It is ubiquitous, and is just referred to by locals as "the Whitehorse book". If a person visits Whitehorse and uses my software, they select CYWH or "Whitehorse, YT, Canada" in the "anchor" window. From then on, all + codes are interpreted as being anchored to that location. From the publication's perspective, they would publish short-form + codes for every location in the book; they would only need to publish the anchor point once in the book (or possibly once per page). If a person is taking notes about their travels, they may prefer to use CYWH rather than convert the short code that they have read to its long equivalent.

As for, ""PWCW+8X, CYWH" has the problem of not being recognized by Google Maps or probably any other maps or geolocation service.", that is a bit of a problem. OLC is an open standard, published under the Apache license and given to the world. Anchoring is closed, and literally mystical as you have so well demonstrated. Further, OLC is clearly in its infancy. Very few geolocation services support it. It has the potential to be about 50 million times more popular than it is. I dare to suggest that one reason OLC has been so slow to be taken up by the world is that anchoring is, well, closed and mystical. This needs to change. At the current stage of acceptance of OLC, making the necessary change is very realistic.

bocops commented 3 years ago

I think we need to be clear about three different processes and where/why/how each of them does or doesn't fail. :)

The first one is the Open Location Code algorithm itself. This is what this repository is mostly about, but although you state that "anchoring is not clearly defined at all", I believe that, at least as far as OLC itself is concerned, reference locations are a relatively clear case: whenever you have a plus code and a reference location, you can hand over both to some shorten() function, which determines whether or not (and by how much) the plus code can be shortened with respect to the reference location. How you then communicate this reference location to other parties involved is up to you - and if you communicate the reference location as anything other than a lat/long pair, it is up to you, and not part of the OLC specification, to make sure that the resolved reference location is the same as (or at least "similar enough" to) the one originally used to shorten the code.

The second one is Google's own use of plus codes in their various products, most prominently Google Maps. Whenever the short form of a plus code is displayed there, it is up to them to make sure that whatever string is used for the reference location can be turned back into proper coordinates, so that recovering the full plus code works. As far as I can tell, this does typically work - the examples I gave above using "Berlin" as a reference location failed not because they are not unique, but because they were used in a self-created short plus code that would not have been generated by Google Maps in the first place.

For what it's worth, an accessible Google Maps API exists. You may have to pay something, and may not do everything with whatever data you receive - but it's out there to use.

Regarding this second process, there has been much discussion about whether that is proper or improper (as some sort of "embrace, extend and extinguish" tactic) use of plus codes - and while we can all have some opinion about that, I'd say that none of that is really an issue with this repository of OLC implementations.

The third process is the way we use plus codes ourselves, whether that is privately or as a developer - and how we deal with the fact that whatever happens on Google Maps is the de facto standard even if we don't like the exact implementation. I have tried using plus codes myself in some apps, and there might be better workarounds for your use-case than having people select an arbitrary 4-letter code from a drop-down. If you're interested in discussing this further, I think the best place to do that would be the Plus Codes Community Forum instead of this issue. I'd be happy to join the discussion with some ideas.

bilst commented 1 year ago

From the (newly updated) https://github.com/google/open-location-code/wiki/FAQ#reference-location-dataset: "The open source libraries support conversion to/from addresses using the latlng of the reference location. Callers will need to convert place names to/from latlng using a geocoding system."

Providing a global dataset isn't within the scope of this project. For a potential free alternative, see OpenStreetMap and the derived geocoding service Nominatim.