osm-without-borders / cosmogony

easy to use & easy to update geographic regions
http://cosmogony.world
Apache License 2.0

Ontology starting point #2

Open antoine-de opened 6 years ago

antoine-de commented 6 years ago

Here's the issue to start the discussion about the schema of our zones "hierarchy".

The aim of this issue is to fill in the corresponding section of the README.

here are my non structured thoughts:

categories

I like the libpostal categories. libpostal is quite a reference in the address parsing world, and we can hope its categories cope with country specificities all around the world, but I don't think they handle all the corner cases (and it's not the only categorization out there; for example, Wof uses another).

libpostal does not handle non-administrative regions apart from the suburb (and maybe the country_region). So it would be difficult to represent Marne-la-Vallée or the Parc du Mercantour.

There is also the question of postal codes. I don't know whether we could/should have postal code zones in the hierarchy (should we create a separate issue for this?)

Pyramidal hierarchy or graph-based ?

Can a zone have at most one parent, or can it have several?

I feel that it might be a failing of Wof to have a pyramidal hierarchy. I don't think it will complicate cosmogony that much to allow several parents. I don't think it's useful for purely administrative regions (but maybe there are countries where it's relevant), but for non-administrative regions I think a pyramidal hierarchy will be too restrictive.

E.g. what would we link Marne-la-Vallée to? Île-de-France? But then it would be difficult to link it back to the cities that are part of it. The same applies to unofficial suburbs that can span several districts.

links coherence

The Wof hierarchy is nice, but being linked to all parents brings incoherence (like the French empire that contains the country France, yet the empire has fewer descendants than the country). I feel that outputting only the first level of relationship forces the dataset to be coherent (even if it makes the dataset harder to use without tools).

antoine-de commented 6 years ago

more thoughts on the subject:

I feel that a rigid tree hierarchy works well for the administrative regions.

A suburb can have at most one city_district (or none, and be linked directly to a city), and the same goes for the chain city -> state_district -> state -> country_region -> country.

The categories are optional (maybe apart from the country), and we can imagine places where the children of a country are heterogeneous (cities, states, ...).
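To make the "at most one parent, levels may be skipped" rule concrete, here is a minimal sketch in Python. The level names come from the comment above; the `zones` layout (id mapped to a `(type, parent_id)` pair) and the function names are assumptions made for illustration, not cosmogony's actual data model.

```python
# Administrative levels from the comment above, ordered from smallest to largest.
ADMIN_LEVELS = [
    "suburb", "city_district", "city",
    "state_district", "state", "country_region", "country",
]
RANK = {name: rank for rank, name in enumerate(ADMIN_LEVELS)}


def is_valid_parent(child_type, parent_type):
    """A parent must sit strictly higher in the chain; skipping levels is allowed."""
    return RANK[parent_type] > RANK[child_type]


def check_tree(zones):
    """Return the ids of zones whose single parent breaks the level ordering.

    `zones` is a hypothetical layout: id -> (zone_type, parent_id or None).
    Storing one parent id per zone is what makes the structure a tree.
    """
    errors = []
    for zone_id, (zone_type, parent_id) in zones.items():
        if parent_id is None:
            continue  # root zone (e.g. a country)
        parent_type = zones[parent_id][0]
        if not is_valid_parent(zone_type, parent_type):
            errors.append(zone_id)
    return errors
```

With this layout a suburb attached directly to a city passes the check, while a state attached to a city is reported.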

This model does not however work for at least 2 examples (feel free to add counter examples, I'm sure there are more):

A non-administrative region that groups other administrative regions

e.g. Marne-la-Vallée, which is a group of cities in France. It has no administrative meaning, but is well known by locals.

A non-administrative region that intersects many admins

e.g. Le Marais, a French tourist area that spans parts of 2 Paris districts. e.g. La Défense, near Paris, spans parts of 5 cities. It can also happen with unofficial neighborhoods that cross several districts.

Neither the non-administrative zone nor the administrative zone contains the other; they just intersect.

Rough idea on how to handle those

I think it's nice for any zone (administrative or not) to have administrative parents, as it's helpful to know that Marne-la-Vallée is part of Île-de-France and thus part of France.

One idea would be, for any zone to have:

As a starting point, I think we can use the same algorithm for both relationships: a zone needs to be entirely included in an administrative region's shape to be part of it, and it needs to be entirely included in a non-administrative shape to be soft linked to it.

What does that mean?

Marne-la-Vallée

All the cities of Marne-la-Vallée are soft linked to it but are part of Seine-et-Marne.

Le marais

There is no soft link between Le Marais and either the 3rd or the 4th Paris district, because there is no inclusion. Le Marais is just part of Paris.
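The inclusion rule and the two examples above can be sketched as follows. This is only an illustration under loud assumptions: shapes are modelled as sets of grid cells instead of real polygons, "city A" is a made-up commune, and the `Zone` class and `containing_zones` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    name: str
    administrative: bool
    cells: frozenset  # toy stand-in for a polygon: the grid cells it covers


def containing_zones(zone, zones):
    """Split the zones that entirely include `zone` into (admin, non_admin) names.

    Administrative containers are the "part of" parents; non-administrative
    containers are the soft links. Mere intersection creates no link at all.
    """
    admin, non_admin = [], []
    for other in zones:
        if other.name != zone.name and zone.cells <= other.cells:
            (admin if other.administrative else non_admin).append(other.name)
    return admin, non_admin


# Toy shapes (assumptions, not real geometries).
paris = Zone("Paris", True, frozenset(range(0, 10)))
district_3 = Zone("Paris 3rd", True, frozenset({0, 1, 2}))
district_4 = Zone("Paris 4th", True, frozenset({3, 4, 5}))
le_marais = Zone("Le Marais", False, frozenset({2, 3}))

seine_et_marne = Zone("Seine-et-Marne", True, frozenset(range(10, 20)))
city_a = Zone("city A", True, frozenset({10, 11}))
marne_la_vallee = Zone("Marne-la-Vallée", False, frozenset({10, 11, 12}))

ZONES = [paris, district_3, district_4, le_marais,
         seine_et_marne, city_a, marne_la_vallee]

print(containing_zones(city_a, ZONES))     # (['Seine-et-Marne'], ['Marne-la-Vallée'])
print(containing_zones(le_marais, ZONES))  # (['Paris'], [])
```

With real data the subset test would be replaced by a polygon containment test, probably with some tolerance for simplified borders.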

implication for the use cases

attaching zones to a point

To attach zones to a point (as we need to do in a geocoder), we'll search for all leaf zones that contain the point.

As a first implementation, we can even just search for all zones that contain the point and keep only the leaves (so the lowest-level admin + all unrelated non-administrative zones).
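A rough sketch of that first implementation, under the same toy assumptions as the rest of this thread: shapes are sets of grid cells, a point is a single cell, and the `ZONES` layout and function name are hypothetical.

```python
def leaf_zones_for_point(point, zones):
    """Return the ids of the leaf zones containing `point`, sorted by id.

    Step 1: keep every zone whose (toy) shape contains the point.
    Step 2: drop every zone that is the parent of another matched zone.
    With strict inclusion, all ancestors of a matched zone are matched too,
    so removing direct parents is enough to remove every non-leaf.
    """
    matched = {zid for zid, zone in zones.items() if point in zone["cells"]}
    parents_of_matched = {p for zid in matched for p in zones[zid]["parents"]}
    return sorted(matched - parents_of_matched)


# Toy data (assumptions): cells stand in for real polygons.
ZONES = {
    "france": {"cells": set(range(100)), "parents": []},
    "paris": {"cells": set(range(10)), "parents": ["france"]},
    "paris_3rd": {"cells": {0, 1, 2}, "parents": ["paris"]},
    "paris_4th": {"cells": {3, 4, 5}, "parents": ["paris"]},
    "le_marais": {"cells": {2, 3}, "parents": ["paris"]},
}

print(leaf_zones_for_point(3, ZONES))  # ['le_marais', 'paris_4th']
```

A point in the 4th district that also falls inside Le Marais gets both zones, since neither contains the other.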

finding the most meaningful zone for a point

I don't know :wink:

limitations

Tristramg commented 6 years ago

Soft links

I like the idea to have a strong representation where everybody can hang onto, and something less organized for local specific things.

Post codes

A post code could be a soft zone, right? If we ever get a shape of those (that might be a problem in many places of the world) then you could be precise, otherwise a list should be sufficient.

Local knowledge

As you mentioned, le Marais is better known by the locals than the administrative quarters.

The problem here is that the border is not well defined but rather fuzzy.

An attribute on the zone could help here, even if we don’t know how to use the data (or represent it on OpenStreetMap).

Broad international agreements

It would also be nice to be able to represent the Schengen zone.

External knowledge

The airport of Paris is not in Paris, and tourists think that the Château de Versailles is in Paris.

This would mean that the ontology has a notion of who’s asking?

Can it be a tree?

With strict inclusions, we will have a tree :palm_tree:, which is nice.

The obvious

The lowest level must be variable. The commune Paris is divided into arrondissements, and each arrondissement is divided into quartiers administratifs. For most French communes, the commune itself is the lowest subdivision.

Useless trivia: Google believes there is a quarter :banana: in Paris. It is unknown to the inhabitants and not an official one either https://encrypted.google.com/search?hl=en&q=quartiers%20la%20banane%20paris

The easy

When going from the lowest to the highest, the system needs to have holes. For instance the commune Nantes belongs to the Métropole Nantes, but not every commune belongs to a Métropole. I don’t think it is a problem.
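A small sketch of why holes are not a problem when walking up the tree: the lookup simply returns nothing when a level is absent. The type labels (`commune`, `metropole`, `departement`), the `other_commune` entry, and the function name are assumptions chosen for this example.

```python
def ancestor_of_type(zone_id, wanted_type, zones):
    """Walk up the single-parent chain and return the first ancestor of
    `wanted_type`, or None when that level does not exist for this zone."""
    current = zones[zone_id]["parent"]
    while current is not None:
        if zones[current]["type"] == wanted_type:
            return current
        current = zones[current]["parent"]
    return None


# Toy hierarchy (assumption): one commune inside a métropole, one without.
ZONES = {
    "loire_atlantique": {"type": "departement", "parent": None},
    "nantes_metropole": {"type": "metropole", "parent": "loire_atlantique"},
    "nantes": {"type": "commune", "parent": "nantes_metropole"},
    "other_commune": {"type": "commune", "parent": "loire_atlantique"},
}

print(ancestor_of_type("nantes", "metropole", ZONES))         # nantes_metropole
print(ancestor_of_type("other_commune", "metropole", ZONES))  # None
```

Both communes still resolve to the same département, so consumers only need to tolerate a missing level.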

Useless trivia : this island is under direct authority of a ministry, with no intermediate administration https://en.wikipedia.org/wiki/Clipperton_Island

The challenge

Åland belongs to Finland. Finland belongs to the European Union and its VAT area, yet Åland is outside the EU VAT area (yay! cheap booze :champagne:)

For the vast majority of situations, this can be ignored. Maybe it could be handled with an explicit exception once the big work is done.

There seem to be surprisingly few situations of that kind https://en.wikipedia.org/wiki/Dependent_territory

The pain

A territory can be under the sovereignty of two countries, like https://en.wikipedia.org/wiki/Pheasant_Island

Ok. I think we can ignore this one.

poudro commented 6 years ago

If I might be so bold, the 🍌 area actually does exist and is known to at least some inhabitants.

This could be a typical example of how different people view the same area differently.

Tristramg commented 6 years ago

Indeed, I should not take my ignorance as a general rule. I would be curious to know where the Google data comes from.

poudro commented 6 years ago

Regarding the tree structure, I'm not sure it works. I think it can be a DAG, though.

Take postal codes for example. In France you will have, potentially, several communes to a single postcode, but in the UK, many cities have more than one postcode. If you want to handle this worldwide, you need a separate branch for postcodes from that of admin, imo.
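To illustrate the DAG idea, here is a minimal sketch where a zone may have several parents and postcodes form their own branch next to the admin one. All zone ids are made up, the `parents` mapping is a hypothetical layout, and `graphlib` is the Python standard library module (3.9+), used here only to detect cycles.

```python
from graphlib import CycleError, TopologicalSorter


def is_dag(parents):
    """True when the zone -> parents mapping contains no cycle."""
    try:
        # static_order() raises CycleError while iterating if a cycle exists.
        tuple(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False


def ancestors(zone, parents):
    """All zones reachable by following parent links, across every branch."""
    seen = set()
    stack = list(parents.get(zone, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(parents.get(parent, ()))
    return seen


# Toy data (assumption): two communes sharing one postcode, French style.
PARENTS = {
    "commune_a": ["departement_x", "postcode_1"],
    "commune_b": ["departement_x", "postcode_1"],
    "departement_x": ["country"],
    "postcode_1": ["country"],
}

print(is_dag(PARENTS))                        # True
print(sorted(ancestors("commune_a", PARENTS)))
# ['country', 'departement_x', 'postcode_1']
```

The UK-style case would flip the direction for postcodes (a city as the parent of several postcode zones), which the same structure supports.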

antoine-de commented 6 years ago

hum for postal codes, don't you think soft links (so outside the official hierarchy) would be enough ?

You're right, the tree Vs DAG is really an important question, we really need to think about this carefully

Tristramg commented 6 years ago

Some thoughts concerning wikidata.

It is a database closely linked to Wikipedia defining semantic relations between objects.

The licence is CC0, so that won’t be a problem.

With OSM as the geographic leg and Wikidata as the semantic one, Cosmogony should be able to have all the information it needs.

Stable ID

First obvious benefit: the ID will probably be much more stable than OSM elements or even Wikipedia pages.

It handles the historization of elements, meaning that an ID will not be recycled for a new object (e.g. two communes that merge).

Paris will always be Q90.

The Wikidata ID should be in the OSM object tags. We should run a batch to get an order of magnitude of the number of admin objects without a Wikidata ID.
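Such a batch could look like the sketch below. It assumes the OSM objects have already been read into plain tag dictionaries by some other tool (the reading step is left out), and relies only on the `boundary=administrative` and `wikidata=*` tags; the function name and sample data are hypothetical.

```python
def wikidata_coverage(objects):
    """Count administrative boundaries and how many lack a `wikidata` tag.

    `objects` is an iterable of OSM tag dictionaries (assumed input format).
    Returns (number_of_admin_objects, number_without_wikidata).
    """
    admin_total = 0
    missing = 0
    for tags in objects:
        if tags.get("boundary") != "administrative":
            continue  # not an admin boundary, ignore
        admin_total += 1
        if not tags.get("wikidata"):
            missing += 1
    return admin_total, missing


# Made-up sample: two admin boundaries, one of them without a Wikidata ID.
SAMPLE = [
    {"boundary": "administrative", "name": "Paris", "wikidata": "Q90"},
    {"boundary": "administrative", "name": "Some commune"},
    {"highway": "residential", "name": "Some street"},
]

total, missing = wikidata_coverage(SAMPLE)
print(f"{missing}/{total} admin objects without a wikidata tag")
```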

Higher confidence when building the hierarchy

There is already a hierarchy, through the property P131, which indicates that an item belongs to a larger zone.

This could avoid some wrong hierarchies that would be only detected through geographical inclusion (simplified borders, weird enclaves…)

Contribute to a good database

Wikidata has a good chance of staying alive over time. Any hand-made fix will therefore stay there for good and will help improve the commons.

This will reduce the need for ad hoc databases.

Manipulating the data

The dump is 20 GB. This will be a problem for someone working on a small territory.

My guess is that it will be very easy to generate a subset that focuses on the admin regions.

amatissart commented 6 years ago

I also had a look at wikidata as a potential source of information to build the hierarchy. I agree stable IDs would be useful.

P131 is promising, but does not seem so easy to use. Its definition is unclear and I can easily find inconsistencies in the data. See Quimper (Q342) :

A similar issue is visible with Marne-la-Vallée (Q1886380) (we really like that example ^^):
3 departments are listed under the P131 property, although Marne-la-Vallée only overlaps a part of each of them.

Note : here is the SPARQL query I used, to find geographical entities with multiple P131 statements.

nlehuby commented 6 years ago

Anyway, we have two candidate approaches to build the hierarchy (geographical inclusion and wikidata). We may choose one, and create some QA checks or additional tools to test our data against the other one.
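One possible shape for such a QA check, as a hedged sketch: both approaches are assumed to produce a mapping from a zone id to its list of direct parents, and the check reports every zone where the two disagree. The function name, the data layout, and the parent labels in the example are hypothetical; only Q1886380 (Marne-la-Vallée) comes from the discussion above.

```python
def compare_hierarchies(geo_parents, wikidata_parents):
    """Report zones whose direct parents differ between the two approaches.

    Only zones present in both mappings are compared. The result maps each
    disagreeing zone to the parents found by one source but not the other.
    """
    report = {}
    for zone in sorted(geo_parents.keys() & wikidata_parents.keys()):
        geo = set(geo_parents[zone])
        wikidata = set(wikidata_parents[zone])
        if geo != wikidata:
            report[zone] = {
                "only_geo": sorted(geo - wikidata),
                "only_wikidata": sorted(wikidata - geo),
            }
    return report


# Made-up example: geometry gives one parent, P131 lists three departments.
GEO = {"Q1886380": ["region"], "Q90": ["region"]}
P131 = {"Q1886380": ["dept_1", "dept_2", "dept_3"], "Q90": ["region"]}

print(compare_hierarchies(GEO, P131))
# {'Q1886380': {'only_geo': ['region'], 'only_wikidata': ['dept_1', 'dept_2', 'dept_3']}}
```

Whichever approach is chosen as the primary one, the report gives a list of zones worth reviewing by hand.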