Open antoine-de opened 6 years ago
more thoughts on the subject:
I feel that a rigid tree hierarchy works well for the administrative regions. A suburb can have at most one city_district (or none and be linked directly to a city), and it's the same between the city -> state_district -> state -> country_region -> country.
The categories are optional (maybe apart from the country), and we can imagine places where the children of a country are heterogeneous (cities, states, ...).
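To make the idea concrete, here is a minimal sketch of such a rigid tree with optional levels. The `Zone` class, the `LEVELS` list and the `ancestors` helper are names made up for this illustration, not something cosmogony defines, and the level given to each example zone is only illustrative.

```python
from dataclasses import dataclass
from typing import Optional

# Levels from most local to most global, in the order listed above.
LEVELS = ["suburb", "city_district", "city", "state_district",
          "state", "country_region", "country"]


@dataclass
class Zone:
    name: str
    level: str                       # one of LEVELS
    parent: Optional["Zone"] = None  # at most one administrative parent

    def __post_init__(self):
        # A parent must sit strictly higher in the hierarchy, but
        # intermediate levels may be skipped because they are optional.
        if self.parent is not None:
            if LEVELS.index(self.parent.level) <= LEVELS.index(self.level):
                raise ValueError(
                    f"{self.parent.name} cannot be the parent of {self.name}")


def ancestors(zone):
    """Walk up the single-parent chain and return the ancestor names."""
    names = []
    while zone.parent is not None:
        zone = zone.parent
        names.append(zone.name)
    return names


france = Zone("France", "country")
paris = Zone("Paris", "city", france)           # no state level in between
suburb = Zone("some suburb", "suburb", paris)   # no city_district either
print(ancestors(suburb))  # ['Paris', 'France']
```

Heterogeneous children come for free here: a country can directly hold cities, states or anything lower, as long as the child is lower than the parent.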
This model does not however work for at least 2 examples (feel free to add counter examples, I'm sure there are more):
eg. Marne-la-Vallée, which is a group of cities in France. It has no administrative meaning, but is well known by locals.
eg. Le Marais, a French touristic zone that spans parts of 2 Paris districts
eg. La Défense, near Paris, spans parts of 5 cities.
it can also happen with non official neighborhoods that cross several districts
neither the non administrative zone nor the administrative zone contains the other, they just intersect.
I think it's nice for any zone (administrative or not) to have administrative parents, as it's helpful to know that Marne la vallée is part of île de france, thus part of france.
One idea would be, for any zone to have:
As a starting point I think we can use the same algorithm for both relationships: a zone needs to be exactly included in an administrative region's shape to be part of it, and it needs to be exactly included in a non-administrative shape to be linked to it.
All the cities of Marne-la-Vallée are soft linked to it but are part of Seine-et-Marne.
There is no soft link between Le Marais and either the 3rd Paris district or the 4th Paris district, because there is no inclusion; Le Marais is just part of Paris.
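A rough sketch of that inclusion rule, assuming axis-aligned boxes as a stand-in for real shapes (a real implementation would test polygon inclusion). The `Box` type, the `links` function, the coordinates and the "city A" zone are all invented for the illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def contains(outer, inner):
    """True when `inner` is entirely included in `outer`."""
    return (outer.min_x <= inner.min_x and outer.min_y <= inner.min_y
            and outer.max_x >= inner.max_x and outer.max_y >= inner.max_y)


def links(zone, shapes, administrative):
    """Return (part_of, soft_links) for `zone`, using the same inclusion
    test for both kinds of relationship."""
    part_of, soft_links = [], []
    for other, other_shape in shapes.items():
        if other == zone or not contains(other_shape, shapes[zone]):
            continue
        if other in administrative:
            part_of.append(other)
        else:
            soft_links.append(other)
    return part_of, soft_links


# Made-up coordinates, only the inclusions and overlaps matter.
shapes = {
    "Paris": Box(0, 0, 10, 10),
    "3rd district": Box(0, 0, 5, 5),
    "4th district": Box(5, 0, 10, 5),
    "Le Marais": Box(3, 1, 7, 4),          # overlaps both districts
    "Seine-et-Marne": Box(20, 0, 40, 20),
    "Marne-la-Vallée": Box(22, 2, 30, 8),
    "city A": Box(23, 3, 25, 5),           # hypothetical city
}
administrative = {"Paris", "3rd district", "4th district",
                  "Seine-et-Marne", "city A"}

print(links("Le Marais", shapes, administrative))  # (['Paris'], [])
print(links("city A", shapes, administrative))
# (['Seine-et-Marne'], ['Marne-la-Vallée'])
```

Note that this returns every containing zone, not only the closest one; picking the direct parent would need an extra step.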
To attach zones to a point (like we need to do in a geocoder), we'll search for all leaf zones that contain the point.
As a first implementation, we can even just search for all zones that contain the point and filter the leaves (so lowest level admin + all non related non-administrative zones)
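The "first implementation" could look roughly like this. It is only a sketch: shapes are again simplified to boxes given as `(min_x, min_y, max_x, max_y)`, and `leaf_zones`, the `parents` mapping and the coordinates are assumptions made for the example.

```python
def leaf_zones(point, shapes, parents):
    """Return the zones containing `point` that are not the parent
    (administrative or soft) of another zone containing it.

    shapes:  zone name -> (min_x, min_y, max_x, max_y)
    parents: zone name -> set of direct parents (part-of and soft links)
    """
    x, y = point
    hits = {name for name, (x0, y0, x1, y1) in shapes.items()
            if x0 <= x <= x1 and y0 <= y <= y1}
    # Any zone that one of the hits declares as a parent is not a leaf.
    non_leaves = set()
    for name in hits:
        non_leaves |= parents.get(name, set())
    return sorted(hits - non_leaves)


# Made-up coordinates, only the inclusions and overlaps matter.
shapes = {
    "France": (-100, -100, 100, 100),
    "Paris": (0, 0, 10, 10),
    "3rd district": (0, 0, 5, 5),
    "4th district": (5, 0, 10, 5),
    "Le Marais": (3, 1, 7, 4),
}
parents = {
    "Paris": {"France"},
    "3rd district": {"Paris"},
    "4th district": {"Paris"},
    "Le Marais": {"Paris"},
}

print(leaf_zones((4, 2), shapes, parents))  # ['3rd district', 'Le Marais']
print(leaf_zones((8, 8), shapes, parents))  # ['Paris']
```

The first point gets the lowest admin level plus the unrelated non-administrative zone, which is the behaviour described above.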
I don't know :wink:
le marais, get all the cities that are part of Marne la Vallée), thus the raw dataset is a bit harder to use
I like the idea of having a strong representation that everybody can hang onto, and something less organized for local specific things.
A post code could be a soft zone, right? If we ever get a shape of those (that might be a problem in many places of the world) then you could be precise, otherwise a list should be sufficient.
As you mentioned, le Marais is better known by the locals than the administrative quarters.
The problem here is that the border is not specific, but rather fuzzy.
An attribute on the zone could help here, even if we don’t know how to use the data (or represent it on OpenStreetMap).
It would also be nice to be able to represent the Schengen zone.
The airport of Paris is not in Paris, and tourists think that the Château de Versailles is in Paris.
This would mean that the ontology has a notion of who’s asking?
With strict inclusions, we will have a tree :palm_tree:, which is nice.
The lowest level must be variable. The commune Paris is divided in arrondissements and each arrondissement is divided in quartier administratif. Most French communes are the lowest subdivision.
Useless trivia: Google believes there is a quarter :banana: in Paris. Unknown by the inhabitants and not an official one either https://encrypted.google.com/search?hl=en&q=quartiers%20la%20banane%20paris
When going from the lowest to the highest, the system needs to have holes. For instance the commune Nantes belongs to the Métropole Nantes, but not every commune belongs to a Métropole.
I don’t think it is a problem.
Useless trivia : this island is under direct authority of a ministry, with no intermediate administration https://en.wikipedia.org/wiki/Clipperton_Island
Åland belongs to Finland. Finland belongs to the European Union, yet Åland does not belong to the European Union (yay! cheap booze :champagne:)
For the vast majority of situations, this can be ignored. Maybe it could be handled with an explicit exception once the big work is done.
There seem to be surprisingly few situations of that kind https://en.wikipedia.org/wiki/Dependent_territory
A territory can be under the sovereignty of two countries, like https://en.wikipedia.org/wiki/Pheasant_Island
Ok. I think we can ignore this one.
If I might be so bold, the 🍌 area actually does exist and is known to, at least some, inhabitants.
This could be a typical example of how different people view the same area differently.
Indeed, I should not take my ignorance as a general rule. I would be curious to know where the data from Google comes from
Regarding the tree structure, not sure it works. It can be a DAG though I think.
Take postal codes for example. In France you will have, potentially, several communes to a single postcode, but in the UK, many cities have more than one postcode. If you want to handle this worldwide, you need a separate branch for postcodes from that of admin, imo.
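A small sketch of why mixing postcodes into the admin hierarchy breaks the tree, while keeping them in a separate branch does not. All zone and postcode names below are hypothetical, and `has_single_parents` is a helper invented for this example (it only checks the "at most one parent" property, not cycles).

```python
from collections import defaultdict


def has_single_parents(edges):
    """True when every child in (child, parent) edges has one parent,
    i.e. the structure can still be read as a tree/forest."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
    return all(len(found) <= 1 for found in parents.values())


# Admin branch: each commune belongs to one department.
admin_edges = [
    ("commune A", "department D"),
    ("commune B", "department D"),
]

# Postal branch: the direction of the relation depends on the country.
postal_edges = [
    ("commune A", "postcode 1"),   # French style: several communes,
    ("commune B", "postcode 1"),   # one postcode
    ("postcode X1", "city X"),     # UK style: several postcodes,
    ("postcode X2", "city X"),     # one city
]

print(has_single_parents(admin_edges))                 # True
print(has_single_parents(postal_edges))                # True
print(has_single_parents(admin_edges + postal_edges))  # False
```

Merged into one structure, "commune A" has two parents, so the result is a DAG; kept apart, each branch stays a tree.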
hum for postal codes, don't you think soft links (so outside the official hierarchy) would be enough ?
You're right, the tree Vs DAG is really an important question, we really need to think about this carefully
Some thoughts concerning wikidata.
It is a database closely linked to Wikipedia defining semantic relations between objects.
The licence is CC0, so that won’t be a problem.
With OSM as the geographic leg and Wikidata as the semantic one, Cosmogony should be able to have all the needed information.
First obvious benefit: the ID will probably be much more stable than OSM elements or even Wikipedia pages.
It handles the historization of elements, meaning that an Id will not be recycled for a new object (e.g. two communes that merge).
Paris will always be Q90.
The wikidata ID should be in the OSM object tags. We should run a batch to get the order of magnitude of admin objects without a wikidata id.
There is already a hierarchy with the property P131 that indicates the belonging to a larger zone.
This could avoid some wrong hierarchies that would be only detected through geographical inclusion (simplified borders, weird enclaves…)
Wikidata has good chances to keep working over time. Any hand made fix will therefore stay there for good and will help to improve commons.
This will reduce the need of adhoc databases.
The dump is 20Gb large. This will be a problem for someone working on a small territory.
My guess is that it will be very easy to generate a subset that focuses on the admin regions.
I also had a look at wikidata as a potential source of information to build the hierarchy. I agree stable IDs would be useful.
P131 is promising, but does not seem so easy to use. Its definition is unclear and I can easily find inconsistencies in the data. See Quimper (Q342) :
A similar issue is visible with Marne-la-Vallée (Q1886380) (we really like that example ^^):
3 departments are listed under P131 property, although Marne-la-Vallée just overlaps a part of them.
Note : here is the SPARQL query I used, to find geographical entities with multiple P131 statements.
Anyway, we have two candidate approaches to build the hierarchy (geographical inclusion and wikidata). We may choose one, and create some QA checks or additional tools to test our data against the other one.
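Such a QA check could be as simple as diffing the parents found by each approach. This is only a sketch: `compare_parents` is a made-up helper, both inputs are assumed to be already extracted as plain dictionaries, and the department labels are placeholders rather than real Wikidata values.

```python
def compare_parents(geo, p131):
    """Report zones whose parents differ between the two approaches.

    geo:  zone -> set of parents found by geographical inclusion
    p131: zone -> set of parents listed under Wikidata P131
    """
    report = {}
    for zone in sorted(set(geo) | set(p131)):
        from_geo = geo.get(zone, set())
        from_p131 = p131.get(zone, set())
        if from_geo != from_p131:
            report[zone] = {
                "only_geo": sorted(from_geo - from_p131),
                "only_p131": sorted(from_p131 - from_geo),
            }
    return report


# Placeholder data shaped like the Marne-la-Vallée case above: inclusion
# gives one parent, P131 lists three departments that it only overlaps.
geo = {
    "Marne-la-Vallée": {"Île-de-France"},
    "city A": {"department 1"},
}
p131 = {
    "Marne-la-Vallée": {"department 1", "department 2", "department 3"},
    "city A": {"department 1"},
}

for zone, diff in compare_parents(geo, p131).items():
    print(zone, diff)
```

Zones that agree in both sources are left out of the report, so only the suspicious ones need a manual look.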
Here's the issue to start the discussion about schema of our zones "hierarchy".
The aim of this issue is to fill the concerned section in the README
here are my non structured thoughts:
categories
I like libpostal categories, libpostal is quite a reference in the address parsing world and we can hope their categories can handle all the countries specificities all around the world, but I don't think it handles all the corner cases (and it's not the only category out there, for example Wof uses another).
libpostal does not handle non administrative regions apart from the suburb (and maybe the country_region). So it would be difficult to represent Marne-la-Vallée or parc du mercantour
There is also the question of postal codes. I don't know whether we could/should have postal code zones in the hierarchy (should we create a separate issue for this ?)
Pyramidal hierarchy or graph-based ?
Can a zone have at most one parent or can it have several?
I feel that it might be a failing of Wof to have a pyramidal hierarchy. I don't think it will complicate cosmogony that much to be able to have several parents. I don't think it's useful for purely administrative regions (but maybe there are countries where it's relevant), but for non-administrative regions I think a pyramidal hierarchy will be too restrictive.
Eg. what would we link Marne-la-Vallée to ? ile de france ? but then it would be difficult to link it back to the cities that are part of it. The same applies for non official suburbs that can span across several districts
links coherence
Wof hierarchy is nice, but being linked to all parents brings incoherence (like the france empire that contains the france country, while the empire has fewer descendants than the country). I feel like outputting only the first level of relationship forces the dataset to be coherent (even if it will make the dataset harder to use without tools)
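The kind of tool that would make a "first level only" dataset easy to use is small. A possible sketch, assuming the dataset is loaded as a mapping from each zone to its direct parents (`all_ancestors`, the mapping and the "city A" zone are hypothetical):

```python
def all_ancestors(zone, direct_parents):
    """Collect every ancestor of `zone` from direct-parent links only.
    Works when a zone has several parents (graph-based hierarchy)."""
    seen = set()
    stack = list(direct_parents.get(zone, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already reached through another parent
        seen.add(current)
        stack.extend(direct_parents.get(current, ()))
    return seen


# Only the first level of relationship is stored.
direct_parents = {
    "city A": ["Marne-la-Vallée", "Seine-et-Marne"],
    "Marne-la-Vallée": ["Île-de-France"],
    "Seine-et-Marne": ["Île-de-France"],
    "Île-de-France": ["France"],
}

print(sorted(all_ancestors("city A", direct_parents)))
```

Since the full list of ancestors is derived and never stored, it cannot disagree with the direct links.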