osm-without-borders / cosmogony

easy to use & easy to update geographic regions
http://cosmogony.world
Apache License 2.0

Quality Assurance #4

Open nlehuby opened 6 years ago

nlehuby commented 6 years ago

Some ideas to test the quality of our dataset:

Non-closed boundaries

We need to log the list of boundaries that could not be imported because they are not valid polygons / multipolygons.
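A minimal sketch of what such a log could look like, assuming each boundary is available as a list of rings (lists of coordinate pairs). The names `is_closed_ring`, `find_unclosed_boundaries` and the `relation/…` ids are hypothetical, and ring closure is only one of the conditions for a valid polygon (self-intersections are not checked here):

```python
import logging

logger = logging.getLogger("boundary_qa")  # hypothetical logger name


def is_closed_ring(ring):
    """A ring is closed when it has at least 4 points and ends where it starts."""
    return len(ring) >= 4 and ring[0] == ring[-1]


def find_unclosed_boundaries(boundaries):
    """Return the ids of boundaries that have no rings or an unclosed ring, logging each one."""
    rejected = []
    for boundary_id, rings in boundaries.items():
        if not rings or not all(is_closed_ring(ring) for ring in rings):
            logger.warning("boundary %s is not a closed polygon, skipping import", boundary_id)
            rejected.append(boundary_id)
    return rejected


# Illustrative data only: one closed triangle and one ring missing its closing point.
boundaries = {
    "relation/1": [[(0, 0), (1, 0), (1, 1), (0, 0)]],
    "relation/2": [[(0, 0), (1, 0), (1, 1)]],
}
rejected = find_unclosed_boundaries(boundaries)
```

Returning the list as well as logging it would make it easy to persist the rejected ids next to the other statistics.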

Hierarchy coherence

Coverage stat and tests

By-country statistics: compute the geographical coverage in states, cities, etc. (example: 88% city coverage, which means that 88% of the country's territory is inside a city)

Persist expected values and test them in CI: for example:
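One possible shape for such a check, as a rough sketch: it assumes planar coordinates, simple polygons and non-overlapping zones, which real OSM geometries do not guarantee. The function names and the `EXPECTED_CITY_COVERAGE` bounds are hypothetical.

```python
def polygon_area(ring):
    """Planar area of a simple polygon, using the shoelace formula."""
    total = 0.0
    # Pair each vertex with the next one, wrapping around to the first.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def coverage_ratio(country_ring, zone_rings):
    """Share of the country area covered by the zones (assumed not to overlap)."""
    country_area = polygon_area(country_ring)
    if country_area == 0:
        raise ValueError("country polygon has no area")
    return sum(polygon_area(ring) for ring in zone_rings) / country_area


# Illustrative data only: a 10 x 10 country with two cities covering 80 + 8 units.
country = [(0, 0), (10, 0), (10, 10), (0, 10)]
cities = [
    [(0, 0), (10, 0), (10, 8), (0, 8)],
    [(0, 8), (4, 8), (4, 10), (0, 10)],
]
coverage = coverage_ratio(country, cities)

# Hypothetical persisted bounds that a CI job could check against.
EXPECTED_CITY_COVERAGE = (0.85, 0.95)
coverage_is_expected = EXPECTED_CITY_COVERAGE[0] <= coverage <= EXPECTED_CITY_COVERAGE[1]
```

With this toy data the city coverage is 88%, matching the example figure above.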

Volumetric stat and tests

Stat: same as above, but only raw numbers, without geographical concerns (example: the country Australia has 17 states)

Test:

Expected values for each country must be in a config file (CSV, YAML?) and not inside the source code, so that anybody can update them if needed.
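The volumetric stat could be as small as a counter over the zones. This is only a sketch: it assumes the zones are available as dictionaries with hypothetical `country` and `zone_type` keys, which is not necessarily how a cosmogony file is laid out.

```python
from collections import Counter


def volumetric_stats(zones):
    """Count zones per (country, zone_type) pair, ignoring geometry entirely."""
    return Counter((zone["country"], zone["zone_type"]) for zone in zones)


# Illustrative data only; the keys and values are assumptions.
zones = [
    {"country": "AU", "zone_type": "state"},
    {"country": "AU", "zone_type": "state"},
    {"country": "AU", "zone_type": "city"},
    {"country": "FR", "zone_type": "state"},
]
stats = volumetric_stats(zones)
```

A `Counter` returns 0 for pairs that never occur, so a country with no zone of a given type does not need special handling when the counts are compared to expected values.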

Tristramg commented 6 years ago

Maybe the tests could be split in three:

I’m not sure exactly why you want to have it in a separate repository. If someone wants to suggest a fix or have an alternative reality, it would still be simple, no? Or did you just mean that the configuration should be in a .yaml and not in a .rs, but still in the same repository?

antoine-de commented 6 years ago

Nice categories, they seem fine to me.

I also think it's OK to put the tests in the same repository (and so does @nlehuby :wink: ); we just want quality tests that are easy to maintain (so no .rs).

nlehuby commented 6 years ago

Here is a proposal for a first step, dealing only with volumetric stats. We may enrich this in the future to compute other stats and add other tests (such as the geographical ones listed in this issue), or create another dedicated tool.

Todo:
- compute volumetric stats for each country
- test the stats against expected values (this may be a py.test module)
- provide an output format suitable for creating a cool web dashboard hosted in a dedicated new repo: cosmogony data dashboard (we can discuss the name ;) )

In:
- a cosmogony file
- a file with statistical reference values by country

For instance, a CSV file:

wikidata_id,zone_type,expected_min,expected_max,is_known_failure
Q142,state,18,18,
Q142,state_district,96,96,
Q142,city,35000,36000,
Q142,city_district,35000,36000,yes
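Reading such a reference file could look like the sketch below. It assumes a comma-separated layout with the header shown above and treats `yes` as the only truthy value for `is_known_failure`; the `Expectation` class and `load_expectations` function are hypothetical names.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Expectation:
    wikidata_id: str
    zone_type: str
    expected_min: int
    expected_max: int
    is_known_failure: bool


def load_expectations(csv_text):
    """Parse reference values from CSV text into Expectation records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        Expectation(
            wikidata_id=row["wikidata_id"],
            zone_type=row["zone_type"],
            expected_min=int(row["expected_min"]),
            expected_max=int(row["expected_max"]),
            # An empty or missing cell means "not a known failure".
            is_known_failure=(row.get("is_known_failure") or "").strip().lower() == "yes",
        )
        for row in reader
    ]


REFERENCE_CSV = """wikidata_id,zone_type,expected_min,expected_max,is_known_failure
Q142,state,18,18,
Q142,city_district,35000,36000,yes
"""
expectations = load_expectations(REFERENCE_CSV)
```

Keeping the parsing separate from the test logic means the file format (CSV or YAML) can change without touching the tests.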

Out:
- a stat file with statistical values by country
- the results of the tests

This could be a single csv file:

wikidata_id,zone_type,expected_min,expected_max,is_known_failure,obtained,test_status
Q142,state,18,18,,18,ok
Q142,state_district,96,96,,96,ok
Q142,city,35000,36000,,36678,ko
Q142,city_district,35000,36000,yes,4560,ok
Q142,suburb,,,,345,skip
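The `test_status` column above can be derived with a small rule. This is a sketch inferred from the sample rows only: the `evaluate` function is a hypothetical name, and treating a known failure that lands inside the expected range as `ok` is an assumption the proposal does not spell out.

```python
def evaluate(obtained, expected_min=None, expected_max=None, is_known_failure=False):
    """Return 'ok', 'ko' or 'skip' for one (country, zone_type) statistic."""
    if expected_min is None or expected_max is None:
        # No reference values for this zone type: nothing to test.
        return "skip"
    if expected_min <= obtained <= expected_max:
        return "ok"
    # Out of range: tolerated only when flagged as a known failure.
    return "ok" if is_known_failure else "ko"


# The sample rows from the table above.
statuses = [
    evaluate(18, 18, 18),                                # state
    evaluate(96, 96, 96),                                # state_district
    evaluate(36678, 35000, 36000),                       # city
    evaluate(4560, 35000, 36000, is_known_failure=True), # city_district
    evaluate(345),                                       # suburb
]
# statuses == ["ok", "ok", "ko", "ok", "skip"]
```

Known failures behave like expected failures in a test suite: they are reported without breaking the build.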
Tristramg commented 6 years ago

I like the general idea.

Where the data is hosted and what it is tested against doesn’t matter much to me (but I have a slight preference for large mono-repos).

What do you mean by the wikidata_id? The property of that level? That might become a problem, as cities can be of different types (think of the German Kreisfreie Stadt).

However, we could maybe add extra tests, like having 4 state_district Q202216 (département d’outre-mer) in France, as those might break easily with bad country shapes.

If we want no specific constraint, we can leave the wikidata_id empty.

Is that clear?

nlehuby commented 6 years ago

For now, wikidata_id stands for a country's Wikidata id (Q142 is France). It may be extended to any zone's Wikidata id in the future.

We could definitely use the Wikidata ontology to check the quality of our data. But I think your proposal adds a lot of complexity: we would need to explore Wikidata to map each of our zones to its Wikidata properties (to know that the Guadeloupe relation from our PBF is actually a Q202216 (overseas department of France))

and we may also need to map the Wikidata ontology to libpostal zone types, country by country, in the same way as has been done for OSM ... For instance, we will need to make explicit that what we call

This seems possible and would add very valuable quality tests, but I really think we should start with a smaller task with no dependency on a Wikidata dump ;)

nlehuby commented 6 years ago

init of reference values for countries stat: https://github.com/osm-without-borders/cosmogony-data-dashboard/pull/1