Open nlehuby opened 6 years ago
Maybe the tests could be split in three:
I’m not sure exactly why you want to have it in a separate repository. If someone wants to suggest a fix or have an alternative reality, it would still be simple, no? Or did you just mean that the configuration should be in a .yaml and not in a .rs, but still in the same repository?
Nice categories, it seems fine to me.
I also think it's OK to put the tests in the same repository (and @nlehuby too :wink: ); we just want quality tests that are easily maintained (so no .rs).
Here is a proposal for a first step, only dealing with volumetric stats. We may enrich this in the future to compute other stats and add other tests (such as the geographical ones listed in this issue) or create another dedicated tool.

Todo:
- compute volumetric stats for each country
- test the stats against expected values (this may be a py.test module)
- provide an output format suitable to create a cool web dashboard hosted in a dedicated new repo: cosmogony data dashboard (we can discuss the name ;) )
In:
- a cosmogony file
- a file with reference values for the statistics, by country

For instance, a CSV file:
wikidata_id | zone_type | expected_min | expected_max | is_known_failure |
---|---|---|---|---|
Q142 | state | 18 | 18 | |
Q142 | state_district | 96 | 96 | |
Q142 | city | 35000 | 36000 | |
Q142 | city_district | 35000 | 36000 | yes |
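
A reference file like the one above could be loaded with Python's standard `csv` module. This is only a minimal sketch: it assumes a comma-separated file whose columns are named as in the table, and `Expectation` and `load_expectations` are hypothetical names, not part of cosmogony.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Expectation:
    wikidata_id: str        # country Wikidata id, e.g. "Q142" for France
    zone_type: str          # e.g. "state", "city"
    expected_min: int
    expected_max: int
    is_known_failure: bool


def load_expectations(csv_file):
    """Read reference values from an open CSV file (assumed comma-separated)."""
    reader = csv.DictReader(csv_file)
    return [
        Expectation(
            wikidata_id=row["wikidata_id"].strip(),
            zone_type=row["zone_type"].strip(),
            expected_min=int(row["expected_min"]),
            expected_max=int(row["expected_max"]),
            # Assumption: any non-empty value (such as "yes") marks a known failure.
            is_known_failure=bool(row["is_known_failure"].strip()),
        )
        for row in reader
    ]


# Two rows taken from the example table, written as plain CSV.
SAMPLE = """wikidata_id,zone_type,expected_min,expected_max,is_known_failure
Q142,state,18,18,
Q142,city_district,35000,36000,yes
"""

expectations = load_expectations(io.StringIO(SAMPLE))
```

Keeping the parsing separate from the checks would let the same reference file feed both the tests and a dashboard.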
Out:
- a stat file with statistics values by country
- the results of the tests

This could be a single csv file:
wikidata_id | zone_type | expected_min | expected_max | is_known_failure | obtained | test_status |
---|---|---|---|---|---|---|
Q142 | state | 18 | 18 | | 18 | ok |
Q142 | state_district | 96 | 96 | | 96 | ok |
Q142 | city | 35000 | 36000 | | 36678 | ko |
Q142 | city_district | 35000 | 36000 | yes | 4560 | ok |
Q142 | suburb | | | | 345 | skip |
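
The `test_status` column of this example could be derived with a small function like the one below. It is a sketch of the rules as they can be read from the example rows, not a specification: in particular, treating an out-of-range known failure as `ok` is an assumption inferred from the `city_district` row, and `compute_status` is a hypothetical name.

```python
from typing import Optional


def compute_status(obtained: int,
                   expected_min: Optional[int] = None,
                   expected_max: Optional[int] = None,
                   is_known_failure: bool = False) -> str:
    """Return "ok", "ko" or "skip" for one (country, zone_type) row."""
    # No reference values for this row: nothing to check.
    if expected_min is None or expected_max is None:
        return "skip"
    if expected_min <= obtained <= expected_max:
        return "ok"
    # Assumption: an out-of-range value flagged as a known failure
    # is reported as "ok" so that it does not fail the test run.
    return "ok" if is_known_failure else "ko"


# Values from the example rows: state, city, city_district, suburb.
statuses = [
    compute_status(18, 18, 18),
    compute_status(36678, 35000, 36000),
    compute_status(4560, 35000, 36000, is_known_failure=True),
    compute_status(345),
]
print(statuses)
```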
I like the general idea.
Where the data is hosted and what it is tested against doesn’t matter much to me (but I have a slight preference towards large mono-repos).
What do you mean by the wikidata_id? The property of that level? That might become a problem as cities can be of different types (think of the German Kreisfreie Stadt).
However, we could maybe add extra tests, like having 4 state_district Q202216 (département d’outre-mer) in France, as those might break easily with bad country shapes.
If we want no specific constraint, we can leave the wikidata_id empty.
Is that clear?
For now, wikidata_id stands for a country's Wikidata id (Q142 is France). It may be extended to any zone's Wikidata id in the future.
We could definitely use the Wikidata ontology to check the quality of our data. But I think your proposal adds a lot of complexity: we will need to explore Wikidata to map each of our zones to its Wikidata properties (to know that the Guadeloupe relation from our PBF is actually a Q202216 (overseas department of France)),
and we may also need to map the Wikidata ontology to libpostal zone types, country by country, in the same way as has been done for OSM ... For instance, we will need to make explicit that what we call
This seems possible and would add very valuable quality tests, but I really think we should start with a smaller task with no dependency on a Wikidata dump ;)
Init of reference values for country stats: https://github.com/osm-without-borders/cosmogony-data-dashboard/pull/1
Some ideas to test the quality of our dataset:
Non closed boundaries
We need to log the list of boundaries that could not be imported because they are not valid polygons / multipolygons.
Hierarchy coherence
Coverage stat and tests
By-country statistics: compute the geographical coverage in states, cities, etc. (example: 88% city coverage, which means that 88% of the country's territory is inside a city)
Persist expected values and test them in CI: for example:
Volumetric stat and tests
Stat: same as the coverage stats, but only raw numbers, without geographical concerns (example: the country Australia has 17 states)
Test:
Expected values for each country must be in a config file (CSV, YAML?) and not inside the source code, so that anybody can update them if needed.
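
As a rough illustration of the volumetric stat itself, counting zones per country and zone type could look like the sketch below. The input structure is an assumption made for the example: each zone is a plain dict with hypothetical `country` and `zone_type` keys, which is not the actual cosmogony output format.

```python
from collections import Counter


def volumetric_stats(zones):
    """Count zones per (country, zone_type) pair."""
    return Counter((zone["country"], zone["zone_type"]) for zone in zones)


# Hypothetical zones; "country" holds the country's Wikidata id.
zones = [
    {"country": "Q142", "zone_type": "state"},
    {"country": "Q142", "zone_type": "state"},
    {"country": "Q142", "zone_type": "city"},
    {"country": "Q408", "zone_type": "state"},
]

stats = volumetric_stats(zones)
for (country, zone_type), count in sorted(stats.items()):
    print(country, zone_type, count)
```

The resulting counts are the values that would then be compared with the expected minimum and maximum from the config file.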