tdwg / bdq

Biodiversity Data Quality (BDQ) Interest Group
https://github.com/tdwg/bdq

TG2-VALIDATION_GEOGRAPHY_CONSISTENT #95

Closed iDigBioBot closed 5 months ago

iDigBioBot commented 6 years ago
| TestField | Value |
| --- | --- |
| GUID | 78640f09-8353-411a-800e-9b6d498fb1c9 |
| Label | VALIDATION_GEOGRAPHY_CONSISTENT |
| Description | Is the combination of the values of the terms dwc:continent, dwc:country, dwc:countryCode, dwc:stateProvince, dwc:county, dwc:municipality consistent with the bdq:sourceAuthority? |
| TestType | Validation |
| Darwin Core Class | Location |
| Information Elements ActedUpon | dwc:continent, dwc:country, dwc:countryCode, dwc:stateProvince, dwc:county, dwc:municipality |
| Information Elements Consulted | |
| Expected Response | EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority is not available; INTERNAL_PREREQUISITES_NOT_MET if all of the terms dwc:continent, dwc:country, dwc:countryCode, dwc:stateProvince, dwc:county, dwc:municipality are EMPTY; COMPLIANT if the combination of values of dwc:continent, dwc:country, dwc:countryCode, dwc:stateProvince, dwc:county, dwc:municipality are consistent with the bdq:sourceAuthority; otherwise NOT_COMPLIANT |
| Data Quality Dimension | Conformance |
| Term-Actions | GEOGRAPHY_CONSISTENT |
| Parameter(s) | bdq:sourceAuthority |
| Source Authority | bdq:sourceAuthority default = "The Getty Thesaurus of Geographic Names (TGN)" [https://www.getty.edu/research/tools/vocabularies/tgn/index.html] |
| Specification Last Updated | 2023-09-22 |
| Examples | [dwc:continent="", dwc:country="Australia", dwc:countryCode="", dwc:stateProvince="WA", dwc:county="", dwc:municipality="": Response.status=RUN_HAS_RESULT, Response.result=COMPLIANT, Response.comment="dwc:stateProvince matches dwc:country"] [dwc:continent="", dwc:country="", dwc:countryCode="", dwc:stateProvince="WA", dwc:county="", dwc:municipality="": Response.status=RUN_HAS_RESULT, Response.result=NOT_COMPLIANT, Response.comment="dwc:stateProvince ambiguous as "WA" could be the state of Washington in the United States or Western Australia in Australia"] |
| Source | VertNet, Kurator |
| References | |
| Example Implementations (Mechanisms) | Kurator |
| Link to Specification Source Code | https://github.com/kurator-org/kurator-validation/blob/master/packages/kurator_dwca/workflows/dwca_geography_assessor.yaml |
| Notes | A fail condition may arise from the content being internally inconsistent (not all of the information can be true at the same time), or from the vocabulary being incapable of resolving the combination of geography field values. Additional tests could be devised against a geographic authority to report the distinct failure conditions. This test specifically does not consider the content of dwc:higherGeography. Note: that for this test to work, the lowest ranking element must be present and the higher ranking elements be consistent with it. This test is not recommended to be implemented because of one or more of the following criteria: Unavailable vocabularies; available vocabularies are ambiguous; too difficult to code; too complex to currently implement; implementation could lead to ambiguous or inaccurate results. |
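
For readers who want to see the order of checks in the Expected Response, here is a minimal sketch. It is not the Kurator implementation. The names `validate`, `toy_authority` and `TOY_ENTRIES` are hypothetical, the two-entry authority is invented toy data rather than TGN content, and treating "consistent" as "exactly one toy entry agrees with every supplied term" is an assumption chosen only so that the two specification examples come out as listed.

```python
from typing import Callable, Mapping, Optional, Tuple

TERMS = (
    "dwc:continent", "dwc:country", "dwc:countryCode",
    "dwc:stateProvince", "dwc:county", "dwc:municipality",
)

# Hypothetical toy authority entries; NOT real TGN data.
TOY_ENTRIES = [
    {"dwc:country": "Australia", "dwc:stateProvince": "WA"},
    {"dwc:country": "United States", "dwc:stateProvince": "WA"},
]


def toy_authority(non_empty: Mapping[str, str]) -> bool:
    """Assumption: consistent means exactly one toy entry agrees with every supplied term."""
    hits = [
        entry for entry in TOY_ENTRIES
        if all(entry.get(term) == value for term, value in non_empty.items())
    ]
    return len(hits) == 1


def validate(
    record: Mapping[str, str],
    authority: Optional[Callable[[Mapping[str, str]], bool]],
) -> Tuple[str, Optional[str]]:
    """Return (Response.status, Response.result) in the order given by the Expected Response."""
    # The source authority is not available.
    if authority is None:
        return ("EXTERNAL_PREREQUISITES_NOT_MET", None)
    # Keep only the terms that are not EMPTY.
    non_empty = {
        term: record[term].strip()
        for term in TERMS
        if (record.get(term) or "").strip()
    }
    # All six terms are EMPTY.
    if not non_empty:
        return ("INTERNAL_PREREQUISITES_NOT_MET", None)
    result = "COMPLIANT" if authority(non_empty) else "NOT_COMPLIANT"
    return ("RUN_HAS_RESULT", result)
```

With this toy data, the first specification example (country "Australia", stateProvince "WA") yields COMPLIANT and the second (stateProvince "WA" alone) yields NOT_COMPLIANT.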
iDigBioBot commented 6 years ago

Comment by John Wieczorek (@tucotuco) migrated from spreadsheet: This is a comprehensive geography comparison, not a field-by-field comparison. Example lookup table at https://github.com/VertNet/DwCVocabs/blob/master/vocabs/Geography.csv

ArthurChapman commented 6 years ago

This may not be the best example

godfoder commented 6 years ago

[Image: img_20180117_145858]

ArthurChapman commented 2 years ago

There has been discussion between Issues #95 and #139 - the wording converged over time such that the two tests appeared to be testing for the same thing. Discussion on Zoom has resulted in separating the two.

#139 is testing individual terms for validity at that level - it looks at only one level in the hierarchy at a time and checks the validity of what is there at the level.

#95 is testing for inconsistencies between levels in the hierarchy - for example Western Australia (WA) as a State, and USA as a country - i.e. one is wrong and thus ambiguous.

Tasilee commented 1 year ago

The conclusion from the Zoom discussion 22/8/2022 with @chicoreus, @tucotuco and @ArthurChapman suggests that this test is more a test for consistency than ambiguity. My suggestion from 17/8/2022 was "...we get ambiguity when we have only two non-empty terms that conflict in some way".

This same reasoning applies to #123.

I have edited the specifications accordingly and would value a careful review of all relevant items.

tucotuco commented 1 year ago

> The conclusion from the Zoom discussion 22/8/2022 with @chicoreus, @tucotuco and @ArthurChapman suggests that this test is more a test for consistency than ambiguity. My suggestion from 17/8/2022 was "...we get ambiguity when we have only two non-empty terms that conflict in some way".

This misses the case "WA", which we had in the original discussion and the reason why we used "ambiguity" rather than "consistent" in the test name. The "WA" alone case does not have two terms with which to check consistency. It has one term where there are multiple different geographic entities it could refer to. That is purely ambiguous and not inconsistent.

I think this is one of the tests we have gone in circles on, but I can no longer recall. If not, I think we are just about to. Is that a good indicator to separate the notions into two tests? The solutions for the two cases are distinct, so maybe it is a good idea. In the consistency case, something is definitively wrong and should be fixed. In the ambiguity case, something is missing and ought to be provided.

Tasilee commented 1 year ago

#139 is currently testing that each NOT_EMPTY geography term has an unambiguous match at the same level in the source authority:

...COMPLIANT if the individual values of dwc:continent, dwc:country, dwc:countryCode, dwc:stateProvince, dwc:county, dwc:municipality can be unambiguously resolved from the bdq:sourceAuthority)...

The example where we have only dwc:stateProvince="WA" will result in NOT_COMPLIANT if there is ambiguity.

#95 is currently testing for internal/input consistency between NOT_EMPTY input geography terms and the equivalents in the source authority, so the example above will be CONSISTENT.

...COMPLIANT if the combination of values of dwc:continent, dwc:country, dwc:countryCode, dwc:stateProvince, dwc:county, dwc:municipality are consistent with the bdq:sourceAuthority...

Are those tests what we require? Do the results of the white board examples still make sense for #95 and #139?

Maybe I am (still) crazy.

| Continent | Country | State/Province | County | 95 | 139 |
| --- | --- | --- | --- | --- | --- |
| | Australia | | | Consistent | Standard |
| | | WA | | Inconsistent | Standard |
| Oceania | Australia | Florida | | Inconsistent | Not standard |
| | | | Allachua | Consistent | Standard |
| | | | Jefferson | Consistent | Not standard |
| | Australia | WA | | Consistent | Standard |
| Oceania | Australia | WA | Jefferson | Inconsistent | Not standard |
| Oceania | Australia | WA | | Consistent | Standard |
| | Fred | | | Inconsistent | Not standard |

chicoreus commented 1 year ago

78640f09-8353-411a-800e-9b6d498fb1c9 was duplicated in #118 and #123, retained here, and replaced in those issues with new uuid values.

tucotuco commented 1 year ago

I maintain as argued in the comments of test #139 that the test is problematic, less useful, and should be dropped. It is dubious whether there is any realistic way to make sure each name "matches" the level in the hierarchy where it is placed. The Darwin Core term names are a poor reflection of the impressive variety of terms for the actual administrative levels in the world.

What's really important is how many distinct geographic entities the geography combination corresponds to in the source authority.

If there are zero matches, the input combination may have an error or the authority may be incomplete (or incorrect, such as out of date, no way to tell the difference, potentially difficult to solve). Getting zero matches would alert the tester of a potential problem of a particular nature (unknown-ness). If there is one match, the input can be uniquely understood vis-à-vis the source authority (unambiguously understandable - nothing to solve). If there is more than one match, something is ambiguous (not enough information provided to make the distinction - usually relatively easy to solve).

Once one thinks about implementation, this test becomes a lot more complex than it appears on the surface. Begin with the results of a "simple" implementation that gets results for the number of exact string matches for the rightmost not empty entry in a row where a) every entry to the left of it in the row can also be found as an exact match parent somewhere in the hierarchy above it, and b) the same pattern holds true successively for every not empty field except continent, which has no parent in Darwin Core. There is no single API call against any existing service that will do this, though one is planned for BELS.
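
To make the "simple" implementation above concrete, here is a rough sketch against a tiny in-memory hierarchy. Everything in it is an assumption for illustration: `AUTHORITY`, `ancestors` and `count_matches` are hypothetical names, the nine nodes are invented toy data (not TGN records, and "WA" is used as a node name only to mirror the examples in this thread), and the row is taken as an ordered list from continent down to municipality.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical toy hierarchy: node id -> (name, parent id). NOT real TGN data.
AUTHORITY: Dict[str, Tuple[str, Optional[str]]] = {
    "oceania": ("Oceania", None),
    "au": ("Australia", "oceania"),
    "wa_au": ("WA", "au"),
    "na": ("North America", None),
    "us": ("United States", "na"),
    "wa_us": ("WA", "us"),
    "fl": ("Florida", "us"),
    "alachua_co": ("Alachua", "fl"),
    "alachua_town": ("Alachua", "alachua_co"),
}


def ancestors(node_id: str) -> List[str]:
    """Names of all ancestors of a node, nearest parent first."""
    names = []
    parent = AUTHORITY[node_id][1]
    while parent is not None:
        names.append(AUTHORITY[parent][0])
        parent = AUTHORITY[parent][1]
    return names


def count_matches(row: Sequence[str]) -> int:
    """Count nodes matching the rightmost non-empty value whose ancestors
    contain every other non-empty value, in order, as exact string matches."""
    values = [v.strip() for v in row if v and v.strip()]
    if not values:
        return 0
    *higher, target = values
    count = 0
    for node_id, (name, _parent) in AUTHORITY.items():
        if name != target:
            continue
        chain = iter(ancestors(node_id))
        # Reading right to left, each higher value must appear further up
        # the chain than the previous one (the iterator is consumed as we go).
        if all(h in chain for h in reversed(higher)):
            count += 1
    return count
```

On this toy data, `count_matches(["", "", "WA", "", ""])` gives 2, `count_matches(["", "Australia", "WA", "", ""])` gives 1, and `count_matches(["Oceania", "Australia", "Florida", "", ""])` gives 0. Intermediate levels that are missing from the row are tolerated because only the order of the supplied values is checked.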

A slightly more sophisticated implementation would restrict the searches for values in the country field to record types (nation, dependent state, unincorporated territory, semi-independent political entity, etc.). Given this more sophisticated implementation, the table below shows the results one would get today for the Getty Thesaurus of Geographic Names.

Example continent country stateProvince county municipality Matches Comment
1 Australia 1
2 Russia 1
3 Russian Federation 1
4 Asia Russian Federation 1
5 Europe Russian Federation 0 TGN does not have Europe as a parent to the Russian Federation
6 Asia Russian Federation Moscow Oblast 1
7 Europe Russian Federation Moscow Oblast 0 TGN does not have Europe as a parent to the Russian Federation
8 Europe Moscow Oblast 1 TGN does have Europe as a parent to the Moscow Oblast, with no intervening parent
9 Moscow Oblast 1
10 WA >1 Matches Western Australia, Washington State (US)
11 Oceania Australia Florida 0
12 Alachua >1 There is an Alachua municipality in Alachua County
13 Florida Alachua >1 There is an Alachua municipality in Alachua County
14 Allachua 0 There is no feature named "Allachua"
15 Alachua Alachua 1 There is an Alachua municipality in Alachua County
16 Australia WA 1
17 Oceania Australia WA Jefferson 0 There is no exact match for "Jefferson" in Australia
18 Oceania Australia WA 1
19 Fred 0 There is no country-like feature named "Fred"
20 Fred 1 There is an inhabited place named "Fred" in Tyler County, Texas

The thing is, the count as a result would make this a measure test, I think, and I don't think that is what we want. Does that mean we are forced into having two tests, one to see if there are any matches and one to see if there is more than one match? Seems wasteful in terms of processing. Needs more thought. Open to suggestions.

ArthurChapman commented 1 year ago

Great summary, @tucotuco. I can see lots of value in an ideal world, but as you say an implementation nightmare. Would there be value in just doing a few consistency checks? Country, State/Province and perhaps Municipality? Many countries use County consistently, others don't, and there is a lot of difference in what is meant by County in different countries, and sometimes it includes >1 level. ArcInfo decided to just use ADM0, ADM1 and ADM2 rather than labelling them - I know there are other levels, but basically they use just these three in much of the data. Would there be value in us keeping this test and just using these three levels for this test?

tucotuco commented 1 year ago

@ArthurChapman First, I am not trying to get rid of this test. A functional implementation would be extremely useful.

However, as to your proposals, there is no way to do consistency checks by admin level with TGN. The vocabulary for the feature types does not map uniquely to Darwin Core terms and, as you pointed out, the levels are arbitrary.

Example, Brazil. If anyone uses a macroregion, the states get bumped down to level 2. If anyone uses a mesoregion or a microregion, counties and municipalities get bumped down for each of those. So a "county level" entity in Brazil could be at any of three different depths in the hierarchy, one of which could not even be captured in Darwin Core outside of dwc:higherGeography. Brazil is not unique in this phenomenon.

By the way, for posterity, Julian Kapoor, working with Robert Hijmans on GADM under the Biogeomancer Project, assembled this list of administrative level terms: Single Administrative area, Administrative county, Administrative Region, Aimag, Amt, Aprinki, Apskritis, Area, Arrondissement, Arrondissements, Arrondissment, Atoll, Autonomou, Autonomous city, Autonomous Commune, Autonomous Community, Autonomous Island, autonomous province, Autonomous Region, Autonomous Republic, Autonomous sector, Avtonomiuri respublika, Avtonomnaya oblast, Avtonomnyy okrug, Aymag, Baladiyah, (Banner), (Barangay), (Barony), Bibhag, Borough, Bundeslander, Canton, Capital city, Capital district, Capital Metropolitan City, Capital region, Capital Territory, Capitale d'état - zone spéciale, Castello, Census Area, Census Division, Centrally Administered Area, Cercle, Chantun, Chuan-shih, Circle, City, City and Borough, City and County, City Municipality, City/Municipality, Ciudades autónomas, Comarca, Comisaría, Commissiary, Commonwealth, Commune, Commune Autonome, (Community), Comuna, Comunidad Autónoma, Comunidad autónomas, Concelho, Constituen y, Constituency, Corregimiento, Corregimiento de, Country, County, (Crown Dpendency), Daerah Khusus ibuk, Daerah Istimewa, daerah-daerah, Departament, Departamento, Département, Départements, Departments, Dependencias Federales, Dependency, Development Region, Diamerismata, Distirct, District, District Municipality, Distrikkaya, Distrikt, Distrito, Distrito Capital, Distrito Federal, Distrito Municipal, Distrito Nacio, Division, Do, (Duchy), Dzongkhag, Economic Prefecture, Eilandgebieden, Emirate, Entity, Estado, Faritany Mizakatena, Faritra, Federal Dependency, Federal District, (Federal Subject), Federal Territory, Fivondronana, Fovaros, Fu, Fylke, Gorod, Gorsovet, Governorate, grad, Gwangyeoksi, Hlavni mesto, Hoofdstedelijke gewest, Hsien, (Hundred), Independent City, Independent Town, Intendancy, Intendencia, Intendency, Island, Island council, 
Island group, Island Region, Judet, Kabupaten, Kaghak, K'alak'i, Kampeng nakhon, Kanton, Kaupstadir, Kayaing, Ken, Khêt, Khetphiset, Khoueng, Kingdom, Kommuner, Kotamadya, Kraj, Kraje, Kray, Kreisfreie Städte, Krong, Laen, Land, Länd, Lander, Landsvæðun, (Legal entity), Local Authority, (Local Council), Maakond, Magisterial district, Marz, Megye, Mehoz, Metropolis, Metropolitan City, Miesto savivaldybė, Mintaqah, Mkoa, Moughataas, Muhafazah, Munic¡pio, Municipality, Municipio, Municipio Especial, Municipiu, Muong, National capital - special zone, National Capital Area, National Dist, National Territory, Neutral City, Neutral Zone, Nomos, Oblast, Oblasy, Opcine, Opština, Ostan, Parish, Parròquia, Part, Partido, Police Station, Prefecture, préfecture, préfecture economique, Prefegitura, propinsi, (Principality), Province, Provincia, Província, Provincie, Provinsie, (Public body), Pyine, Qark, Région capitale, Raion, Raione, Rajoni, Rajono savivaldybė, Rayon, Reef, Regency, Região, Regierungsbezirk, Region, Région, Región Autónoma, Regional council, (Regional County Municipality), Regional District, Regional Municipality, Regione, Republic, Respublika, Ressort, (Riding), Rural District, (Rural Municipality), Sahar, Savivaldybė, Sector, Sector autónomo, See, Senatorial District, Sha`biyah, Sheng, Shih, (Shire), Si, sous-préfecture, Sous-régions, Special City, (Special administrative region), Special district, Special Municipal, Special municipality, Special region, Special region or zone, Srok, State, Statistical Region, Statisticna regij, Subdistrict, Sub-district, Sub-prefecture, Sub-region, Sýsla, Syssel, Taluk, Tarafa, Territoire, Territorial authority, Territorial Unit, Territorio Nacional, Territory, Teukbyeolsi, Thana, Thanh Pho, (Theme), Tinh, To, Todof, (Town), Town council, (Township), Traditional county, Union territo, Union territor, Unitary authority, United Counties, unknown, Upazila, Urban district, Urban prefectur, velayat, Vikas kshetra, Village, Ville 
Neutre, Voblasts', Voivodship, water bodies, Wilaya, Wilayah persekutuan, Wilayat, Wojewodztwa, Yin, Zila, Zizhiqu, Zupanija, županija.

Tasilee commented 1 year ago

The zoom discussion with @ArthurChapman, @tucotuco and @chicoreus today concluded that tests #95, #139 and #118 were going to be very difficult to implement properly given the lack of a consistent geographic terms hierarchy by comparison with the taxonomic terms. Note the issues arising from the table above for example. We will therefore remove these tests from CORE.

In their place, we will

  1. Add a test for dwc:stateProvince found to complement #21 (which we will rename)
  2. Add a test for dwc:country dwc:stateProvince combo exists at least once in the bdq:sourceAuthority (country-state/province consistent)
  3. Add a test for dwc:country dwc:stateProvince combo exists exactly once in the bdq:sourceAuthority (country-state/province unambiguous)
chicoreus commented 1 year ago

Further notes from the zoom discussion with @ArthurChapman, @tucotuco and @Tasilee:

Continent values and their use tend to be very inconsistent between data in the wild and source authorities. Conclusion was to focus on dwc:country and dwc:stateProvince values as noted above.

The matches column in @tucotuco's table above clarified the problems we have been having untangling the concepts of consistency and unambiguity in hierarchically organized data. The concept we have been trying to label consistency aligns with the property of having one or more matches on the source authority. The concept we have been trying to label unambiguity aligns with the property of having exactly one match on the source authority.
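
The distinction drawn above (consistent means one or more matches, unambiguous means exactly one) can be sketched for dwc:country and dwc:stateProvince as follows. This is an illustrative assumption, not an agreed specification: `TOY_PAIRS`, `pair_matches`, `is_consistent` and `is_unambiguous` are hypothetical names, the pairs are invented toy data rather than source authority content, and treating an empty value as "no constraint" is a choice made only for this example.

```python
from typing import List, Tuple

# Hypothetical toy (country, stateProvince) pairs; NOT real authority data.
TOY_PAIRS: List[Tuple[str, str]] = [
    ("Australia", "WA"),
    ("United States", "WA"),
    ("United States", "Florida"),
]


def pair_matches(country: str, state_province: str) -> int:
    """Count toy pairs agreeing with the supplied values.
    Assumption: an empty value places no constraint on that term."""
    country = country.strip()
    state_province = state_province.strip()
    return sum(
        1
        for c, s in TOY_PAIRS
        if (not country or c == country)
        and (not state_province or s == state_province)
    )


def is_consistent(country: str, state_province: str) -> bool:
    """One or more matches in the toy authority."""
    return pair_matches(country, state_province) >= 1


def is_unambiguous(country: str, state_province: str) -> bool:
    """Exactly one match in the toy authority."""
    return pair_matches(country, state_province) == 1
```

Here "WA" alone is consistent but not unambiguous (two matches), "Australia" with "WA" is both, and "Australia" with "Florida" is neither. Both answers come from a single count, so the authority only needs to be queried once per record.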

As noted in @tucotuco's list of divisions above, the ranks found in Getty do not neatly align with dwc:country and dwc:stateProvince, a simple example being United Kingdom (Nation), England (Country) in Getty, where, given #62, we would expect dwc:country to have a value that would match to the United Kingdom (Nation) value in Getty, rather than the included country level term in Getty.

tucotuco commented 1 year ago

Specifically, what we mean by country in Darwin Core is an administrative entity corresponding to place types "nation", "dependent state", "unincorporated territory", "semi-independent political entity", etc. where the list covers all of the entities in the list of ISO country codes.
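
As a small illustration of restricting dwc:country lookups by place type, consider the sketch below. The names `COUNTRY_LIKE_TYPES`, `TOY_RECORDS` and `country_matches` are hypothetical, the set holds only the four place types named above (the "etc." means the real list is longer), and the three records are toy data based on examples mentioned in this thread, not an extract from TGN.

```python
from typing import List, Tuple

# Only the place types named in the comment above; the full list is longer.
COUNTRY_LIKE_TYPES = {
    "nation",
    "dependent state",
    "unincorporated territory",
    "semi-independent political entity",
}

# Hypothetical toy records: (name, place type). NOT an extract from TGN.
TOY_RECORDS: List[Tuple[str, str]] = [
    ("United Kingdom", "nation"),
    ("England", "country"),
    ("Fred", "inhabited place"),
]


def country_matches(value: str) -> int:
    """Count toy records with this exact name and a country-like place type."""
    value = value.strip()
    return sum(
        1
        for name, place_type in TOY_RECORDS
        if name == value and place_type in COUNTRY_LIKE_TYPES
    )
```

With this filter, "United Kingdom" matches once as a dwc:country value, while "England" and "Fred" match zero times because their place types fall outside the country-like set.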

chicoreus commented 1 year ago

Important summary here, but this test appears to be intractable to implement, so marking as non-core after discussion.

ArthurChapman commented 10 months ago

Added to the Notes (see comments under #123 for discussion of reasons. This is a parallel case.)

"Note: that for this test to work, the lowest ranking element must be present and the higher ranking elements be consistent with it."

Do we need to reword the Expected Response?

chicoreus commented 9 months ago

Changed Field to TestField, added ActedUpon/Consulted, added date last modified.

ArthurChapman commented 9 months ago

Changed "Output Type" to TestType and deleted "Warning Type". Updated Specification Last Updated

ArthurChapman commented 5 months ago

@Tasilee - I thought this was a DO NOT IMPLEMENT given our definitions

Tasilee commented 4 months ago

Aligned specifications to match current template