Open iDigBioBot opened 6 years ago
TestField | Value |
---|---|
GUID | f18a470b-3fe1-4aae-9c65-a6d3db6b550c |
Label | VALIDATION_COORDINATESSTATEPROVINCE_CONSISTENT |
Description | Do the geographic coordinates fall on or within the boundary from the bdq:sourceAuthority for the given dwc:stateProvince or within the distance given by bdq:spatialBufferInMeters outside that boundary? |
TestType | Validation |
Darwin Core Class | dcterms:Location |
Information Elements ActedUpon | dwc:stateProvince, dwc:decimalLatitude, dwc:decimalLongitude |
Information Elements Consulted | |
Expected Response | EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority is not available; INTERNAL_PREREQUISITES_NOT_MET if the values of dwc:decimalLatitude or dwc:decimalLongitude are bdq:Empty or invalid, or dwc:stateProvince is bdq:Empty or not found in the bdq:sourceAuthority; COMPLIANT if the geographic coordinates fall on or within the boundary in the bdq:sourceAuthority for the given dwc:stateProvince (after coordinate reference system transformations, if any, have been accounted for), or within the distance given by bdq:spatialBufferInMeters outside that boundary; otherwise NOT_COMPLIANT. |
Data Quality Dimension | Consistency |
Term-Actions | COORDINATESSTATEPROVINCE_CONSISTENT |
Parameter(s) | bdq:sourceAuthority, bdq:spatialBufferInMeters |
Source Authority | bdq:sourceAuthority default = "10m-admin-1 boundaries" {[https://www.naturalearthdata.com/downloads/10m-cultural-vectors/10m-admin-1-states-provinces/]}; bdq:spatialBufferInMeters default = "3000" |
Specification Last Updated | 2024-08-30 |
Examples | [dwc:stateProvince="Tasmania", dwc:decimalLatitude="-42.85", dwc:decimalLongitude="146.75": Response.status=RUN_HAS_RESULT, Response.result=COMPLIANT, Response.comment="Input fields contain interpretable values"]; [dwc:stateProvince="Córdoba", dwc:decimalLatitude="-41.0525925872862", dwc:decimalLongitude="-71.5310546742521": Response.status=RUN_HAS_RESULT, Response.result=NOT_COMPLIANT, Response.comment="Input fields contain interpretable values but coordinates don't match dwc:stateProvince with buffer"] |
Source | ALA |
References | |
Example Implementations (Mechanisms) | |
Link to Specification Source Code | |
Notes | The geographic determination service is expected to return a list of names of first-level administrative divisions for geometries that the geographic point falls on or within, including a 3 km buffer around the administrative geometry. A match on any of those names should constitute a consistency, and dwc:countryCode should not be needed to make this determination, that is, this test does not attempt to disambiguate potential duplicate first-level administrative division names. The level of buffering may be related to the scale of the underlying GIS layer being used. At a global scale, typical map scales used for borders and coastal areas are either 1:3M or 1:1M (Dooley 2005, Chapter 4). Horizontal accuracy at those scales is around 1.5-2.5km and 0.5-0.85 km respectively (Chapman & Wieczorek 2020). |
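The buffered point-in-boundary check described in the specification can be sketched in plain Python. This is only an illustration under simplifying assumptions: all function names are hypothetical, the polygon is a simple ring of (lat, lon) vertices, and distances use a local equirectangular approximation (adequate at buffer scales of a few kilometres); a real implementation would use a GIS library with properly projected geometries.

```python
import math

M_PER_DEG_LAT = 111_320.0  # approximate metres per degree of latitude

def point_in_polygon(lat, lon, polygon):
    """Ray-casting point-in-polygon test; polygon is a list of (lat, lon) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        if (lon1 > lon) != (lon2 > lon):
            t = (lon - lon1) / (lon2 - lon1)
            if lat < lat1 + t * (lat2 - lat1):
                inside = not inside
    return inside

def distance_to_boundary_m(lat, lon, polygon):
    """Approximate minimum distance (metres) from the point to the polygon
    boundary, using a local equirectangular projection centred on the point."""
    m_per_deg_lon = M_PER_DEG_LAT * math.cos(math.radians(lat))
    px, py = lon * m_per_deg_lon, lat * M_PER_DEG_LAT
    best = float("inf")
    n = len(polygon)
    for i in range(n):
        alat, alon = polygon[i]
        blat, blon = polygon[(i + 1) % n]
        ax, ay = alon * m_per_deg_lon, alat * M_PER_DEG_LAT
        bx, by = blon * m_per_deg_lon, blat * M_PER_DEG_LAT
        dx, dy = bx - ax, by - ay
        seg_len2 = dx * dx + dy * dy
        # parameter of the closest point on segment a-b, clamped to [0, 1]
        t = 0.0 if seg_len2 == 0 else max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
        best = min(best, math.hypot(px - (ax + t * dx), py - (ay + t * dy)))
    return best

def coordinates_consistent(lat, lon, polygon, buffer_m=3000):
    """Consistent if the point is inside the polygon or within buffer_m of its boundary."""
    return point_in_polygon(lat, lon, polygon) or distance_to_boundary_m(lat, lon, polygon) <= buffer_m
```

The point of the sketch is only that a buffered match is an "inside OR within distance of the boundary" test; in production one would instead use a spatial library (e.g. Shapely or PostGIS) against the Natural Earth (or other) polygons.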
Comment by Lee Belbin (@Tasilee) migrated from spreadsheet: Unsure what spatial scale we should go down to
Comment by Arthur Chapman (@ArthurChapman) migrated from spreadsheet: Not a matter of resolution - some countries use Provinces (e.g. Canada) others States.
Comment by John Wieczorek (@tucotuco) migrated from spreadsheet: Why not just stick with dwc:stateProvince, since that is unambiguously defined as the first administrative unit smaller than country and there are over a hundred distinct names for first level divisions in the world?
Comment by Paula Zermoglio (@pzermoglio) migrated from spreadsheet: What about cases where no decimalLat or decimalLong are supplied but we have verbatimLat,Long or coords? In those cases, should this test be applied AFTER interpreting decimalLat and decimalLong?
Comment by Arthur Chapman (@ArthurChapman) migrated from spreadsheet: There is definitely an implied order (and perhaps we need to make an explicit order) for the tests - for example, if a record fails COUNTRY_COORDINATE_MISMATCH (VALIDATION_COORDINATE_COUNTRY_INCONSISTENT) then it will definitely fail this one as well, so if it fails the first then this test is redundant.
It is difficult to get a standard vocabulary of stateProvince names and boundaries that works.
Just to add to the difficulty of getting a standard vocabulary: data from some countries may be in another language, and may also use other alphabets, and I don't think those records should be flagged as INCONSISTENT.
@cgendreau if the stateProvince name string cannot be found in either the GIS data source or a thesaurus used to find variant and internationalized forms of the names, then the expectation would be that this test would return a result status of data/internal prerequisites not met, with no value for the result, rather than a result value of compliant or not compliant (noting that validation result values under the framework are only CONSISTENT or INCONSISTENT).
I wonder whether - rather than using TGN - we could use the ISO 3166-2 subdivision codes (https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) as the default authority. What do you think, @tucotuco?
I think name matching is not the actual issue here, and not the authority we are after for this test. This test should use a standardized stateProvince to do the lookup, and so should ideally pass through geography standardization first. TGN is probably not the right authority, because it can't do the spatial intersection needed. An authority service based on GADM would be great. I do not know of a production level one.
Level 1 layers seem to be available, but apparently not Level 2 and lower. However, the UN is apparently preparing a Level 2 DB at 1:1 million (https://www.unsalb.org/).
There is also the FAO GAUL (http://www.fao.org/geonetwork/srv/en/metadata.show%3Fid%3D12691), but I understand the licencing is a problem with its use: "The GAUL always maintains global layers with a unified coding system at country, first (e.g. departments) and second administrative levels (e.g. districts). Where data is available, it provides layers on a country by country basis down to third, fourth and lowers levels".
ESRI has a World Administrative Divisions layer (to first level) at https://www.arcgis.com/home/item.html?id=f0ceb8af000a4ffbae75d742538c548b. There are also some OpenStreetMap layers that I haven't looked at, but they appear to be vector layers only and in Mercator projections.
Thanks @tucotuco and @ArthurChapman: if there isn't a current service that can provide spatial intersection at Level 2, is this test not operable?
The Google Maps API can return geocoding information at administrative levels more specific than country - administrative_area_level_1 is the equivalent of dwc:stateProvince. So, this is theoretically operable.
Level 1 doesn't seem a big problem. Level 2 is still a long way off globally. Looking at the datasets available (https://www.unsalb.org/data?page=3) so far only about 27 out of 197 countries are covered. Google Maps seems a good option for Level 1 (I haven't checked - but do they include Level 2 at all (for example for the 27 countries that SALB have?))
Level 2 exists, but I do not know what the coverage is, nor do I know where to find out what the coverage is.
There is this...https://developers.google.com/maps/coverage.
Elsewhere in their developers guide they have the following - but there is no indication of coverage of any of these levels: administrative_area_level_1 indicates a first-order civil entity below the country level. Within the United States, these administrative levels are states. Not all nations exhibit these administrative levels. In most cases, administrative_area_level_1 short names will closely match ISO 3166-2 subdivisions and other widely circulated lists; however this is not guaranteed as our geocoding results are based on a variety of signals and location data.
administrative_area_level_2 indicates a second-order civil entity below the country level. Within the United States, these administrative levels are counties. Not all nations exhibit these administrative levels.
administrative_area_level_3 indicates a third-order civil entity below the country level. This type indicates a minor civil division. Not all nations exhibit these administrative levels.
administrative_area_level_4 indicates a fourth-order civil entity below the country level. This type indicates a minor civil division. Not all nations exhibit these administrative levels.
administrative_area_level_5 indicates a fifth-order civil entity below the country level. This type indicates a minor civil division. Not all nations exhibit these administrative levels.
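For illustration, pulling the first-level names out of a (reverse) geocoding response in the documented `results`/`address_components` shape could look like the following. The sample payload is invented; only the response structure follows Google's published format.

```python
def extract_admin_names(geocode_response, level="administrative_area_level_1"):
    """Collect all names of the given administrative level from a geocoding
    response shaped like Google's documented results/address_components."""
    names = set()
    for result in geocode_response.get("results", []):
        for component in result.get("address_components", []):
            if level in component.get("types", []):
                names.add(component["long_name"])
    return names

# Invented sample payload in the documented response shape:
sample = {
    "results": [
        {
            "address_components": [
                {"long_name": "Tasmania", "short_name": "TAS",
                 "types": ["administrative_area_level_1", "political"]},
                {"long_name": "Australia", "short_name": "AU",
                 "types": ["country", "political"]},
            ]
        }
    ]
}

print(extract_admin_names(sample))  # {'Tasmania'}
```

Collecting a set of names (rather than a single value) matches the Notes above: a match on any returned first-level name constitutes consistency.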
I would bow to the experience of @ArthurChapman and @tucotuco on this. GADM seems to have good coverage of 'administrative areas by country' or world, so it would probably be country dependent whether 'level-2' info was relevant and/or available. Currently 3% of ALA records have a mismatched state vs coordinates.
There is another terminology that has been widely used in the past in biodiversity informatics (and should have been used instead of state/province and county in Darwin Core): primary division and secondary division. Administrative_area_level_1 corresponds to primary division, that is, the primary political subdivisions of a country-level geopolitical entity; administrative_area_level_2 corresponds to secondary division, that is, political subdivisions of the primary subdivisions of a country. This test is of dwc:stateProvince, and thus of primary division, or administrative_area_level_1. Users of this test would not expect results to differ depending on the GIS dataset chosen to implement it.
The default source authority should not be a parameter - different users would not have different use cases with different sources for this test (as they might with national lists of scientific names). We can certainly recommend potential data sources for implementors to use, but these are not parameters; parameters should be reserved for cases where different use cases would place different expectations on what data values are compliant, not for the different choices of differing accuracy and precision that implementors might make with the same intent.
@ArthurChapman and @Tasilee note that this test should never be parameterized with a secondary division/administrative area level 2 GIS data set, as this would make it a test applying to dwc:county, thus an entirely different test, not a different use case for this test.
Getty should not be specified in the documentation of this test in isolation, as it does not include the GIS shapes needed to compare the primary divisions with the coordinates. The Natural Earth data is one potential source of GIS data. However, there are substantial complexities in mapping the text string values for dwc:stateProvince found in the wild onto the labels of polygons in a GIS data file. Experience shows that implementation of this test is likely to be complex.
@chicoreus You are correct, and for this test we only need to go to Level 1. I like @tucotuco's suggestion of using the Google Maps API as probably the easiest and most efficient (and probably best maintained), but it is unlikely to have historic names. An alternative may be ESRI (https://www.arcgis.com/home/item.html?id=f0ceb8af000a4ffbae75d742538c548b), which is available at both large and small scales and has the ability for API building with its REST API (https://developers.arcgis.com/rest/). Included are attributes for name and ISO codes, along with notes identifying disputed boundaries and continent information.
@chicoreus. If we can get a good sourceAuthority that we can recommend (as bdq:sourceAuthority) then I agree that this one doesn't need to be parameterized. Would make it simpler.
Have changed Parameter to 'Google Maps API' for now but we need a link. Expected Response updated.
Would Geonames be an option? We use it to get admin1 names for an atlas project I run, but I'm not sure how up to date or reliable they are.
-- Ian Engelbrecht, Data Coordinator: Natural Science Collections Facility, South African National Biodiversity Institute, Pretoria
Geonames is indeed a viable data source, but the web service is too heavily throttled to be of use in production.
@ArthurChapman regardless of whether or not we can find a suitable source authority, this one should not be parameterized. There may be data sets of different quality for names of primary division level geopolitical subdivisions, but different use cases do not call for a parameter that would cause this test to perform differently between use cases (something that isn't true for national taxon lists, or year or depth ranges within the scope of some project). The geopolitical subdivisions of the world are what they are and have the history that they have, data sets represent them with differing levels of accuracy and precision (and so we should point implementors at potentially more precise and accurate data sets, but not parameterize tests like this).
Do we have a rationale for the specified 3km size of the buffer? Is this based on expected precision errors in the shapefiles, or on an expectation of nearshore localities being placed within the primary division (claims of extent of primary division jurisdiction out into border lakes or marine environments differs from one primary division to another (e.g. Massachusetts claims 3 miles)).
@chicoreus The 3km is an estimate of maximum coastline uncertainty at 1:5M. I would guess that this is a difficult test to implement, as buffering coastlines is not something many would be easily able to do. Perhaps one could use a maximum Euclidean distance from the coast. One paper I was just reading on the analysis of positional accuracy of linear features (Lawford 2006) gives the following:
Dataset | Source | Accuracy (metres) | Comment |
---|---|---|---|
1:25K | 1:25K maps & multiple other sources | 16 | |
1:250K | 1:250K maps & satellite imagery | 140 | |
1:1M | 1:250K data | 2000 | appears to be an error - see 1:2.5M, which is less; perhaps should be 700 given the progression |
1:2.5M | 1:250K data | 1400 | |
1:5M | 1:2.5M data derived from 1:250K data | 2800 | |
1:10M | 1:250K data & satellite imagery | 5600 |
In Chapman and Wieczorek (in prep) we cite: "The National Standard for Spatial Data Accuracy (NSSDA) (FGDC 1998) established a standard methodology for calculating the horizontal and vertical accuracy of printed maps, which state that 95% of all points must fall within a specified tolerance (1/30” for map scales larger than 1:20,000, and 1/50” for map scales smaller than or equal to 1:20,000)"
We also include a table which - among others - gives a horizontal accuracy at 1:1M of 500 meters (Geosciences Australia), 2,777 ft (USGS) and 1900 ft (FGDC). These are all lower than that cited in the paper above (i.e. 2000 meters, which may be an error and may be meant to be 700 meters given the progression).
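The NSSDA arithmetic quoted above is simple enough to make explicit: the ground-distance tolerance is the map tolerance (in inches) multiplied by the scale denominator. A small sketch (function name is mine, not from the standard):

```python
def horizontal_accuracy_m(scale_denominator, tolerance_inches=1 / 50):
    """Ground distance (metres) corresponding to a map tolerance in inches
    at the given scale; 1/50 inch is the NSSDA tolerance for map scales
    smaller than or equal to 1:20,000."""
    METRES_PER_INCH = 0.0254
    return scale_denominator * tolerance_inches * METRES_PER_INCH

# 1/50 inch at 1:1,000,000 is about 508 m on the ground, consistent with
# the ~500 m figure cited above for Geosciences Australia.
print(horizontal_accuracy_m(1_000_000))
```

Note how quickly this grows with scale: at 1:5M the same tolerance is about 2.5 km, which is in line with the 3 km default buffer.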
@chicoreus Parameterization removed.
I have added the GeoNames API into the References.
Thanks @ArthurChapman.
This issue reminds me of my comment about false positives and negatives. Buffering is to counter precision. My point here is that we expect the test to be run with a lat/long and a polygon - each of the three in theory has a precision (and an accuracy, but that is another matter), and the accuracy of the test result is based on these precisions. We are therefore flagging or not flagging an issue based at least in part on these parameters. The law of averages would suggest the probability of a false positive or false negative is equal. My point is that the test needs to note something like "test results are partly dependent on dwc:precision and the precision associated with bdq:sourceAuthority".
I was about to align the Expected Response with a fixed source authority, as this test is NOT (currently) parameterized. But... this is a classic case where a decision from @ArthurChapman and @tucotuco is required.
In the absence of a sustainable GADM API (which GBIF might actually support, though I have not asked that question) I think I would go with Google Maps Reverse Geocoding API. https://developers.google.com/maps/documentation/javascript/examples/geocoding-reverse
This is indeed difficult. As a summary, I would suggest:
Sorry - but I don't think this adequately answers your question, @Tasilee. If I had to make a decision, I would probably favour GADM because it is the most used currently.
I sent mine at the same time @tucotuco responded so had not seen his response. I would concur with him and his more up-to-date knowledge in this area. So my 2. above.
I have added the Google API in the References
Just looking around - the ArcGIS Rest API is a GADM Rest API - so they are probably the same (https://gfwpro-gis.globalforestwatch.org/arcgis/rest/services/admin/MapServer/1)
Thanks @tucotuco and @ArthurChapman. I just had a play with Google reverse geocoding and it looks good. Can one of you edit accordingly?
I'm almost certainly not going to implement this for large data sets with a service invocation; I'm almost certain to use a local geospatial database. Services are available, but experience tells us they aren't a good choice for implementation here (as opposed to scientific names, where local caching is effective for improving performance for repeated values, and name data sets are in continual flux).
Change "EXTERNAL_PREREQUISITES_NOT_MET if the specified source authority service was not available" to "EXTERNAL_PREREQUISITES_NOT_MET if an external authority service was not available".
Given @chicoreus comment - I wonder if perhaps we need to Parameterize this one. I can see cases where local authorities may wish to use a different source Authority. OBIS, for example, may have a particular source they use, etc. Not sure! We would then have some options in the references where I have labelled some as "Potential sources of geometries". We may make a note that we prefer the Google Maps Reverse Geocoding API
The rule (article 25, subsection D, paragraph 3 :) says if there IS a choice in implementation, then it is Parameterized.
We have dwc:geodeticDatum in the Information Elements but not explicitly in the Expected Response. It has been included in the test data but is therefore not explicitly used.
Was:
EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority service was not available; INTERNAL_PREREQUISITES_NOT_MET if the values of dwc:decimalLatitude, dwc:decimalLongitude, and dwc:stateProvince are EMPTY; COMPLIANT if the geographic coordinates fall on or within the bdq:spatialBufferInMeters boundary of the geometry of the given dwc:stateProvince; otherwise NOT_COMPLIANT
I propose:
EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority was not available; INTERNAL_PREREQUISITES_NOT_MET if the values of dwc:decimalLatitude, dwc:decimalLongitude, or dwc:stateProvince are EMPTY; COMPLIANT if the geographic coordinates fall on or within the bdq:spatialBufferInMeters boundary of the geometry of the given dwc:stateProvince after coordinate reference system transformations, if any, have been accounted for; otherwise NOT_COMPLIANT
That suggestion seems reasonable to me.
@ArthurChapman Likewise, with dwc:geodeticDatum retained as an information element. @tucotuco, with that text for the specification, we should remove the text "We have also made the assumption that the use of a spatial buffer obviates the need for references to the SRS." from the notes.
Given that we've parameterized spatialBufferInMeters, and since a user could set this to a small enough value that the difference between one possible datum and another becomes important, @tucotuco's proposed text seems more robust.
Edited accordingly.
I suggest the Description:
'Do the geographic coordinates fall on or within the boundary from the bdq:sourceAuthority for the given dwc:stateProvince or within the distance given by bdq:spatialBufferInMeters outside that boundary?'
in place of:
'Do the geographic coordinates fall on or within the bdq:spatialBufferInMeters boundary of the given dwc:stateProvince?'
I suggest the Expected Response:
'EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority was not available; INTERNAL_PREREQUISITES_NOT_MET if the values of dwc:decimalLatitude, dwc:decimalLongitude, or dwc:stateProvince are EMPTY; COMPLIANT if the geographic coordinates fall on or within the boundary from the bdq:sourceAuthority for the given dwc:stateProvince (after coordinate reference system transformations, if any, have been accounted for), or within the distance given by bdq:spatialBufferInMeters outside that boundary; otherwise NOT_COMPLIANT'
in place of:
'EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority was not available; INTERNAL_PREREQUISITES_NOT_MET if the values of dwc:decimalLatitude, dwc:decimalLongitude, or dwc:stateProvince are EMPTY; COMPLIANT if the geographic coordinates fall on or within the bdq:spatialBufferInMeters boundary of the geometry of the given dwc:stateProvince after coordinate reference system transformations, if any, have been accounted for; otherwise NOT_COMPLIANT'
Do we need to account for uninterpretable values for INTERNAL_PREREQUISITES_NOT_MET? As in
EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority was not available; INTERNAL_PREREQUISITES_NOT_MET if the values of dwc:decimalLatitude, dwc:decimalLongitude are EMPTY or not valid, or dwc:stateProvince is EMPTY; COMPLIANT if the geographic coordinates fall on or within the boundary from the bdq:sourceAuthority for the given dwc:stateProvince (after coordinate reference system transformations, if any, have been accounted for), or within the distance given by bdq:spatialBufferInMeters outside that boundary; otherwise NOT_COMPLIANT.
@Tasilee Yes, I think so. Good catch.
Wouldn't that take another test though. One would need to test against something to decide if it was valid or not. I don't think this is what we were intending. INTERNAL_PREREQUISITES_NOT_MET is EMPTY is just testing if something is there or not - if nothing there the test can't be run. To determine if it is valid or not would require some sort of further testing
It would require multiple other tests, and I don't think this is an isolated example. The coordinates might have to be interpreted first. The point is that the test can not be run meaningfully unless all of the right conditions are met, and having real coordinates is definitely a requirement for running the test.