sul-dlss / cocina-models

Cocina repository data model (implemented in Ruby)
https://sul-dlss.github.io/cocina-models/

Add coordinate validator for map data #663

Closed thatbudakguy closed 4 months ago

thatbudakguy commented 6 months ago

Similar to #662 but related to map data (scanned map images) instead.

We have lots of problems with scanned map data whose coordinates are in invalid formats; see for example:

@justinlittman also made a report that shows the breadth of coordinate data formats in scanned maps.

This causes problems when the maps are released but fail to index; the only way to know that has happened is that they don't show up in EarthWorks.

justinlittman commented 5 months ago

@thatbudakguy Can you specify what the validation would be?

ndushay commented 5 months ago

Discussion implied two open questions: how the validators work, and what is wished for.

kimdurante commented 5 months ago

For EarthWorks indexing and display purposes, all coordinates should be represented using the WGS84 (EPSG:4326) geographic coordinate system. North and South latitudes are valid within the range [-90, 90]. West and East longitudes are valid within the range [-180, 180].

Examples:

Valid coordinate representation for the world [W,E,N,S]:

DMS: W 180°--E 180°/N 90°--S 90°
Decimals: W 180.0--E 180.0/N 090.0--S 090.0
Envelope: -180.0 -90.0 180.0 90.0

Valid coordinate representation for California:

DMS: W 124°28ʹ00ʺ--W 114°07ʹ00ʺ/N 42°00ʹ00ʺ--N 32°31ʹ00ʺ
Decimals: W 124.48--W 114.13/N 042.01--N 032.53
Envelope: -124.48 32.53 -114.13 42.01

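To make the range rule concrete, here is a minimal sketch (not project code; the names are hypothetical) of checking an envelope's bounds against those WGS84 ranges:

    # Minimal sketch, not project code: check the W/S/E/N bounds of an
    # envelope against the WGS84 ranges described above.
    def valid_wgs84_envelope?(west, south, east, north)
      [west, east].all? { |lon| lon.between?(-180, 180) } &&
        [south, north].all? { |lat| lat.between?(-90, 90) }
    end

    valid_wgs84_envelope?(-124.48, 32.53, -114.13, 42.01)  # => true
    valid_wgs84_envelope?(-124.48, 32.53, -114.13, 242.01) # => false (latitude out of range)
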
justinlittman commented 5 months ago

If we created this validator, all existing invalid coordinates would need to be fixed. Does that seem feasible?

kimdurante commented 5 months ago

Since this issue affects scanned maps and not geospatial data, I have been talking with the Maps team about possible ways to batch-update records with missing or invalid coordinates. Many older maps do not contain coordinates, or use a different projection, so we could perhaps generate approximate valid coordinates from the place name.

justinlittman commented 5 months ago

In cocina, this would be in description > subject > type = "map coordinates"?

kimdurante commented 5 months ago

Yes
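
For orientation, here is a minimal sketch of that assumed shape, illustrative only, reusing the California coordinates from above:

    # Illustrative sketch of a "map coordinates" subject inside a cocina
    # description hash; not taken from a real record.
    description = {
      subject: [
        {
          type: 'map coordinates',
          value: 'W 124°28ʹ00ʺ--W 114°07ʹ00ʺ/N 42°00ʹ00ʺ--N 32°31ʹ00ʺ'
        }
      ]
    }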

justinlittman commented 5 months ago

The IIIF NavPlace element is created by https://github.com/iiif-prezi/osullivan/blob/main/lib/iiif/v3/presentation/nav_place.rb.

Would the validator (1) validate that every map coordinates subject can be parsed by this code and is within the valid range, (2) validate that if a map coordinates subject can be parsed by this code, it is within the valid range, or (3) something else?

kimdurante commented 5 months ago

I am not familiar with this code. But I would lean towards option 2, validate that if a map coordinates subject can be parsed by this code, it is within the valid range.
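
A minimal sketch of option 2 (all names hypothetical; the real parsing would come from the nav_place.rb code above), showing that unparseable values pass through while parseable but out-of-range values fail:

    # Sketch of option 2: only range-check coordinates that parse; unparseable
    # legacy strings pass through. parse_bounds stands in for the nav_place.rb
    # parsing logic; here it only handles space-separated decimal bounds.
    def parse_bounds(value)
      nums = value.scan(/-?\d+(?:\.\d+)?/).map(&:to_f)
      nums.size == 4 ? nums : nil # assumed [west, south, east, north], or unparseable
    end

    def map_coordinates_valid?(value)
      bounds = parse_bounds(value)
      return true if bounds.nil? # option 2: unparseable values are not rejected

      west, south, east, north = bounds
      [west, east].all? { |lon| lon.between?(-180, 180) } &&
        [south, north].all? { |lat| lat.between?(-90, 90) }
    end

    map_coordinates_valid?('-124.48 32.53 -114.13 42.01') # => true
    map_coordinates_valid?('no coordinates here')         # => true (unparseable)
    map_coordinates_valid?('-124.48 32.53 -114.13 242.0') # => false (out of range)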

lwrubel commented 4 months ago

At what point are the invalid coordinates being created?

We initially create the map coordinates subject in the generate-descriptive step by taking the coordinates in the ISO19139.xml's MD_Metadata/identificationInfo/MD_DataIdentification/extent/EX_Extent/geographicElement/EX_GeographicBoundingBox element. There is logic in the workflow step to convert these decimal coordinates into degrees/minutes/seconds. The intent there is to create a human-readable subject in the PURL.
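
For reference, a minimal sketch (not the actual workflow code) of the kind of decimal-to-DMS conversion described here:

    # Minimal sketch of a decimal-degrees -> DMS conversion; a hypothetical
    # helper, not the generate-descriptive implementation.
    def decimal_to_dms(decimal, positive_dir, negative_dir)
      dir = decimal.negative? ? negative_dir : positive_dir
      decimal = decimal.abs
      degrees = decimal.floor
      minutes = ((decimal - degrees) * 60).floor
      seconds = (((decimal - degrees) * 60 - minutes) * 60).round
      format('%s %d°%02dʹ%02dʺ', dir, degrees, minutes, seconds)
    end

    decimal_to_dms(-124.48, 'E', 'W') # => "W 124°28ʹ48ʺ"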

Later, https://github.com/iiif-prezi/osullivan/blob/main/lib/iiif/v3/presentation/nav_place.rb creates a navPlace for EarthWorks. It first converts the degrees/minutes/seconds back to decimal, using geo_coord.

The ticket requests that we identify invalid coordinates at cocina-creation time, via a cocina validator that would prevent the cocina from being created in generate-descriptive. Alternatively, since we're already converting the coordinates, we could raise an error in the workflow step if they are determined to be invalid?

Or is there benefit in doing this as a cocina validator specifically?

kimdurante commented 4 months ago

This issue has to do with scanned maps, not geospatial data, so there is no ISO involved. In many cases, coordinates for scanned maps are either missing or invalid, such as this item: https://purl.stanford.edu/bb847xp3575

I think most, if not all, of the coordinates for our geospatial data should be valid. Errors would be raised earlier on in accessioning if they were missing, malformed, or out of bounds.

lwrubel commented 4 months ago

Thanks for explaining that; now I understand that these don't go through the gisAssembly workflow, and why we need validation somewhere else.

lwrubel commented 4 months ago

@kimdurante just a heads up that as we proceed on this and are able to identify records with invalid coordinates, we'll need to remediate them before rolling out the validation.

thatbudakguy commented 4 months ago

In Solr, the target field for scanned-map coordinate metadata is called solr_geom (the field was renamed in GeoBlacklight versions 4 and up, but we're on 3.x). The field isn't individually defined; instead, it uses the dynamic field configuration for *_geom, which gives it a type of location_rpt:

    <dynamicField name="*_geom" type="location_rpt" stored="true" indexed="true"/>

That field type is defined using Solr's impressively named SpatialRecursivePrefixTreeFieldType, or RPT for short, which does... a lot of things:

    <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
               geo="true" distErrPct="0.025" maxDistErr="0.001" distanceUnits="kilometers"/>

From my read of the docs, geo is set to true and format is unset (which defaults to WKT, or "well-known text"), which means I think the field expects input that looks like POLYGON((1 8, 1 9, 2 9, 2 8, 1 8)). Comments in the GeoBlacklight Solr config seem to confirm this:

<!-- Spatial Field Type: Represents the extent of the resource and powers map search functionality.
      Value can be any valid WKT or ENVELOPE String:
        <field name="locn_geometry">POLYGON((1 8, 1 9, 2 9, 2 8, 1 8))</field>
        <field name="locn_geometry">ENVELOPE(-117.312, -115.39, 84.31, 83.1)</field> -->

At index time, we actually generate these WKT values using a helper from stanford-mods:

    to_field 'solr_geom', stanford_mods(:coordinates_as_envelope)

The code there looks pretty similar to the validator we just wrote:

      # @return [Array{Stanford::Mods::Coordinate}] valid coordinates as objects
      def coordinates_objects
        coordinates.map { |n| Stanford::Mods::Coordinate.new(n) }.select(&:valid?)
      end

      # @return [Array{String}] values suitable for solr SRPT fields, like "ENVELOPE(-16.0, 28.0, 13.0, -15.0)"
      def coordinates_as_envelope
        coordinates_objects.map(&:as_envelope).compact
      end

...except that we implemented our own Coordinate class with a different regex:

        regex = Regexp.union(
          # DMS form, e.g. "W 124°28ʹ00ʺ" (minutes and seconds optional)
          /(?<dir>[NESW])\s*(?<deg>\d+)[°⁰º](?:(?<min>\d+)[ʹ'])?(?:(?<sec>\d+)[ʺ"])?/,
          # decimal degrees form, e.g. "W 124.48"
          /^\s*(?<dir>[NESW])\s*(?<deg>\d+(?:[.]\d+)?)\s*$/
        )

...and it looks like we're converting everything to decimal degrees there.
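
To make that concrete, a minimal sketch (hypothetical helper, not the actual Coordinate class) of the DMS-to-decimal conversion implied above:

    # Sketch of a DMS -> decimal-degrees conversion; W and S become negative.
    # Hypothetical helper, not the Coordinate implementation.
    def dms_to_decimal(dir, deg, min = 0, sec = 0)
      value = deg + min / 60.0 + sec / 3600.0
      %w[W S].include?(dir) ? -value : value
    end

    dms_to_decimal('W', 124, 28) # => approximately -124.467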

thatbudakguy commented 4 months ago

Based on Slack discussion, we've decided to abandon a cocina validator for coordinates because: