sul-dlss / exhibits

Stanford University Libraries online exhibits showcase
https://exhibits.stanford.edu

Index geo data for MODS extension #274

Closed jcoyne closed 7 years ago

jcoyne commented 8 years ago

Such as cw222pt0426:

<extension displayLabel="geo">
  <rdf:RDF>
    <rdf:Description rdf:about="http://purl.stanford.edu/cw222pt0426">
      <dc:format>image/jpeg</dc:format>
      <dc:type>Image</dc:type>
      <gml:boundedBy>
        <gml:Envelope>
          <gml:lowerCorner>-122.191292 37.4063388</gml:lowerCorner>
          <gml:upperCorner>-122.149475 37.4435369</gml:upperCorner>
        </gml:Envelope>
      </gml:boundedBy>
    </rdf:Description>
  </rdf:RDF>
</extension>

atz commented 8 years ago

Implementation of XML parsing to be done in: https://github.com/sul-dlss/stanford-mods/tree/master/lib/stanford-mods

atz commented 8 years ago

@mejackreed and @drh-stanford: While it seems we previously did not generate multiple geo extension envelopes (perhaps based on the faulty assumption that a Solr field could only accommodate one such value), we are now confident we can index multiple envelopes per record. I'm looking for:

atz commented 8 years ago

@caaster You may want to keep this question in mind while querying archivists and users.

caaster commented 8 years ago

@atz -- you mean the multiples issue, correct? I will definitely ask Kim (our SUL geo MD expert) this as a general question now, since I have an open thread with her regarding MD encoding for various forms of geo loc data

atz commented 8 years ago

Right. Indexing is designed to accommodate multiple SRPT boxes per record. But we need to know how that would look in MODS geo extension form.
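
As a rough illustration of what feeding several boxes to the index could involve, the sketch below turns gml:Envelope corner strings into one Solr-style value per box. It assumes the corners are "longitude latitude" pairs (as in the cw222pt0426 sample) and that the spatial field accepts the ENVELOPE(minX, maxX, maxY, minY) syntax. The method name envelope_to_solr is hypothetical and not part of stanford-mods.

```ruby
# Hypothetical helper (not from stanford-mods): convert one gml:Envelope,
# given as lower/upper corner strings, into ENVELOPE(minX, maxX, maxY, minY).
def envelope_to_solr(lower_corner, upper_corner)
  # Assumption: each corner is "longitude latitude", separated by whitespace.
  min_x, min_y = lower_corner.split.map { |n| Float(n) }
  max_x, max_y = upper_corner.split.map { |n| Float(n) }
  "ENVELOPE(#{min_x}, #{max_x}, #{max_y}, #{min_y})"
end

# One [lower, upper] pair per envelope; a record with several envelopes
# would simply contribute several pairs to this array.
envelopes = [
  ['-122.191292 37.4063388', '-122.149475 37.4435369']
]

boxes = envelopes.map { |lower, upper| envelope_to_solr(lower, upper) }
puts boxes
```

Because the result is an array, a multi-valued field would receive one string per envelope, however the multiples end up being encoded in MODS.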

caaster commented 8 years ago

@atz I will find out & document this

atz commented 8 years ago

Nokogiri has issues navigating our current extension XML. I'm using https://purl.stanford.edu/cw222pt0426.mods as a test fixture, stripped down to:

<mods xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://www.loc.gov/mods/v3" version="3.5"
      xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-5.xsd">
  <extension displayLabel="geo">
    <rdf:RDF xmlns:gml="http://www.opengis.net/gml/3.2/" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description rdf:about="http://purl.stanford.edu/cw222pt0426">
       <dc:format>image/jpeg</dc:format>
       <dc:type>Image</dc:type>
       <gml:boundedBy>
        <gml:Envelope>
           <gml:lowerCorner>-122.191292 37.4063388</gml:lowerCorner>
           <gml:upperCorner>-122.149475 37.4435369</gml:upperCorner>
        </gml:Envelope>
       </gml:boundedBy>
      </rdf:Description>
    </rdf:RDF>
  </extension>
</mods>

I can build a stanford-mods object from that, but I get a namespace issue with gml. The gml prefix is not declared on the root node, so Nokogiri has not registered it. The result is the following behavior:

(byebug) puts @mods_ng_xml.extension.xpath('//rdf:RDF/rdf:Description')
<rdf:Description rdf:about="http://purl.stanford.edu/cw222pt0426">
            <dc:format>image/jpeg</dc:format>
            <dc:type>Image</dc:type>
            <gml:boundedBy>
              <gml:Envelope>
                <gml:lowerCorner>-122.191292 37.4063388</gml:lowerCorner>
                <gml:upperCorner>-122.149475 37.4435369</gml:upperCorner>
              </gml:Envelope>
            </gml:boundedBy>
            </rdf:Description>

(byebug) puts @mods_ng_xml.extension.xpath('//rdf:RDF/rdf:Description/gml:boundedBy')
*** Nokogiri::XML::XPath::SyntaxError Exception: Undefined namespace prefix: //rdf:RDF/rdf:Description/gml:boundedBy

(byebug) puts @mods_ng_xml.extension.xpath('//gml:boundedBy')
*** Nokogiri::XML::XPath::SyntaxError Exception: Undefined namespace prefix: //gml:boundedBy

It is clear that nobody has yet tried to work with these documents as they stand, because they are insufficiently self-describing for our usual methods. The consumer needs to know exactly when to apply a novel namespace, or XPath navigation breaks. The namespace is declared in the document, but only on a nested node, which makes matters worse: you need to either anticipate the extra-special declaration or look for it at every node. Parsing for navigation is now a multi-pass operation, and if the same approach is repeated, welcome to XML hell.

The namespace should appear in the root node, for it to be intelligibly used deeper in the XML tree. Can anybody explain why that approach was not pursued? Presumably that complicates the test suite that pretends all mods records have the same basic NS declaration, but in reality, clearly different records are using different namespaces, so...

We need to decide whether to go with:

atz commented 8 years ago

So here is what it looks like if we allow the parser special foreknowledge of the namespace:

@mods_ng_xml.extension.xpath(
  '//rdf:RDF/rdf:Description/gml:boundedBy',
  'gml' => 'http://www.opengis.net/gml/3.2/',
  'rdf' => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
)
   <gml:boundedBy>
      <gml:Envelope>
        <gml:lowerCorner>-122.191292 37.4063388</gml:lowerCorner>
        <gml:upperCorner>-122.149475 37.4435369</gml:upperCorner>
      </gml:Envelope>
   </gml:boundedBy>

Note: to use a single xpath statement like that you also need to explicitly redeclare invoked namespaces that were already registered with the root node, hence the rdf argument. This is a pretty serious impediment to extension down the road, and likely to introduce confusion when different versions of the underlying namespace are used.
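
One way to limit the repetition is to keep the prefix-to-URI map in a single constant and pass it to every query. The sketch below is only an illustration under stated assumptions: it uses REXML (bundled with Ruby) instead of Nokogiri purely so that it runs without extra gems, and the names GEO_NAMESPACES and envelope_corners are hypothetical. The URIs and XML are copied from the fixture above.

```ruby
require 'rexml/document'

# Hypothetical constant: one place to declare the prefixes used in queries.
GEO_NAMESPACES = {
  'rdf' => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
  'gml' => 'http://www.opengis.net/gml/3.2/'
}

xml = <<~XML
  <mods xmlns="http://www.loc.gov/mods/v3"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <extension displayLabel="geo">
      <rdf:RDF xmlns:gml="http://www.opengis.net/gml/3.2/">
        <rdf:Description rdf:about="http://purl.stanford.edu/cw222pt0426">
          <gml:boundedBy>
            <gml:Envelope>
              <gml:lowerCorner>-122.191292 37.4063388</gml:lowerCorner>
              <gml:upperCorner>-122.149475 37.4435369</gml:upperCorner>
            </gml:Envelope>
          </gml:boundedBy>
        </rdf:Description>
      </rdf:RDF>
    </extension>
  </mods>
XML

# Hypothetical helper: returns a [lower, upper] pair of corner strings for
# every gml:Envelope, so several gml:boundedBy elements are handled too.
def envelope_corners(doc)
  path = '//rdf:RDF/rdf:Description/gml:boundedBy/gml:Envelope'
  REXML::XPath.match(doc, path, GEO_NAMESPACES).map do |envelope|
    %w[lowerCorner upperCorner].map do |name|
      REXML::XPath.first(envelope, "gml:#{name}", GEO_NAMESPACES).text
    end
  end
end

doc = REXML::Document.new(xml)
corners = envelope_corners(doc)
p corners
```

The same idea carries over to the Nokogiri call shown above, which also takes a prefix => URI hash. It still hard-codes one version of each namespace URI, so it does not address the concern about differing namespace versions.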

mejackreed commented 8 years ago

@atz I think @caaster and @drh-stanford might be best equipped to help find specific examples of metadata.

As I understand things, I'm not really sure how or when we go from a GeoNames URI to the MODS geo extension. In some cases both are used; in others, one or the other.

The geo extension is used to determine the boundaries or geometry of a given object. Sometimes this correlates to a geographic subject term that is created as a GeoNames URI. Additional geographic subjects may be included as GeoNames URIs but may not be geographically representative of the object (e.g. traditional geographic cataloging listing a hierarchy of geographic subject terms).

I would also push back here in saying that this work can hopefully better inform our cataloging practice. @caaster should be able to locate examples that are driving the utility of this work also.

drh-stanford commented 8 years ago

I would defer to Kim on the metadata changes needed, and also I would like to see an example of an item that has multiple bounding boxes.

Since we use gml:Envelope, you cannot include multiple geometries within the envelope, but gml:boundedBy may be flexible enough to include a multi-polygon. I would have to dig into the spec to verify that. Otherwise, you'd need some way to indicate an array -- maybe just multiple gml:boundedBy statements.

Hope that helps.

@kimdurante

drh-stanford commented 8 years ago

Also, re: placenames, we use dc:coverage for the placenames and it's 0 or more elements. You can see an example here: https://purl.stanford.edu/cg357zz0321.mods

atz commented 8 years ago

This ticket is pending the stanford-mods PR: https://github.com/sul-dlss/stanford-mods/pull/87

kimdurante commented 8 years ago

If I am understanding this issue correctly, the namespaces needed would be:

http://www.opengis.net/gml/3.2/ AND http://purl.org/dc/elements/1.1/

Do I need to add these to iso2mods.xsl so they can be referenced from the root element?

atz commented 8 years ago

Thanks. The namespace issue is resolved (via the 3rd option).

@kimdurante: The question outstanding is representing multiple geo boxes in a record. Do we now or will we use:

atz commented 8 years ago

@drh-stanford Here's an example of a record with many subject/geographic geonames:

https://github.com/OpenGeoMetadata/edu.stanford.purl/blob/master/rg/765/zt/7618/mods.xml#L57-L98
https://purl.stanford.edu/rg765zt7618.mods

When indexed for Spotlight, those will be resolved via GeoNames into an equal number of boxes, plus the single box extracted from the geo extension (which may be duplicative). They are fed into a blacklight-heatmaps display, so overlaps and sub/supersets are fine.

caaster commented 8 years ago

I am hesitant about this discussion, because I think this may well represent geo MD edge cases at the moment. The MD example @atz cited above is for EW content. A curator could, in the future, wish to add EW content to a Spotlight exhibit, I suppose -- but this is not high priority to focus on. Happy to discuss why from a service pov a little more -- please weigh in if I am misunderstanding the intent here.

atz commented 8 years ago

Yes, the point is to anticipate the edge case now. Ideally we call our shot and avoid having to retouch anything in the future, but at the very least, we shouldn't break when encountering such data. But I need a domain expert to assert which of those 5 options we are using/expecting.

caaster commented 8 years ago

@atz honestly, I think it is higher priority to make sure the geo plugin can also accommodate lat/long coordinates -- @snydman can comment on this (or I can -- I can even create a git ticket). The many-names issue you mention really is an edge case for now, and I think other tickets are higher priority. That said, I am meeting with Kim Durante tomorrow afternoon to discuss this ticket thread and get her recommendations. I know from briefly talking to her this afternoon that this is very much new territory from her perspective as well.

kimdurante commented 8 years ago

@atz I had to get some background on the work you're doing in order to understand this issue.

I would recommend the following: None: Prohibit multiple bounding boxes.

Display of an item's geographic extent should first check the MODS geoExtension for the existence of a bounding box (gml:Envelope) or point location (gml:Point). If neither value exists, fall back to the bounding extent derived from GeoNames.
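
The precedence described above could be sketched roughly as follows. The record shape (a Hash with optional :envelope, :point, and :geonames_box keys) and the method name display_extent are assumptions made only for illustration; none of them come from stanford-mods or the exhibits code, and the GeoNames value is a placeholder.

```ruby
# Hypothetical sketch of the lookup order: geo extension box first, then geo
# extension point, then a GeoNames-derived extent. Returns nil if none exist.
def display_extent(record)
  if record[:envelope]
    { source: :geo_extension, kind: :box, value: record[:envelope] }
  elsif record[:point]
    { source: :geo_extension, kind: :point, value: record[:point] }
  elsif record[:geonames_box]
    { source: :geonames, kind: :box, value: record[:geonames_box] }
  end
end

# Corner strings are from the cw222pt0426 sample; the GeoNames value is a
# placeholder, not real data.
with_box = {
  envelope: ['-122.191292 37.4063388', '-122.149475 37.4435369'],
  geonames_box: 'placeholder GeoNames extent'
}
names_only = { geonames_box: 'placeholder GeoNames extent' }

p display_extent(with_box)[:source]   # geo extension wins when present
p display_extent(names_only)[:source] # otherwise GeoNames is used
```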

caaster commented 8 years ago

@atz -- here is a fixture object for this specific ticket: druid:cm896kp1291

The difference now is, going forward we will use the MODS geo extension for all geolocation MD -- whether point data like this example/ticket spec (lat/long only) -- or bounding boxes. So, the fixture object above uses the MODS geo extension to encode the point data.