papyri / sosol

The Son of Suda On Line
GNU General Public License v3.0

Improving DCLP Metadata #325

Open samosafuz opened 2 years ago

samosafuz commented 2 years ago

I've noticed some problems with the DCLP metadata mask and either the xml it produces or the xslt that processes it for PN. My particular area of concern is the element <term> which is the great-grandchild of <profileDesc>:

<profileDesc>
  <textClass>
    <keywords>
      <term>

After poring over LDAB documentation regarding its keywords and classification system, I now better understand why DCLP keyword metadata is structured the way it is. I'm also now in a position to diagnose what's unsatisfactory and to make concrete recommendations.

<term> is often elaborated upon by @type (e.g., description, religion, culture), but the present handling of these is suboptimal. As a reference point for what follows, here is the current SoSOL metadata mask for DCLP:

Screen Shot 2022-09-06 at 2 44 36 PM

I note the following issues:

  1. A string added via the Genre field is tagged <term type="description">.
     1a. PN xslt at present does not process items tagged <term type="description">, so this field is deceptive. I've been moving terms like epic or philosophy to the Genre field, only to see them vanish from PN as a result.
  2. A string added via the Keyword field is tagged <term> (with no attribute).
     2a. PN xslt at present processes <term> (with no attribute) so that it appears in PN under Genre. This also strikes me as somewhat deceptive, or at least the xml is not as cleanly encoded as it could be.
  3. When set next to literary genres such as history, epic, comedy, philosophy (etc.), poetry and prose make a broader distinction. But at present the xml for all of these is <term> (with no attribute), except where dutiful editors like myself have moved genre terms such as epic or philosophy to the Genre field, with the result described in 1a.
     3a. I therefore suggest implementing @type="format", whose values could be poetry or prose, with undetermined as a third option. These values should form an authority list, accessed via dropdown menu in SoSOL. (NB: poetry and prose only appear as <term> in files whose culture is literature, i.e., <term type="culture">literature</term>.)
     3b. I would also suggest implementing @type="genre" across DCLP to tag the wide range of literary genres. Doing so would allow us to dispense with <term type="description">, which is what the Genre field of the metadata mask currently outputs (but which current xslt does not process). It would also free <term> (with no attribute) for descriptive terminology that is not a genre, e.g. calendar, tachygraphy, exercise, drawing, title, etc.
     3c. IMO, it is ok if the xslt prints @type="genre" and @type="format" together in PN under Genre, so long as the xml disambiguates. The display would then be basically the same as under the current system (where both are tagged <term>, without attribute).
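
To make 3a and 3b concrete, here is a rough sketch of what the keywords for a single literary papyrus might look like under the proposed scheme. It is illustrative only: the term values are examples taken from the lists above, not from a real DCLP file, and any attributes on the enclosing elements are omitted.

```xml
<!-- Illustrative sketch of the proposed encoding; values are examples, not real data -->
<profileDesc>
  <textClass>
    <keywords>
      <!-- existing practice: culture -->
      <term type="culture">literature</term>
      <!-- proposed in 3a: poetry, prose, or undetermined -->
      <term type="format">poetry</term>
      <!-- proposed in 3b: replaces type="description" from the Genre field -->
      <term type="genre">epic</term>
      <!-- untyped term, reserved for descriptive terminology that is not a genre -->
      <term>exercise</term>
    </keywords>
  </textClass>
</profileDesc>
```

With the xml disambiguated like this, the xslt would remain free to print format and genre together under Genre in PN.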

For illustrations of 1, 1a, 2, and 2a, see the following two images of 171900:

Screen Shot 2022-09-02 at 2 12 06 PM
Screen Shot 2022-09-02 at 2 12 22 PM

For illustrations of 3, see the following two images of 60408:

Screen Shot 2022-09-06 at 3 14 20 PM
Screen Shot 2022-09-06 at 3 14 58 PM

Further improvements are possible, too:

  1. The options for <term type="culture"> should also be governed by an authority list, accessed via dropdown menu in SoSOL: the four options are literature, science, religion, and art.
     1a. Sometimes two options are listed in a single element (i.e., <term type="culture">science or religion</term>), which the authority list will require splitting up. We will therefore also want to allow for more than one item to be tagged <term type="culture">, in which case xslt will have to add Or between them for display in PN.
  2. All keyword fields should have the same tickbox for 'unclear' (adding @cert="low" to the xml) that the Provenance section of the metadata mask currently has. This change would also require xslt to print (?) after the keyword in question.
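
As a starting point for discussion, the display logic for multiple culture terms and for @cert="low" could be sketched in XSLT 1.0 roughly as follows. This is a hypothetical, self-contained stylesheet, not the actual PN xslt: the template structure and the text output method are my assumptions, and so is the assumption that the source files use the TEI namespace. The separator string and its capitalisation would follow whatever PN's display conventions require.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch, not the actual PN stylesheet.
     Assumed input inside keywords:
       <term type="culture">science</term>
       <term type="culture" cert="low">religion</term>
     Intended display: science or religion (?) -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <xsl:output method="text"/>

  <!-- Suppress all other text so that only the culture line is emitted -->
  <xsl:template match="text()"/>

  <xsl:template match="tei:keywords">
    <xsl:for-each select="tei:term[@type='culture']">
      <!-- Join the second and later culture terms with a separator -->
      <xsl:if test="position() &gt; 1"> or </xsl:if>
      <xsl:value-of select="normalize-space(.)"/>
      <!-- Flag terms that an editor has marked as unclear -->
      <xsl:if test="@cert='low'"> (?)</xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The same cert test could be reused for the other keyword types if the 'unclear' tickbox is added to all of them.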

There are potentially further steps we could take toward improving the handling of DCLP keyword metadata, but since what I've suggested already will require changes to both the DCLP metadata xslt and SoSOL, it seemed wise to at least start the conversation. I'm happy to discuss when you get the chance: I'm expecting another dump of metadata from TM in the near future that will allow me to crosswalk xml for literary papyri published since the dawn of DCLP, and it would be good to have a sense of how I want to wrangle it all. So long as I understand how SoSOL will work moving forward, I can wrangle the existing data on my own.