sul-dlss / exhibits

Stanford University Libraries online exhibits showcase
https://exhibits.stanford.edu

Need italics to display for selected MD items in an exhibit #1981

Closed caaster closed 1 year ago

caaster commented 3 years ago

As an exhibit creator for the coming Martin Wong Catalog Raisonné exhibit, I need italics to display in the metadata record for journal titles, book titles and titles of exhibitions, on both the PURL page and on the exhibit item page. It will be hard for art historians to understand bibliographic citations without these.

Example - italics needed for Exhibition History and Related Publications: https://purl.stanford.edu/fk360xf2847

To expand further, here is the specific issue, as exemplified in this related publications entry for pd464yj1635:

“Think Crazy: The Art and History of Delirium” is a chapter in the following book: Delirious: Art at the Limits of Reason, 1950-1980. This entry will be confusing to the researcher if both the chapter and the book are displayed in quotes. The correct/preferred way for this to appear as a citation is for the chapter to appear in quotes and the book to appear in italics.

caaster commented 3 years ago

From: Arcadia Falcone arcadia@stanford.edu
Sent: Friday, February 26, 2021 2:50 PM
To: Catherine A. Aster caster@stanford.edu
Subject: RE: Questions re a Wong MD display request

Hello Cathy,

I can’t find anything that says HTML is not allowed in MARC values. My main concern would be for interoperability outside Stanford, but as we don’t currently share our MODS records externally in any systematic way I’m OK with letting that be a problem for future us. This may impact the modsulator as well in making sure that the output formats the HTML correctly.

Best, Arcadia


From: Catherine A. Aster caster@stanford.edu
Sent: Friday, February 26, 2021 12:56 PM
To: Arcadia Falcone arcadia@stanford.edu
Subject: Questions re a Wong MD display request

Hi Arcadia,

Today I met with Jack & Gary to discuss the following request from the Wong team: https://github.com/sul-dlss/exhibits/issues/1981

Jack has a question and a comment:

If the above is sufficiently clear, please feel free to comment directly on the ticket. If you have follow-on questions, please email me or we can do a slack call.

Thanks, Cathy

caaster commented 3 years ago

Response from Arcadia (edited, as she mentions other items that don't seem to apply here):

When I added the HTML tags the MODS did raise validation errors, so using standard HTML within the MODS is not feasible after all. I’ve attached a couple of sample records with two different workarounds – one encloses the HTML in an XML comment, and the other swaps out the HTML angle brackets for double curly brackets. Let me know if I can provide anything else.

Fixtures available here: https://drive.google.com/drive/folders/1PZPenNHvHCok30OQ50x9w1WjUEIiQiY9

anarchivist commented 3 years ago

Could we possibly use CDATA for this? e.g.

<mods xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://www.loc.gov/mods/v3"
    version="3.7"
    xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-7.xsd">
    <titleInfo>
        <title><![CDATA[<em>Firefly Evening</em>]]>, sketches</title>
    </titleInfo>
    <!-- snip -->
    <note type="publications" displayLabel="Related publications">Doryun Chong and Cosmin Costinas, eds. <![CDATA[<em>Taiping Tianguo: A History of Possible Encounters: Ai Weiwei, Frog King Kwok, Tehching Hsieh, and Martin Wong in New York</em>]]>, (Berlin: Sternberg Press, 2015),137.</note>
</mods>

caaster commented 3 years ago

@arcadiafalcone writes:

The CDATA approach doesn’t seem to interfere with MODS validation, so it’s a possibility. My one concern would be to make sure it works with character encoding.

For example:

台湾 is entered in UTF-8 encoding in input data

In Argo those characters appear instead as numerical character references: &#x53F0;&#x6E7E;

In SearchWorks, the display shows 台湾 (which also appears in the source code)

[Sidebar: apparently purl doesn’t display the vernacular form of the main title at all? That is not ideal.]

CDATA is parsed literally, so if the source metadata in Argo/DOR looks like this:

<title><![CDATA[<em>&#x53F0;&#x6E7E;</em>]]></title>

The display would show that exact string in italics:

&#x53F0;&#x6E7E;

So what we may need to do is enclose only the markup in CDATA so that the encoding of the actual value is not affected. Instead of the above, we could have:

<title><![CDATA[<em>]]>&#x53F0;&#x6E7E;<![CDATA[</em>]]></title>

Which should then display as:

台湾

arcadiafalcone commented 3 years ago

In a code block so the syntax displays correctly:

The CDATA approach doesn’t seem to interfere with MODS validation, so it’s a possibility. My one concern would be to make sure it works with character encoding.

For example:
台湾 is entered in UTF-8 encoding in input data
In Argo those characters appear instead as numerical character references: &#x53F0;&#x6E7E;
In SearchWorks, the display shows 台湾 (which also appears in the source code)
[Sidebar: apparently purl doesn’t display the vernacular form of the main title at all? That is not ideal.]

CDATA is parsed literally, so if the source metadata in Argo/DOR looks like this:
<title><![CDATA[<em>&#x53F0;&#x6E7E;</em>]]></title>
The display would show that exact string in italics: 
&#x53F0;&#x6E7E; [in italics not displaying in code block]

So what we may need to do is enclose only the markup in CDATA so that the encoding of the actual value is not affected. Instead of the above, we could have:
<title><![CDATA[<em>]]>&#x53F0;&#x6E7E;<![CDATA[</em>]]></title>
Which should then display as:
台湾  [in italics not displaying in code block]
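
A quick way to check this behavior is with any conforming XML parser. The sketch below uses Python's standard-library `xml.etree.ElementTree` as a stand-in (an assumption; Argo, purl, and SearchWorks each use their own stacks) to compare the two `<title>` forms above.

```python
import xml.etree.ElementTree as ET

# Form 1: the markup and the character references all sit inside one CDATA section.
whole = ET.fromstring("<title><![CDATA[<em>&#x53F0;&#x6E7E;</em>]]></title>")

# Form 2: only the tags are inside CDATA; the references are ordinary character data.
split = ET.fromstring(
    "<title><![CDATA[<em>]]>&#x53F0;&#x6E7E;<![CDATA[</em>]]></title>"
)

# Inside CDATA the parser leaves "&#x53F0;" untouched; outside CDATA it decodes it.
print(whole.text)
print(split.text)
```

In the first form the references survive as literal text. In the second they are decoded to 台湾, while the `<em>` tags still arrive as plain text that a display layer has to interpret.
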
anarchivist commented 3 years ago

To determine:

arcadiafalcone commented 2 years ago

Some things to consider:

  1. Representing the italics markup in spreadsheet input
  2. Representing the italics markup in MODS XML
  3. Representing the italics markup in Cocina JSON
  4. Representing the italics markup in the Argo descMetadata datastream (until migration off Fedora)
  5. Ensuring that MODS<>Cocina transformations preserve the markup
  6. Ensuring that metadata delivered to access systems preserves the markup
  7. Ensuring that access systems can interpret the markup appropriately wherever the data is displayed

Ideally the markup itself would not display if the target system was unable to interpret it.

One possible approach, if feasible:

  1. In spreadsheet note field: This is a <i>Title</i> for display.
  2. In MODS XML: <note>This is a <![CDATA[<i>]]>Title<![CDATA[</i>]]> for display.</note>
  3. In Cocina JSON: note: [{ value: 'This is a <i>Title</i> for display.'}]
  4. Display: This is a Title for display. (with "Title" rendered in italics)

Italics is the only such formatting use case I know of, so the allowed markup/CDATA content could be constrained to the <i> and </i> tags. Martin Wong use cases include italics in the note and abstract field. Stanford University Press digital monographs have similar use cases. Amos Gitai also has a use case for italics in the title.
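
The spreadsheet-to-MODS-to-Cocina steps above could be sketched roughly as follows. This is an illustration only, not modsulator or Cocina mapping code: the function names `to_mods_note` and `to_cocina_value` are hypothetical, and Python's standard library stands in for the real toolchain.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Only <i> and </i> are treated as markup, per the constraint suggested above.
ITALIC_TAG = re.compile(r"(</?i>)")


def to_mods_note(spreadsheet_value: str) -> str:
    """Wrap each italic tag in its own CDATA section; escape everything else."""
    parts = ITALIC_TAG.split(spreadsheet_value)
    body = "".join(
        f"<![CDATA[{part}]]>" if ITALIC_TAG.fullmatch(part) else escape(part)
        for part in parts
    )
    return f"<note>{body}</note>"


def to_cocina_value(mods_note: str) -> str:
    """Parse the MODS note; the CDATA wrappers disappear and the tags remain as text."""
    return ET.fromstring(mods_note).text or ""


mods = to_mods_note("This is a <i>Title</i> for display.")
print(mods)
print(to_cocina_value(mods))
```

Splitting on the tags keeps the rest of the value in ordinary character data, so character references and escaping in the surrounding text are handled by the parser as usual.
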

anarchivist commented 2 years ago

From today's meeting: As we understand it, items 1-6 in @arcadiafalcone's first list in the previous comment fall within the Infrastructure Team's portfolio, and item 7 is the Access Team's responsibility. @caaster will follow up with Vivian to see what we need to do to move this work forward with the Infrastructure Team.

arcadiafalcone commented 2 years ago

@anarchivist Is <em> preferable to <i> as the italics tag?

anarchivist commented 2 years ago

@arcadiafalcone I probably miswrote above - <i> is preferable since we're trying to identify a set-off title instead of indicating emphasis.

ggeisler commented 2 years ago

Edit: @anarchivist beat me by 2 seconds!

FWIW, based on the examples motivating this feature (that I'm aware of, anyway: journal titles, book titles and titles of exhibitions), I believe <i> is the more appropriate element to use, because we're usually trying to set off the contained text, rather than emphasize it.

mjgiarlo commented 2 years ago

This is super helpful, @arcadiafalcone, thank you. What all are the sources of these data that will contain the italics? I.e., will it all come from spreadsheets (flowing through modsulator)? I'm asking so I can get a handle on what all needs to be touched to accommodate this feature.

arcadiafalcone commented 2 years ago

@mjgiarlo In current Argo, it could be a spreadsheet ingest, MODS ingest, or inline editing of the descMetadata datastream.

jcoyne commented 2 years ago

@arcadiafalcone in the spreadsheet ingest would you apply the italic style to the text, or would you be including literal html markup?

How should this work for non-html capable viewers of this data?

jcoyne commented 2 years ago

Ideally the markup itself would not display if the target system was unable to interpret it.

Wouldn't it have to interpret it first to determine whether or not it can be displayed?

arcadiafalcone commented 2 years ago

@jcoyne

@arcadiafalcone in the spreadsheet ingest would you apply the italic style to the text, or would you be including literal html markup?

It would need to be literal HTML markup to support the use cases where the user loads a CSV or XML (or JSON).

How should this work for non-html capable viewers of this data?

What do you mean by viewers? Could you give an example?

Wouldn't it have to interpret it first to determine whether or not it can be displayed?

I was thinking of when users put in <i> tags in the past and they displayed in SearchWorks. But if SearchWorks, purl, and Spotlight all are capable of displaying italics when the source data includes a <i> tag in certain fields, and the source data is processed to strip out any <i> tags that aren't in those fields, it shouldn't be an issue.
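
The "strip out any `<i>` tags that aren't in those fields" step might look something like the sketch below. The field names and the flat record shape are assumptions made for illustration; they are not the actual SDR data model.

```python
import re

ITALIC_TAG = re.compile(r"</?i>")

# Assumption: only these fields are allowed to carry italics markup.
ITALIC_FIELDS = {"note", "abstract"}


def strip_disallowed_italics(record: dict[str, str]) -> dict[str, str]:
    """Keep <i> tags in permitted fields and remove them everywhere else."""
    return {
        field: value if field in ITALIC_FIELDS else ITALIC_TAG.sub("", value)
        for field, value in record.items()
    }


record = {
    "title": "<i>Firefly Evening</i>, sketches",
    "note": "See <i>Delirious</i>.",
}
cleaned = strip_disallowed_italics(record)
print(cleaned)
```
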

jcoyne commented 2 years ago

I think we're moving toward making our data incompatible with any other system that uses MODS. To me the value in following standards is that our data is interoperable.

For example, let's say someone makes software that makes PDFs out of MODS. The PDF language doesn't know anything about HTML. It would not properly handle this data.

This proposal is to basically expand the standard such that only readers who are aware of Stanford's version of MODS can make sense of these records. I think this sort of issue should be brought before the maintainers of the MODS standard so that this use case can be incorporated into the standard. Have we looked into doing that?

mjgiarlo commented 2 years ago

I am wondering if we might want a more Cocina-oriented solution, essentially extending the model to allow users to specify both a metadata value and how it should be formatted (as a sidecar assertion). Then, systems could use the formatting extension(s) if supported or ignore them if not? We have a rich, structured metadata model here and it occurs to me that we could leverage (and add to) that richness rather than stuff potentially not-well-formed data into metadata representations that downstream systems would then need to know about and handle.

arcadiafalcone commented 2 years ago

@mjgiarlo I like the idea of making it more Cocina-oriented, and can imagine how that might be modeled. My questions then would be:

  1. What would a user enter into a spreadsheet in order to generate this kind of structure in Cocina?
  2. How would this information be delivered to access systems as XML?
  3. Does this work with the Martin Wong project timeline?

mjgiarlo commented 2 years ago

Great questions! I don't know what the Martin Wong project timeline is, myself.

cc: @vivnwong @caaster @anarchivist

I am wondering if this request is complex enough, and touches enough systems and data structures, that we might want to schedule dedicated team time to analyze and work on it. It's feeling to me like more than we can do in a one-off maintenance week, and work that would benefit from cross-team analysis and testing rather than having it come in as side work to our current work cycle.

anarchivist commented 2 years ago

Great questions! I don't know what the Martin Wong project timeline is, myself.

Based on my understanding of @caaster's discussions with the Wong project team, we need to have something implemented by April. Our window is definitely closing to get this resolved.

caaster commented 2 years ago

We need to have this implemented by 1 April for the Wong project -- confirmed.

jcoyne commented 2 years ago

@anarchivist I don't think we can make this change in SDR by April. We are in the middle of a workcycle that moves us away from persisting MODS. This change would be a substantial effort that we're just hearing about. This would touch a whole bunch of SDR code bases and would require us to put aside our current work to pivot to this.

caaster commented 2 years ago

@jcoyne - hearing that you can't do this by April, so it would be helpful to know by what date you could feasibly accomplish this. Thank you.

mjgiarlo commented 2 years ago

On one hand, citation formatting is not trivial; it is semantically significant to many of our users and potentially misleading to render w/o intended formatting.

On the other, the solutions we've discussed seem kludgy---in that we're proposing to mix values with formatting information, but then only conditionally apply said formatting---with potentially large and leaky side effects. And we'd be attempting to tackle this work at the same time as we're putting significant effort into moving away from storing MODS in Fedora.

I wonder whether it might be possible to negotiate with our stakeholders on the scheduling of this one requirement. That would give us time to get our heads together (@sul-dlss/access-team, @sul-dlss/infrastructure-team, @arcadiafalcone, @andrewjbtw), do some planning, and figure out how best to support this.

caaster commented 2 years ago

It would be helpful to know by what date you all could feasibly accomplish this requested change. Thank you.

Here is some additional info.

jcoyne commented 2 years ago

I think there's a good likelihood that we'd have to patch long-dead libraries like activefedora, om and rubydora and possibly Fedora 3 itself, because those do a lot of normalization and I know they haven't been tested to support CDATA tags. We'd also be looking at the XML editor in argo and building custom parsers to make sure the values are permitted and well formed. In the past we've just been relying on tools from LOC, but we're now expanding the format. Additionally we need to validate and update the COCINA data model to check for validity and well-formedness. We'd rewrite a chunk of modsulator to check validity and well-formedness. We'd need to update and validate round-trip mappings between Fedora and COCINA data models. We'd also have to check that this is supported correctly by our Solr serialization/deserialization. Argo would also need its display logic changed so that it could handle these new formatting directives. The robots code would also need to be updated to be able to validate this new XML.

Any XSLT transformation from MODS such as https://github.com/sul-dlss/dor-services-app/blob/main/app/services/publish/mods2dc.xslt would need to be validated, because this new change would introduce XML into the output, which is not what we are expecting.

jcoyne commented 2 years ago

It seems to me that given the potential scope of work we're looking at, we wouldn't be doing our due diligence without making an attempt to ask the MODS editors what the preferred way to cite a book chapter is. Is there a deficiency in MODS here? I don't think that I'm the most qualified/knowledgeable person to make this inquiry, but I will happily do so if there are no other volunteers.

vivnwong commented 2 years ago

@caaster, we start the first WC to replace Fedora. I scheduled a minimum of two WCs to do the work. It is too early to know when to take on other work (including this request). I can let you know a better estimate by the end of the first WC (middle of March).

vivnwong commented 2 years ago

I would encourage the team to collect the information and analyze the different approaches.

arcadiafalcone commented 2 years ago

@jcoyne I am on the MODS editorial committee and could bring this up for discussion at the next meeting. But I think that in terms of what gets delivered to our discovery systems, this is less an issue of representing-formatting-in-MODS than representing-formatting-in-XML, and it appears the CDATA approach is the accepted way to do that.

jcoyne commented 2 years ago

Unfortunately, that's not a great solution when we use XSLT to transform the XML. Then the transformed output can be invalid.
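
One way this concern could play out, sketched under the assumption that some transformation step copies note text into its output without re-escaping it. The sketch uses Python's standard library rather than XSLT, and the `description` element name and helper functions are illustrative only.

```python
import xml.etree.ElementTree as ET

# A note whose CDATA-wrapped markup is unbalanced: an opening <i> with no </i>.
note = ET.fromstring("<note>A <![CDATA[<i>]]>Title for display.</note>")
text = note.text  # the parser hands back plain text: "A <i>Title for display."


def copy_without_escaping(value: str) -> str:
    """Hypothetical transform step that pastes text into the output as-is."""
    return f"<description>{value}</description>"


def copy_with_escaping(value: str) -> str:
    """Transform step that lets the serializer escape the text."""
    element = ET.Element("description")
    element.text = value
    return ET.tostring(element, encoding="unicode")


def is_well_formed(xml: str) -> bool:
    try:
        ET.fromstring(xml)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(copy_without_escaping(text)))
print(is_well_formed(copy_with_escaping(text)))
```

With escaping, the output stays well formed but carries the tags as escaped text, so every downstream consumer still has to know that the text may contain markup.
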

andrewjbtw commented 2 years ago

This has probably been considered, but what about encoding the fact that a title is a title in the metadata and then leaving it up to consuming applications to determine what to do with that information? Then the display code could apply italics to any "title" data found in a note or abstract.

Encoding the title-ness of a title, rather than encoding one specific way to format it, should make the data more interoperable with other systems. I could see someone wanting to extract all titles from a bibliography, or wanting to add hyperlinks to all titles, in addition to wanting to italicize the titles. Encoding just the italics would make other uses more difficult, as consuming applications would need to understand the specifics of how we encoded the italics and would not have any clear indication in the encoding of why the text was italicized.

I don't know whether marking up titles as titles would be compatible with MODS or what impact this would have on Cocina. But every solution to this problem will require marking up the specific blocks of text that need special handling (formatting, linking, entity extraction, etc.), so it seems like a spreadsheet or other interface change that would allow users to declare "italicize this text" could also be used to declare "this text is a title."

If there are other use cases for italics than titles, then this suggestion wouldn't cover that. But if we're going to add markup to our description, I think there would be value in marking up the titles as titles and then handling other cases as appropriate to those data types and needs.
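
As a rough illustration of this idea, the sketch below marks a segment as a title and lets two different consumers decide what to do with it. The segment structure and the `type` key are hypothetical; nothing here reflects an existing MODS or Cocina feature.

```python
import html

# Hypothetical encoding: each segment of a note says what it is, not how it looks.
segments = [
    {"type": "text", "value": "Chapter in "},
    {"type": "title", "value": "Delirious: Art at the Limits of Reason, 1950-1980"},
    {"type": "text", "value": "."},
]


def extract_titles(parts: list[dict[str, str]]) -> list[str]:
    """A consumer that only wants the titles, e.g. to build a bibliography."""
    return [part["value"] for part in parts if part["type"] == "title"]


def render_html(parts: list[dict[str, str]]) -> str:
    """A display consumer that chooses to italicize titles."""
    return "".join(
        f"<i>{html.escape(part['value'])}</i>"
        if part["type"] == "title"
        else html.escape(part["value"])
        for part in parts
    )


print(extract_titles(segments))
print(render_html(segments))
```
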

arcadiafalcone commented 2 years ago

Following up on @mjgiarlo's 1/18 comment above: I could see representing this in Cocina as something like

{
  note: [
    {
      structuredValue: [
        {
          value: 'Title',
          style: 'italic'
        },
        {
          value: 'and other stuff'
        }
      ]
    }
  ]
}

mapping to whatever output is necessary to display "Title and other stuff".
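
A consumer of that structure could map it to output along these lines. This is a sketch only: the `style` key is the proposal above, not an existing Cocina property, and joining the parts with a single space is an assumption.

```python
import html

note = {
    "structuredValue": [
        {"value": "Title", "style": "italic"},
        {"value": "and other stuff"},
    ]
}


def to_plain_text(item: dict) -> str:
    """For systems that cannot show formatting: ignore the style entirely."""
    return " ".join(part["value"] for part in item["structuredValue"])


def to_html(item: dict) -> str:
    """For HTML displays: escape each value, then wrap italic parts in <i>."""
    rendered = []
    for part in item["structuredValue"]:
        text = html.escape(part["value"])
        rendered.append(f"<i>{text}</i>" if part.get("style") == "italic" else text)
    return " ".join(rendered)


print(to_plain_text(note))
print(to_html(note))
```

Because the style lives beside the value rather than inside it, a system that does not understand `style` can fall back to the plain-text path without ever seeing stray tags.
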

justinlittman commented 2 years ago

That seems like a very complex way of representing styling. It would add a lot of complexity to any code that was trying to process cocina descriptive.

jcoyne commented 2 years ago

Is it possible to know which styles should be added to the displayed HTML from the Cocina without adding style markup? Can we dereference the related items, and if it is a title from one of the specified types (journal, book, exhibit), it could be italicized?

jcoyne commented 2 years ago

I think it's also worth considering how we prevent h2 users from abusing this feature to inject arbitrary markup that displays in purl.
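
One possible guard, sketched here with Python's standard library and a hypothetical function name: escape the whole value first, then restore only the bare `<i>` and `</i>` tags. This is not how purl currently works, and it does not check that the tags are balanced.

```python
import html
import re

# Matches only the escaped forms of a bare <i> or </i>; tags with attributes never match.
ESCAPED_ITALIC = re.compile(r"&lt;(/?)i&gt;")


def render_with_italics_only(value: str) -> str:
    """Escape all markup, then re-enable the two allowed tags."""
    escaped = html.escape(value, quote=False)
    return ESCAPED_ITALIC.sub(r"<\1i>", escaped)


print(render_with_italics_only("<i>Delirious</i> <script>alert(1)</script>"))
print(render_with_italics_only('<i onclick="x">Title</i>'))
```
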

jcoyne commented 2 years ago

It looks to me like we should be using the <cite> element (rather than <i> or <em>), as this is the proper semantic markup and may be better for accessibility. See https://developer.mozilla.org/en-US/docs/Web/HTML/Element/cite

Typically, browsers style the contents of a <cite> element in italics by default.

thatbudakguy commented 2 years ago

closing, since the desired display for Martin Wong items has been achieved per @caaster.

caaster commented 1 year ago

Martin Wong project stakeholders have just reported (in the last week) that exhibit items with <i> markup in the metadata are no longer displaying as italics. This was a key requirement for the launch of the Martin Wong Catalog Raisonné in Fall 2022. Please see this item as an example: https://exhibits.stanford.edu/martin-wong/catalog/md719gg1885.

hudajkhan commented 1 year ago

In the example above (https://exhibits.stanford.edu/martin-wong/catalog/md719gg1885), the description field shows "Wong’s work, Untitled (costume fragment from the Angels of Light “Peking on Acid” performance". The underlying HTML uses "&lt;i&gt;", which displays as a literal "<i>" instead of actually providing italicization.

caaster commented 1 year ago

@jcoyne - from Arcadia:

"Exhibition history and related publications are both MODS notes. The field labeled "Description" (the one reported) is MODS abstract. Abstract was definitely included in the infra side ticket (where it mattered less because in cocina abstract is just another type of note), but may have dropped off the ticketing on the access side.

So on the access side the MODS fields that use italics are note and abstract."

jcoyne commented 1 year ago

@caaster do you know what fields note and abstract go into in terms of the Exhibits software?

caaster commented 1 year ago

@jcoyne Arcadia just explained further, "I think exhibition history and related publications are fine – they’re not displayed on the exhibit item show page or in search results, and the more details/purl views always had the italics. So if the description has the italics in Spotlight, the issue should be resolved."

jcoyne commented 1 year ago

@caaster okay. I think this issue is resolved now.