sul-dlss / mods_display

MODS Display is a gem to centralize the display logic of MODS metadata.

Consistent handling of line breaks #78

Closed arcadiafalcone closed 2 years ago

arcadiafalcone commented 2 years ago

Desired behavior: The user enters &#xA; as part of a value in a replayable spreadsheet or MODS file that is uploaded to Argo, or by editing the datastream in Argo directly. This may appear in the note or abstract element, possibly nested under relatedItem. The display in purl, SearchWorks, and Spotlight (and eventually the data catalog) interprets &#xA; as a line break. In Spotlight, the line break should appear both on the item show page and in the "More details" view.

See below for update.

arcadiafalcone commented 2 years ago

Argo data (what shows as line breaks was entered as &#xA; - closing the datastream converted the character to an actual line break): image

Desired display (taken from purl): image

arcadiafalcone commented 2 years ago

The line break character should display in the MODS XML accessible via purl. I think it would also make sense to display the character in Argo rather than converting to an actual line break.

Further investigation: Check how replayable spreadsheet encodes line breaks within cells (Excel, Google Sheets, CSV).

arcadiafalcone commented 2 years ago

To reduce confusion and error on the user side, it would be preferable to recognize line break characters generated by applications commonly used to create metadata for the SDR, primarily spreadsheet applications. So far as I can tell, Microsoft Excel, Mac Excel, and Google Sheets when converted to CSV all represent line breaks within a value as the linefeed character (\n). Regardless of what character is used to represent line breaks internally, a linefeed character in input should be recognized as a line break.

Examples from Martin Wong collection Google Sheet:
nd259gr6577 has line breaks in note with type "exhibitions"
xh643gr2981 has line breaks in abstract

caaster commented 2 years ago

Martin Wong collection MD Google sheet: https://docs.google.com/spreadsheets/d/1IdFPoiv6BPJwX1cHZyt2oskY1e6mVX_p6brbhNrZyAw/edit#gid=0

thatbudakguy commented 2 years ago

my understanding is this has (maybe) two parts?

  1. Infrastructure ensures that newlines in input spreadsheets (\n) get converted to &#xA; in the metadata
  2. Access takes display logic that converts &#xA; into line breaks from purl (where it works correctly) and ports it to any other properties that need to display the line breaks
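
As a rough illustration of part 1, here is a minimal Ruby sketch. The helper name `encode_linefeeds` and the constant `XML_ESCAPES` are hypothetical; this is not the actual modsulator or stanford-mods-normalizer code. It assumes the value is plain text being written into an XML text node.

```ruby
# Hypothetical helper, for illustration only; not the actual modsulator or
# stanford-mods-normalizer implementation.
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' }.freeze

# Escape XML-special characters first, then write each linefeed as the
# character reference &#xA; so the break is explicit in the serialized XML.
def encode_linefeeds(value)
  value.gsub(/[&<>]/, XML_ESCAPES).gsub("\n", '&#xA;')
end

puts encode_linefeeds("First line\nSecond line & more")
```

Escaping before encoding matters here: doing it the other way around would turn the `&` of `&#xA;` into `&amp;`.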

mjgiarlo commented 2 years ago

Hi, all! I'm documenting what all work this and sul-dlss/exhibits#1981 needs on the @sul-dlss/infrastructure-team side, hoping to have it written up early next work week for @vivnwong to review (so that work can be resourced, scheduled, and prioritized).

@arcadiafalcone Should we (or, I guess, modsulator in this case?) only convert linefeeds to &#xA; for Martin Wong-related work or for all data? If the latter, do you have thoughts on how we should identify spreadsheets as being affiliated with this collection?

arcadiafalcone commented 2 years ago

@mjgiarlo For all data, not just Martin Wong.

mjgiarlo commented 2 years ago

@arcadiafalcone OK, thanks! And I have now exported the MW google sheet as an xlsx to test running through modsulator.

mjgiarlo commented 2 years ago

@arcadiafalcone @thatbudakguy @caaster @vivnwong AFAICT, this change shouldn't require much development work given that the stanford-mods-normalizer gem already does this:

https://github.com/sul-dlss/mods_normalizer/blob/master/lib/stanford/mods/normalizer.rb#L58-L82

A problem currently is that modsulator is stripping out control characters before the stanford-mods-normalizer gem runs ☝🏻 code:

https://github.com/sul-dlss/modsulator-app-rails/blob/main/app/models/modsulator_sheet.rb#L31

This line might also work against us, which needs testing:

https://github.com/sul-dlss/modsulator-app-rails/blob/main/app/lib/modsulator.rb#L104

Excluding time from @arcadiafalcone (e.g., help with testing in QA/stage?), I suspect this change would require no more than 1-2 developer days' worth of work.
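
To make the control-character concern concrete, here is a minimal Ruby sketch of a sanitizer that removes control characters but leaves linefeeds (and tabs) intact, so a later normalization step can still see them. The names `STRIPPABLE_CONTROL_CHARS` and `strip_control_chars` are hypothetical, and this is an assumption about one possible approach, not what the linked modsulator code does.

```ruby
# Hypothetical sanitizer, for illustration only; the real modsulator code may differ.
# Matches C0 control characters and DEL, except tab (\x09) and linefeed (\x0A).
# Note that carriage return (\x0D) is inside the stripped range.
STRIPPABLE_CONTROL_CHARS = /[\x00-\x08\x0B-\x1F\x7F]/

def strip_control_chars(value)
  value.gsub(STRIPPABLE_CONTROL_CHARS, '')
end

p strip_control_chars("bell\x07 gone\nlinefeed kept")
```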

mjgiarlo commented 2 years ago

@arcadiafalcone Should this be restricted to certain fields or should line breaks be allowed in all fields?

arcadiafalcone commented 2 years ago

@mjgiarlo Current use cases are for <note> and <abstract> (either at the top level or nested under <relatedItem>).

thatbudakguy commented 2 years ago

@arcadiafalcone I'm looking at nd259gr6577 right now (in the context of exhibits), and it looks like there's a single note with "exhibitions" type where the values are semicolon-separated. i think if there were multiple notes with the "exhibitions" type, as there are for "related publications", we'd get the semantically correct multiple <dd>s, which would display how we want:

Image

does this make sense?

arcadiafalcone commented 2 years ago

I see how that could work for exhibition history, though I'd have to do a data check with the Martin Wong folks. I know there is at least one note that quotes a stanza of poetry that it wouldn't make sense to break up in that way.

arcadiafalcone commented 2 years ago

Anneliis and I tested this in argo-prod today and weren't able to generate line breaks in the display. All testing of line breaks was done in notes with display label "Exhibition history."

Behavior
1) Download cocina spreadsheet for five sample druids
2) Upload CSV to Google sheets
3a) Add line breaks to exhibition history notes by using cmd/ctrl + Enter
3b) For druid pd464yj1635, we also tried adding the string "\n" to the end of each line (in addition to the visible line breaks)
4) Confirmed that objects completed accessioning
5) Viewed purls
6) Line breaks did not display (but italics did!!!)
7) Checked the cocina in Argo - pd464yj1635 represents the line breaks as "\n", vh115kc5665 as "\n"

Data
Druids: druid:pd464yj1635 druid:vh115kc5665 druid:ns768mm2958 druid:qr959jx6172 druid:mg552nh0170

Test spreadsheet: martin_wong_display_test.csv

Additional context
I've also noticed some odd behavior in how non-Martin Wong records with line breaks display on purl (abstracts in Hydrus objects).
1) https://purl.stanford.edu/bb782yf9388 has line breaks, but also displays escaped characters. The cocina has line breaks as "\r\n".
2) https://purl.stanford.edu/bb490zh2544 has line breaks (visible when resizing window). The cocina has line breaks as "\n" (plus one in a note as "\r\n")
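
Since the records above mix "\r\n" and "\n", one option is to normalize line endings to a single linefeed before any display logic runs. This is a minimal Ruby sketch under that assumption; `normalize_line_endings` is a hypothetical helper and not part of purl or mods_display.

```ruby
# Hypothetical helper, for illustration only.
# Converts Windows (\r\n) and bare carriage-return (\r) line endings to \n.
def normalize_line_endings(value)
  value.gsub(/\r\n?/, "\n")
end

p normalize_line_endings("first\r\nsecond\nthird")
```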

cbeer commented 2 years ago

https://github.com/sul-dlss/purl/pull/594 will fix the display of linebreaks for bb782yf9388. Unless I'm missing something, that's also the only record I see that encoded intentional line breaks with the expected &#xA; (although bb782yf9388 seems to now use the decimal encoding &#13;.. which I guess ought to be fine too?)
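
For reference, the two character references mentioned above are not the same character: &#xA; is hexadecimal A (decimal 10, linefeed), while &#13; is decimal 13 (carriage return). The small Ruby sketch below decodes a single numeric character reference to show this; `decode_char_ref` is a hypothetical helper written for this illustration, not part of any library.

```ruby
# Hypothetical helper, for illustration only.
# Decodes one numeric character reference such as "&#xA;" or "&#13;".
def decode_char_ref(ref)
  match = ref.match(/\A&#(?:x(\h+)|(\d+));\z/)
  raise ArgumentError, "not a numeric character reference: #{ref}" unless match

  # Group 1 holds hexadecimal digits, group 2 holds decimal digits.
  code_point = match[1] ? match[1].to_i(16) : match[2].to_i(10)
  [code_point].pack('U')
end

p decode_char_ref('&#xA;') # => "\n"
p decode_char_ref('&#13;') # => "\r"
```

Display code that wants to treat both as a line break would need to handle carriage returns as well as linefeeds.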

thatbudakguy commented 2 years ago

purl v4.3.2, which is now live, does seem to have fixed bb782yf9388.

mjgiarlo commented 2 years ago

@cbeer @thatbudakguy @arcadiafalcone

AFAICT, this test object in stage (which has snippets of MW data) uses the same newline approach for both abstracts and notes:

$ curl https://sul-purl-stage.stanford.edu/kd791zq6661.xml

XML snippet:

<?xml version="1.0" encoding="UTF-8"?>
<publicObject id="druid:kd791zq6661" published="2022-07-06T20:07:08Z" publishVersion="cocina-models/0.82.0">
  <mods xmlns="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:xlink="http://www.w3.org/1999/xlink" version="3.7" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-7.xsd">
    <titleInfo>
      <title>Testing italics and newlines again</title>
    </titleInfo>
    <abstract>I am in a <![CDATA[<cite>]]>CSV desc upload<![CDATA[</cite>]]> file!&#13;Breaks surround me.

OK, now we shall test <![CDATA[<i>]]>ITALICS<![CDATA[</i>]]></abstract>
    <note displayLabel="Exhibition history" type="exhibitions"><![CDATA[<i>]]>Situations<![CDATA[</i>]]> (group exhibition) Bess Cutler Gallery, New York, January-February 11, 1984;

<![CDATA[<i>]]>Aspects of the City<![CDATA[</i>]]> (group exhibition) The Metropolitan Museum of Art, New York, July 31-September 30, 1984;

<![CDATA[<i>]]>Nueva Pintura Narrativa: Coleccion del Metropolitan Museum of Art, Nueva York<![CDATA[</i>]]> (group exhibition) Museo Rufino Tamayo, Mexico City, November-December, 1984;

<![CDATA[<i>]]>The Artist Celebrates New York: Selected Paintings from The Metropolitan Museum of Art<![CDATA[</i>]]> (group exhibition) Bronx Museum of the Arts, Bronx, February 2-March 24, 1985;

<![CDATA[<i>]]>The Artist Celebrates New York: Selected Paintings from The Metropolitan Museum of Art<![CDATA[</i>]]> (group exhibition) Long Island University, Brooklyn, April 17- June 2, 1985;

<![CDATA[<i>]]>The Artist Celebrates New York: Selected Paintings from The Metropolitan Museum of Art<![CDATA[</i>]]> (group exhibition) Jamaica Arts Center, Queens, June 15-July 27,  1985;

<![CDATA[<i>]]>The Artist Celebrates New York: Selected Paintings from The Metropolitan Museum of Art<![CDATA[</i>]]> (group exhibition) City College of New York, October 25- December 13,1985;

<![CDATA[<i>]]>The 1980s: A New Generation, American Painters and Sculptors<![CDATA[</i>]]> (group exhibition) The Metropolitan Museum of Art, New York, April 13-July 31, 1988, brochure no. 48;

<![CDATA[<i>]]>Sweet Oblivion: The Urban Landscape of Martin Wong,<![CDATA[</i>]]> Illinois State University Galleries, Normal, January 13-February 22, 1998;

<![CDATA[<i>]]>East Village USA<![CDATA[</i>]]> (group exhibition) New Museum of Contemporary Art, New York, December 9, 2004-March 19,2005;

<![CDATA[<i>]]>Martin Wong: Human Instamatic,<![CDATA[</i>]]> Bronx Museum of the Arts, November 4, 2015-March 13, 2016;

<![CDATA[<i>]]>Delirious: Art at the Limits of Reason, 1950-1980<![CDATA[</i>]]> (group exhibition) The Metropolitan Museum of Art, The Met Breuer, New York, September 13, 2017-January 14, 2018
  </note>
    <location>
      <url usage="primary display">https://sul-purl-stage.stanford.edu/kd791zq6661</url>
    </location>
  </mods>
</publicObject>

And the purl UI seems to be treating them differently:

Screenshot from 2022-07-06 13-15-04

It appears that we're consistently shipping this data?

cbeer commented 2 years ago

We were expecting encoded line breaks in the XML metadata to avoid confusion about significant and insignificant whitespace. However, with Cocina metadata, all whitespace is treated as significant, so it can't make that distinction in the produced HTML.

Given that, we intend to take an incremental approach (in #120) to apply formatting to abstract and note fields only, until we can do some due diligence to make sure the additional formatting doesn't show up in surprising contexts.
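
A minimal Ruby sketch of what such an incremental approach could look like, assuming a hypothetical `format_value` helper and a hypothetical allowlist `FIELDS_WITH_LINE_BREAKS`; this is not the actual mods_display implementation. It escapes the value for HTML and converts line breaks to `<br>` only for abstract and note fields. For simplicity it escapes all markup, so it does not cover the italics handling discussed earlier in the thread.

```ruby
require 'erb'

# Assumption: only these fields get line-break formatting for now.
FIELDS_WITH_LINE_BREAKS = %w[abstract note].freeze

# Hypothetical formatter, for illustration only.
# Escapes the value for HTML, then turns line breaks into <br> for allowed fields.
def format_value(field, value)
  escaped = ERB::Util.html_escape(value)
  return escaped unless FIELDS_WITH_LINE_BREAKS.include?(field)

  escaped.gsub(/\r\n?|\n/, '<br>')
end

puts format_value('abstract', "Breaks surround me.\n\nOK, now we shall test")
puts format_value('title', "Titles keep\ntheir whitespace as-is")
```

Keeping the field list in one constant makes it easy to extend once the formatting has been checked in other contexts.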