Closed arcadiafalcone closed 2 years ago
Argo data (what shows as line breaks was entered as the encoded line break character - closing the datastream converted the character to an actual line break):
Desired display (taken from purl):
The line break character should display in the MODS XML accessible via purl. I think it would also make sense to display the character in Argo rather than converting to an actual line break.
Further investigation: Check how replayable spreadsheet encodes line breaks within cells (Excel, Google Sheets, CSV).
To reduce confusion and error on the user side, it would be preferable to recognize line break characters generated by applications commonly used to create metadata for the SDR, primarily spreadsheet applications. So far as I can tell, Microsoft Excel, Mac Excel, and Google Sheets when converted to CSV all represent line breaks within a value as the linefeed character (\n). Regardless of what character is used to represent line breaks internally, a linefeed character in input should be recognized as a line break.
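A minimal sketch of the CSV behavior described above, using Python's standard csv module rather than any SDR tooling: a linefeed inside a cell is written inside a quoted field and read back unchanged. The druid value is a made-up placeholder.

```python
import csv
import io


def round_trip_cell(value):
    """Write one row to CSV text, read it back, and return the second cell."""
    buffer = io.StringIO(newline="")
    # "druid:xx000xx0000" is a placeholder identifier, not a real object.
    csv.writer(buffer).writerow(["druid:xx000xx0000", value])
    buffer.seek(0)
    row = next(csv.reader(buffer))
    return row[1]


note = "First exhibition;\nSecond exhibition"
print(repr(round_trip_cell(note)))
```

The embedded linefeed survives because the writer quotes any field that contains a line break.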
Examples from Martin Wong collection Google Sheet: nd259gr6577 has line breaks in note with type "exhibitions"; xh643gr2981 has line breaks in abstract
Martin Wong collection MD Google sheet: https://docs.google.com/spreadsheets/d/1IdFPoiv6BPJwX1cHZyt2oskY1e6mVX_p6brbhNrZyAw/edit#gid=0
my understanding is this has (maybe) two parts?
1) linefeed characters (\n) get converted to the encoded line break character in the metadata
2) the conversion of that encoded character into line breaks is taken from purl (where it works correctly) and ported to any other properties that need to display the line breaks

Hi, all! I'm documenting what all work this and sul-dlss/exhibits#1981 needs on the @sul-dlss/infrastructure-team side, hoping to have it written up early next work week for @vivnwong to review (so that work can be resourced, scheduled, and prioritized).
@arcadiafalcone Should we (or, I guess, modsulator in this case?) only convert linefeeds to the encoded line break character for Martin Wong-related work or for all data? If the latter, do you have thoughts on how we should identify spreadsheets as being affiliated with this collection?
@mjgiarlo For all data, not just Martin Wong.
@arcadiafalcone OK, thanks! And I have now exported the MW google sheet as an xlsx to test running through modsulator.
@arcadiafalcone @thatbudakguy @caaster @vivnwong AFAICT, this change shouldn't require much development work given that the stanford-mods-normalizer gem already does this:
https://github.com/sul-dlss/mods_normalizer/blob/master/lib/stanford/mods/normalizer.rb#L58-L82
A problem currently is that modsulator is stripping out control characters before the stanford-mods-normalizer gem runs ☝🏻 code:
https://github.com/sul-dlss/modsulator-app-rails/blob/main/app/models/modsulator_sheet.rb#L31
This line might also work against us, which needs testing:
https://github.com/sul-dlss/modsulator-app-rails/blob/main/app/lib/modsulator.rb#L104
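To illustrate the ordering problem described above, here is a rough sketch in Python. Both functions are hypothetical stand-ins, not the actual modsulator or stanford-mods-normalizer code, and the `&#10;` output form is an assumption about what the encoded line break looks like.

```python
import unicodedata


def strip_control_characters(value, keep=""):
    """Drop Unicode control characters (category Cc) except those in `keep`."""
    return "".join(
        ch for ch in value
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


def encode_linefeeds(value):
    """Replace each linefeed with an encoded form (assumed here to be &#10;)."""
    return value.replace("\n", "&#10;")


cell = "First line\nSecond line"

# Stripping first removes the linefeed, so nothing is left to encode.
print(encode_linefeeds(strip_control_characters(cell)))

# Sparing the linefeed during stripping lets the later step encode it.
print(encode_linefeeds(strip_control_characters(cell, keep="\n")))
```

Whether the real fix is to spare linefeeds during stripping or to reorder the two steps would need testing against the linked code.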
Excluding time from @arcadiafalcone (e.g., help with testing in QA/stage?), I suspect this change would require no more than 1-2 developer days' worth of work.
@arcadiafalcone Should this be restricted to certain fields or should line breaks be allowed in all fields?
@mjgiarlo Current use cases are for <note> and <abstract> (either at the top level or nested under <relatedItem>).
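As a sketch of how that scope could be checked, the following uses Python's standard xml.etree.ElementTree to report which note and abstract elements in a MODS record contain line breaks, at any depth. The sample record and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Invented sample record; only the element names come from the discussion.
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>One
two</title></titleInfo>
  <abstract>First line
second line</abstract>
  <relatedItem><note>Stanza one
stanza two</note></relatedItem>
  <note>No break here</note>
</mods>"""


def fields_with_line_breaks(mods_xml):
    """Return the tag name of each note/abstract whose text has a linefeed."""
    root = ET.fromstring(mods_xml)
    found = []
    for tag in ("note", "abstract"):
        # ".//" matches at any depth, so elements under relatedItem count too.
        for element in root.iterfind(f".//mods:{tag}", MODS_NS):
            if "\n" in "".join(element.itertext()):
                found.append(tag)
    return found


print(fields_with_line_breaks(SAMPLE))
```

The title in the sample also contains a linefeed but is ignored, since only note and abstract are in scope.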
@arcadiafalcone I'm looking at nd259gr6577 right now (in the context of exhibits), and it looks like there's a single note with "exhibitions" type where the values are semicolon-separated. i think if there were multiple notes with the "exhibitions" type, as there are for "related publications", we'd get the semantically correct multiple <dd>s, which would display how we want:
does this make sense?
I see how that could work for exhibition history, though I'd have to do a data check with the Martin Wong folks. I know there is at least one note that quotes a stanza of poetry that it wouldn't make sense to break up in that way.
Anneliis and I tested this in argo-prod today and weren't able to generate line breaks in the display. All testing of line breaks was done in notes with display label "Exhibition history."
Behavior
1) Download cocina spreadsheet for five sample druids
2) Upload CSV to Google sheets
3a) Add line breaks to exhibition history notes by using cmd/ctrl + Enter
3b) For druid pd464yj1635, we also tried adding the string "\n" to the end of each line (in addition to the visible line breaks)
4) Confirmed that objects completed accessioning
5) Viewed purls
6) Line breaks did not display (but italics did!!!)
7) Checked the cocina in Argo - pd464yj1635 represents the line breaks as "\n", vh115kc5665 as "\n"
Data
Druids: druid:pd464yj1635 druid:vh115kc5665 druid:ns768mm2958 druid:qr959jx6172 druid:mg552nh0170
Test spreadsheet: martin_wong_display_test.csv
Additional context
I've also noticed some odd behavior in how non-Martin Wong records with line breaks display on purl (abstracts in Hydrus objects).
1) https://purl.stanford.edu/bb782yf9388 has line breaks, but also displays escaped characters. The cocina has line breaks as "\r\n".
2) https://purl.stanford.edu/bb490zh2544 has line breaks (visible when resizing window). The cocina has line breaks as "\n" (plus one in a note as "\r\n")
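Since the records above mix "\r\n" and "\n", one possible approach (a sketch only, not something any of the applications involved is known to do) is to normalize line endings to a single linefeed before display logic runs:

```python
def normalize_line_endings(value):
    """Collapse Windows (CRLF) and bare CR line endings into a single LF."""
    return value.replace("\r\n", "\n").replace("\r", "\n")


print(repr(normalize_line_endings("Hydrus abstract\r\nsecond paragraph")))
```

Replacing "\r\n" before "\r" matters: doing it the other way around would turn each CRLF into two linefeeds.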
https://github.com/sul-dlss/purl/pull/594 will fix the display of line breaks for bb782yf9388. Unless I'm missing something, that's also the only record I see that encoded intentional line breaks with the expected encoded line break character (although bb782yf9388 seems to now use the decimal encoding, which I guess ought to be fine too?)
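On the question of whether the decimal encoding is fine: the thread text does not preserve the exact character references, so this sketch assumes they are `&#xA;` (hexadecimal) and `&#10;` (decimal). Under that assumption, a standard XML parser resolves both to the same linefeed character:

```python
import xml.etree.ElementTree as ET

# Assumed forms of the encoded line break; both are XML character references.
hex_text = ET.fromstring("<abstract>one&#xA;two</abstract>").text
dec_text = ET.fromstring("<abstract>one&#10;two</abstract>").text

print(repr(hex_text), repr(dec_text), hex_text == dec_text)
```

Any difference in display would therefore have to come from handling before or after XML parsing, not from the choice of encoding.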
purl v4.3.2, which is now live, does seem to have fixed bb782yf9388.
@cbeer @thatbudakguy @arcadiafalcone
AFAICT, this test object in stage (which has snippets of MW data) uses the same newline approach for both abstracts and notes:
$ curl https://sul-purl-stage.stanford.edu/kd791zq6661.xml
XML snippet:
<?xml version="1.0" encoding="UTF-8"?>
<publicObject id="druid:kd791zq6661" published="2022-07-06T20:07:08Z" publishVersion="cocina-models/0.82.0">
<mods xmlns="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:xlink="http://www.w3.org/1999/xlink" version="3.7" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-7.xsd">
<titleInfo>
<title>Testing italics and newlines again</title>
</titleInfo>
<abstract>I am in a <![CDATA[<cite>]]>CSV desc upload<![CDATA[</cite>]]> file! Breaks surround me.
OK, now we shall test <![CDATA[<i>]]>ITALICS<![CDATA[</i>]]></abstract>
<note displayLabel="Exhibition history" type="exhibitions"><![CDATA[<i>]]>Situations<![CDATA[</i>]]> (group exhibition) Bess Cutler Gallery, New York, January-February 11, 1984;
<![CDATA[<i>]]>Aspects of the City<![CDATA[</i>]]> (group exhibition) The Metropolitan Museum of Art, New York, July 31-September 30, 1984;
<![CDATA[<i>]]>Nueva Pintura Narrativa: Coleccion del Metropolitan Museum of Art, Nueva York<![CDATA[</i>]]> (group exhibition) Museo Rufino Tamayo, Mexico City, November-December, 1984;
<![CDATA[<i>]]>The Artist Celebrates New York: Selected Paintings from The Metropolitan Museum of Art<![CDATA[</i>]]> (group exhibition) Bronx Museum of the Arts, Bronx, February 2-March 24, 1985;
<![CDATA[<i>]]>The Artist Celebrates New York: Selected Paintings from The Metropolitan Museum of Art<![CDATA[</i>]]> (group exhibition) Long Island University, Brooklyn, April 17- June 2, 1985;
<![CDATA[<i>]]>The Artist Celebrates New York: Selected Paintings from The Metropolitan Museum of Art<![CDATA[</i>]]> (group exhibition) Jamaica Arts Center, Queens, June 15-July 27, 1985;
<![CDATA[<i>]]>The Artist Celebrates New York: Selected Paintings from The Metropolitan Museum of Art<![CDATA[</i>]]> (group exhibition) City College of New York, October 25- December 13,1985;
<![CDATA[<i>]]>The 1980s: A New Generation, American Painters and Sculptors<![CDATA[</i>]]> (group exhibition) The Metropolitan Museum of Art, New York, April 13-July 31, 1988, brochure no. 48;
<![CDATA[<i>]]>Sweet Oblivion: The Urban Landscape of Martin Wong,<![CDATA[</i>]]> Illinois State University Galleries, Normal, January 13-February 22, 1998;
<![CDATA[<i>]]>East Village USA<![CDATA[</i>]]> (group exhibition) New Museum of Contemporary Art, New York, December 9, 2004-March 19,2005;
<![CDATA[<i>]]>Martin Wong: Human Instamatic,<![CDATA[</i>]]> Bronx Museum of the Arts, November 4, 2015-March 13, 2016;
<![CDATA[<i>]]>Delirious: Art at the Limits of Reason, 1950-1980<![CDATA[</i>]]> (group exhibition) The Metropolitan Museum of Art, The Met Breuer, New York, September 13, 2017-January 14, 2018
</note>
<location>
<url usage="primary display">https://sul-purl-stage.stanford.edu/kd791zq6661</url>
</location>
</mods>
</publicObject>
And the purl UI seems to be treating them differently:
It appears that we're consistently shipping this data?
We were expecting encoded line breaks in the XML metadata to avoid confusion about significant and insignificant whitespace. However, with Cocina metadata, all whitespace is treated as significant, so the produced HTML can't make that distinction.
Given that, we intend to take an incremental approach (in #120) to apply formatting to abstract and note fields only, until we can do some due diligence to make sure the additional formatting doesn't show up in surprising contexts.
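A rough sketch of what such an incremental approach might look like, in Python rather than purl's own code. The field names come from the discussion; the function, the set of formatted fields, and the `<br>` output are illustrative assumptions. For simplicity it escapes all markup, so it does not cover the italics handling seen in the test object.

```python
from html import escape

# Only these fields get line break formatting in this sketch.
FORMATTED_FIELDS = {"abstract", "note"}


def render_value(field, value):
    """Escape a metadata value; turn line breaks into <br> for chosen fields."""
    escaped = escape(value)
    if field in FORMATTED_FIELDS:
        return escaped.replace("\r\n", "\n").replace("\n", "<br>\n")
    return escaped


print(render_value("abstract", "Breaks surround me.\nOK, now we shall test"))
print(render_value("title", "Testing italics\nand newlines again"))
```

Limiting the set of formatted fields keeps line breaks in other fields from appearing in unexpected contexts while the wider behavior is reviewed.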
Desired behavior: The user enters the encoded line break character as part of a value in a replayable spreadsheet or MODS file that is uploaded to Argo, or by editing the datastream in Argo directly. This may appear in the <note> or <abstract> element, possibly nested under <relatedItem>. The display in purl, SearchWorks, and Spotlight (and eventually the data catalog) interprets the character as a line break. In Spotlight, the line break should appear both on the item show page and in the "More details" view. See below for update.