Closed EliseTemple closed 8 years ago
@EliseTemple This is a problem with the feed coming out of CPP's Omeka. Specifically with their relation field. The example record in question:
<record>
<header>
<identifier>oai:oai.cppdigitallibrary.org:4376</identifier>
<datestamp>2016-03-02T17:26:18Z</datestamp>
<setSpec>4</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Michelle the Choking Doll</dc:title>
<dc:creator>Jackson, Chevalier, 1865-1958</dc:creator>
<dc:subject>Laryngoscopy</dc:subject>
<dc:subject>Otolaryngology</dc:subject>
<dc:subject>Jackson, Chevalier, 1865-1958</dc:subject>
<dc:description>Pioneering otolaryngologist Chevalier Jackson, MD (1865–1958), used this doll, named “Michelle,” to demonstrate his non-surgical techniques for removing foreign objects from the throats of children. Jackson’s longtime French assistant Angele Piquenais sewed Michelle, who simulates a small patient with a child-sized trachea and esophagus. Jackson also once demonstrated an emergency tracheotomy on Michelle, an event documented on home movie film; her throat still shows the scar. Watch the film if you like.
Jackson was world-renowned for his skill in the rapid use of endoscopic instruments to remove inhaled and swallowed foreign bodies without anesthesia, which greatly reduced the risks of the procedure. He combined his technical skill with a bedside manner that could keep distressed young patients calm. Jackson developed many specialized instruments and techniques for removing swallowed or inhaled objects, and could extract safety pins, nails, broken glass, and other dangerous objects without injuring the patient. His advanced techniques also enabled him to perform surgery to repair damage, such as removing scar tissue from accidental swallowing of caustic materials.
Jackson wrote that his father’s advice to “educate the eye and the fingers” spurred him to “continuous effort” in refining and improving the techniques of laryngoscopy. As a professor at medical schools including the University of Pittsburgh, Jefferson Medical College, and Temple University, Jackson also sought to educate the eyes and fingers of many medical students. By one estimate, students he personally trained saved as many as half a million lives using his techniques.</dc:description>
<dc:publisher>Digitized by the Mütter Museum of The College of Physicians of Philadelphia</dc:publisher>
<dc:contributor>Judy Comes, Donor</dc:contributor>
<dc:type>StillImage</dc:type>
<dc:format>image/JPEG</dc:format>
<dc:identifier>2014.9.1</dc:identifier>
<dc:identifier>http://www.cppdigitallibrary.org/items/show/4376</dc:identifier>
<dc:identifier xsi:type="original">http://www.cppdigitallibrary.org/files/original/917dcb60699fa912c95370b36a9bf04b.jpg</dc:identifier>
<dc:identifier xsi:type="thumbnail">http://www.cppdigitallibrary.org/files/thumbnails/917dcb60699fa912c95370b36a9bf04b.jpg</dc:identifier>
<dc:relation>M&uuml;tter Museum</dc:relation>
<dc:relation>Memento M&uuml;tter</dc:relation>
<dc:rights>This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License</dc:rights>
</oai_dc:dc>
</metadata>
</record>
You can see that Mütter Museum is represented correctly in the publisher field, but the relation field has the "M&uuml;tter Museum" representation, which is essentially undecoded HTML.
Doing the decoding on our end would be possible, but certainly not ideal. There could be unintended consequences for other data. CPP cleaning their own data would be preferred.
Thank you for this.
What might the unintended consequences be? And how big of a fix is this from a programming perspective? I'm trying to assess which path would be the least disruptive.
They have used HTML links pretty extensively within a few of their fields. Asking them to take out these links is not just a huge endeavor (which is okay), but it would also take away functionality on their site.
Ah - never mind - I just realized what you said - we can't really fix it on our end at all.
No, we could possibly fix it. We could run a decode on their fields that have the HTML links. Actually, we'd have to run the decode twice, since the & is itself already encoded. That means adding a special subroutine just for CPP data. Down the path of 'special code per seed / provider' lies madness and unmaintainable code. And more to the point, special code to address the fact that they hacked around an issue instead of dealing with it the right way (updating templates) also sounds like a bad idea.
The possible side effect (and the reason I'd want to limit the decode to their collections) is that some data that is not intended to be decoded would be decoded. For instance, anything that starts with an ampersand and ends with a semicolon is suspect. That is unlikely, but I do not know the extent of our data.
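To make the double decode and its side effect concrete, here is a minimal sketch using Python's standard `html.unescape`. It is not our harvester code: the function name `decode_twice` and the sample strings are made up for illustration, and whether one or two passes are actually needed depends on whether the XML parser has already resolved the outer `&amp;` layer.

```python
import html


def decode_twice(value: str) -> str:
    """Undo two layers of entity encoding (hypothetical helper).

    Pass 1 turns the escaped ampersand (&amp;) back into "&".
    Pass 2 turns the remaining HTML entity (&uuml;) into "ü".
    """
    return html.unescape(html.unescape(value))


# What the relation value looks like when both layers are still present.
raw = "M&amp;uuml;tter Museum"
after_one_pass = html.unescape(raw)
after_two_passes = decode_twice(raw)

# The side effect: a value that is *supposed* to contain a literal
# entity reference gets decoded as well and changes meaning.
literal = "Write &amp;lt; to show a less-than sign"
damaged = decode_twice(literal)

print(after_one_pass)
print(after_two_passes)
print(damaged)
```

The last value is the kind of unintended change described above: the instruction text loses the entity it was trying to display.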
Thank you for this more extensive explanation. I was not aware of these intricacies, or that there was a better practice for them to follow when adding links to fields. I will try to relay this information.
Again, just to make clear: we could do it as a temporary measure, limited to a single field (relation), until they can fix the issue. It would not take long to implement, a few hours on our end. My hesitation is just about keeping the special cases under control.
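A rough sketch of what such a temporary, single-field measure could look like, assuming the record is parsed with Python's `xml.etree.ElementTree`. The provider key `"cpp"`, the `DECODE_FIELDS` allow-list, and `clean_record` are all hypothetical names, not part of our actual pipeline. Note that the XML parser already resolves the outer `&amp;`, so only one explicit `html.unescape` call is needed here.

```python
import html
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical allow-list: provider key -> Dublin Core fields to decode.
# Keeping it as data makes the special case easy to find and delete later.
DECODE_FIELDS = {"cpp": {"relation"}}


def clean_record(xml_text: str, provider: str) -> ET.Element:
    """Decode leftover HTML entities only in allow-listed fields."""
    root = ET.fromstring(xml_text)
    for field in DECODE_FIELDS.get(provider, ()):
        for element in root.iter(f"{{{DC}}}{field}"):
            if element.text:
                element.text = html.unescape(element.text)
    return root


# Trimmed-down sample record; the description is invented to show
# that fields outside the allow-list are left alone.
sample = (
    '<dc xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:relation>M&amp;uuml;tter Museum</dc:relation>"
    "<dc:description>Use &amp;lt; here</dc:description>"
    "</dc>"
)

cleaned = clean_record(sample, "cpp")
untouched = clean_record(sample, "some-other-provider")

print(cleaned.find(f"{{{DC}}}relation").text)
print(cleaned.find(f"{{{DC}}}description").text)
print(untouched.find(f"{{{DC}}}relation").text)
```

Scoping by provider and field keeps the blast radius small, but it is still exactly the kind of per-provider special case that gets hard to maintain if it multiplies.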
From Tristan @ CPP
I've solved the issue of the XML breaking. Turns out they released a patch for the plugin in February addressing just that! Now all the collections should be able to be harvested while maintaining the styling for our site.
I'm trying to figure out how to fix the diacritics showing up as HTML character codes in the item metadata fields. From some communication I've had with one of the Omeka dev folks on the forums, it seems as if it's a condition of the WYSIWYG editor in Omeka. I should be able to configure a global parameter to stop the conversion, but have yet to figure it out. I'll let you know if/when I do.
Diacritics are not coming through correctly for the College of Physicians. For example, “Mütter” becomes “Mü, Tter”. I found this in most collections, so I assume this is a system-wide issue.
Tristan from the College of Physicians believes that the diacritics only do this when there is an HTML link in the metadata.
Example (see in Description and Relation)
Our next crawl is in June. We want to have this collection cleaned up by then so it can go live in the DPLA.