ualbertalib / HydraNorth

This repo is deprecated. Succeeded by https://github.com/ualbertalib/jupiter. This codebase was an IR (institutional repository) built on Samvera/Sufia.

escaped ampersand in metadata #658

Closed: johnhuck closed 9 years ago

johnhuck commented 9 years ago

https://hydranorth.library.ualberta.ca/catalog?f%5Bsubject_sim%5D%5B%5D=Rites+%26amp%3B+ceremonies

johnhuck commented 9 years ago

Example: https://hydranorth.library.ualberta.ca/files/x346d4165#.Vg8IWipViko

pbinkley commented 9 years ago

The FOXML used for the migration has <dcterms:subject>Rites &amp; ceremonies</dcterms:subject>. But encoded ampersands in other fields are being imported correctly, e.g. the dcterms:title in https://newport.library.ualberta.ca/files/gb19f580q#.Vg8Mpd9zjJ8, which comes from <dcterms:title>“Nowhere to Turn, Nowhere to Go”: Library &amp; Information Services for Sexual &amp; Gender (LGBTQ) Minorities</dcterms:title>. But here's one where a committee member listing had an ampersand that was not decoded: https://newport.library.ualberta.ca/files/4m90dv490#.Vg8ORN9zjJ8 . So perhaps the problem is limited to multivalue fields?

pbinkley commented 9 years ago

Rails should be taking care of this for us:

irb(main):001:0> xml = "<a>foo&amp;bar</a>"
=> "<a>foo&amp;bar</a>"
irb(main):002:0> dom = Nokogiri.XML(xml)
=> #<Nokogiri::XML::Document:0x529b774 name="document" children=[#<Nokogiri::XML::Element:0x529b3f0 name="a" children=[#<Nokogiri::XML::Text:0x529b1d4 "foo&bar">]>]>
irb(main):026:0> t = dom.xpath("a/text()",NS)
=> [#<Nokogiri::XML::Text:0x529b1d4 "foo&bar">]

but

irb(main):027:0> t = dom.xpath("a/text()",NS).map(&:to_s)
=> ["foo&amp;bar"]

This explains why single-value fields are unaffected. Calling to_s serializes each text node back to XML, which re-escapes the ampersand, so we need to use Nokogiri's text method instead:

irb(main):028:0> t = dom.xpath("a/text()",NS).map(&:text)
=> ["foo&bar"]

Testing that now.