samvera-labs / fedora-migrate

Gem for migrating content to Fedora4

Possible UTF-8 conversion problems #23

Closed (awead closed 9 years ago)

awead commented 9 years ago

This item in Fedora3 has this description: https://scholarsphere.psu.edu/files/7d279232g#.VMfeyHDF_iE [screenshot: 7d279232g_prod]

When converted, it displays as: https://scholarsphere-qa.dlt.psu.edu/files/7d279232g#.VMfetXDF_iE [screenshot: 7d279232g_migrated]

awead commented 9 years ago

When viewing the descMetadata datastream via the Fedora3 web admin:

Objectives:\r\n•

When looking at it from console:

Objectives:\r\nâ\u0080¢

However, note that the contents appear the same when viewing their source and target from within FedoraMigrate::ObjectMover:

(byebug) source.datastreams["descMetadata"].content.split(/\n/)[5]
"<info:fedora/scholarsphere:7d279232g> <http://purl.org/dc/terms/description> \"Objectives:\\r\\nâ\u0080¢  Explain the role of a new genomic assay (Target Nowâ\u0084¢) in guiding oncology treatment plans.\\r\\nâ\u0080¢  Describe the Target Nowâ\u0084¢ assay.\\r\\nâ\u0080¢  Present a case study where Target Nowâ\u0084¢ was instrumental in the patientâ\u0080\u0099s treatment plan.\" ."
(byebug) mover.target.description.first
"Objectives:\r\nâ\u0080¢  Explain the role of a new genomic assay (Target Nowâ\u0084¢) in guiding oncology treatment plans.\r\nâ\u0080¢  Describe the Target Nowâ\u0084¢ assay.\r\nâ\u0080¢  Present a case study where Target Nowâ\u0084¢ was instrumental in the patientâ\u0080\u0099s treatment plan."

All of the above appears consistent, given that \u0080 is the Euro symbol according to: http://www.fileformat.info/info/unicode/char/0080/index.htm under "Java Data"
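
A quick way to sanity-check the console output, as a rough sketch only: the three characters shown as `â\u0080¢` have code points 226, 128 and 162, which are exactly the UTF-8 bytes of the bullet `•`. One caveat on the Euro symbol: in Unicode, U+0080 is a C1 control character; the Euro sign turns up because byte 0x80 maps to it in Windows-1252. The variable names below are illustrative and do not come from FedoraMigrate or Rubydora.

```ruby
# The bullet as it appears in the Fedora3 web admin, and its UTF-8 bytes.
bullet = "\u2022"
utf8_bytes = bullet.bytes

# What the console shows: three separate characters (â, U+0080, ¢).
console_view = "\u00E2\u0080\u00A2"
console_codepoints = console_view.codepoints

# Byte 0x80 read as Windows-1252 is the Euro sign (U+20AC).
euro = [0x80].pack("C").force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)

p utf8_bytes
p console_codepoints
p euro
```

If the two arrays match, each byte of the original character has been turned into a character of its own, which points at a single-byte encoding being applied somewhere along the way.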

awead commented 9 years ago

@mjgiarlo @jcoyne This has me stumped. The problem is the character data is coming out of Rubydora looking this way and FedoraMigrate appears to be faithfully replicating it... warts and all.

mjgiarlo commented 9 years ago

It looks like something may be forcing an ISO-8859-1 encoding onto the UTF-8 bytes and then encoding back to UTF-8. Here's the original string:

irb(main):036:0> original_string
=> "Objectives:\r\n•  Explain the role of a new genomic assay (Target Now™) in guiding oncology treatment plans.\r\n•  Describe the Target Now™ assay.\r\n•  Present a case study where Target Now™ was instrumental in the patient’s treatment plan."
irb(main):037:0> puts original_string
Objectives:
•  Explain the role of a new genomic assay (Target Now™) in guiding oncology treatment plans.
•  Describe the Target Now™ assay.
•  Present a case study where Target Now™ was instrumental in the patient’s treatment plan.
=> nil
irb(main):038:0> original_string.encoding
=> #<Encoding:UTF-8>
irb(main):039:0> original_string.bytes
=> [79, 98, 106, 101, 99, 116, 105, 118, 101, 115, 58, 13, 10, 226, 128, 162, 32, 32, 69, 120, 112, 108, 97, 105, 110, 32, 116, 104, 101, 32, 114, 111, 108, 101, 32, 111, 102, 32, 97, 32, 110, 101, 119, 32, 103, 101, 110, 111, 109, 105, 99, 32, 97, 115, 115, 97, 121, 32, 40, 84, 97, 114, 103, 101, 116, 32, 78, 111, 119, 226, 132, 162, 41, 32, 105, 110, 32, 103, 117, 105, 100, 105, 110, 103, 32, 111, 110, 99, 111, 108, 111, 103, 121, 32, 116, 114, 101, 97, 116, 109, 101, 110, 116, 32, 112, 108, 97, 110, 115, 46, 13, 10, 226, 128, 162, 32, 32, 68, 101, 115, 99, 114, 105, 98, 101, 32, 116, 104, 101, 32, 84, 97, 114, 103, 101, 116, 32, 78, 111, 119, 226, 132, 162, 32, 97, 115, 115, 97, 121, 46, 13, 10, 226, 128, 162, 32, 32, 80, 114, 101, 115, 101, 110, 116, 32, 97, 32, 99, 97, 115, 101, 32, 115, 116, 117, 100, 121, 32, 119, 104, 101, 114, 101, 32, 84, 97, 114, 103, 101, 116, 32, 78, 111, 119, 226, 132, 162, 32, 119, 97, 115, 32, 105, 110, 115, 116, 114, 117, 109, 101, 110, 116, 97, 108, 32, 105, 110, 32, 116, 104, 101, 32, 112, 97, 116, 105, 101, 110, 116, 226, 128, 153, 115, 32, 116, 114, 101, 97, 116, 109, 101, 110, 116, 32, 112, 108, 97, 110, 46]

And here's how things look when I force ISO-8859-1 and then encode as UTF-8:

irb(main):044:0> forced = original_string.force_encoding(Encoding::ISO8859_1).encode(Encoding::UTF_8)
=> "Objectives:\r\nâ\u0080¢  Explain the role of a new genomic assay (Target Nowâ\u0084¢) in guiding oncology treatment plans.\r\nâ\u0080¢  Describe the Target Nowâ\u0084¢ assay.\r\nâ\u0080¢  Present a case study where Target Nowâ\u0084¢ was instrumental in the patientâ\u0080\u0099s treatment plan."
irb(main):045:0> puts forced
Objectives:
•  Explain the role of a new genomic assay (Target Now™) in guiding oncology treatment plans.
•  Describe the Target Now™ assay.
•  Present a case study where Target Now™ was instrumental in the patient’s treatment plan.
=> nil
irb(main):046:0> forced.encoding
=> #<Encoding:UTF-8>
irb(main):047:0> forced.bytes
=> [79, 98, 106, 101, 99, 116, 105, 118, 101, 115, 58, 13, 10, 195, 162, 194, 128, 194, 162, 32, 32, 69, 120, 112, 108, 97, 105, 110, 32, 116, 104, 101, 32, 114, 111, 108, 101, 32, 111, 102, 32, 97, 32, 110, 101, 119, 32, 103, 101, 110, 111, 109, 105, 99, 32, 97, 115, 115, 97, 121, 32, 40, 84, 97, 114, 103, 101, 116, 32, 78, 111, 119, 195, 162, 194, 132, 194, 162, 41, 32, 105, 110, 32, 103, 117, 105, 100, 105, 110, 103, 32, 111, 110, 99, 111, 108, 111, 103, 121, 32, 116, 114, 101, 97, 116, 109, 101, 110, 116, 32, 112, 108, 97, 110, 115, 46, 13, 10, 195, 162, 194, 128, 194, 162, 32, 32, 68, 101, 115, 99, 114, 105, 98, 101, 32, 116, 104, 101, 32, 84, 97, 114, 103, 101, 116, 32, 78, 111, 119, 195, 162, 194, 132, 194, 162, 32, 97, 115, 115, 97, 121, 46, 13, 10, 195, 162, 194, 128, 194, 162, 32, 32, 80, 114, 101, 115, 101, 110, 116, 32, 97, 32, 99, 97, 115, 101, 32, 115, 116, 117, 100, 121, 32, 119, 104, 101, 114, 101, 32, 84, 97, 114, 103, 101, 116, 32, 78, 111, 119, 195, 162, 194, 132, 194, 162, 32, 119, 97, 115, 32, 105, 110, 115, 116, 114, 117, 109, 101, 110, 116, 97, 108, 32, 105, 110, 32, 116, 104, 101, 32, 112, 97, 116, 105, 101, 110, 116, 195, 162, 194, 128, 194, 153, 115, 32, 116, 114, 101, 97, 116, 109, 101, 110, 116, 32, 112, 108, 97, 110, 46]

This is what we're doing here, right? https://github.com/projecthydra-labs/fedora-migrate/blob/master/lib/fedora_migrate/rdf_datastream_mover.rb#L33
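
If that is indeed the cause, the round trip should be reversible. Here is a minimal sketch, assuming the only damage is "UTF-8 bytes relabeled as ISO-8859-1, then transcoded to UTF-8". The helper `undo_latin1_round_trip` is hypothetical and not part of fedora-migrate or Rubydora, and the sample string is a shortened version of the description above.

```ruby
original = "Objectives:\r\n\u2022  Describe the Target Now\u2122 assay."

# Reproduce the mangling. force_encoding relabels the receiver in place,
# so work on a copy to leave the original untouched.
mangled = original.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

# Hypothetical helper: map each character back to a single Latin-1 byte,
# then reinterpret those bytes as UTF-8. If the string contains characters
# outside Latin-1, or the result is not valid UTF-8, return it unchanged.
def undo_latin1_round_trip(str)
  candidate = str.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? candidate : str
rescue Encoding::UndefinedConversionError
  str
end

repaired = undo_latin1_round_trip(mangled)
p mangled.bytes[13, 6]
p repaired == original
```

Strings that never went through the round trip should pass through unchanged, since they either fail the Latin-1 conversion or do not produce valid UTF-8 afterwards. A legitimately Latin-1-range string whose bytes happen to form valid UTF-8 would be misread, so treat this as a diagnostic rather than a safe blanket fix.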

awead commented 9 years ago

@mjgiarlo yes, this is what @jcoyne did to correct for objects that were throwing RDF errors. See #9