awead closed 9 years ago
When viewing the descMetadata datastream via the Fedora3 web admin:
Objectives:\r\n•
When looking at it from console:
Objectives:\r\nâ\u0080¢
However, note that the contents appear the same when viewing the `source` and `target` from within `FedoraMigrate::ObjectMover`:
(byebug) source.datastreams["descMetadata"].content.split(/\n/)[5]
"<info:fedora/scholarsphere:7d279232g> <http://purl.org/dc/terms/description> \"Objectives:\\r\\nâ\u0080¢ Explain the role of a new genomic assay (Target Nowâ\u0084¢) in guiding oncology treatment plans.\\r\\nâ\u0080¢ Describe the Target Nowâ\u0084¢ assay.\\r\\nâ\u0080¢ Present a case study where Target Nowâ\u0084¢ was instrumental in the patientâ\u0080\u0099s treatment plan.\" ."
(byebug) mover.target.description.first
"Objectives:\r\nâ\u0080¢ Explain the role of a new genomic assay (Target Nowâ\u0084¢) in guiding oncology treatment plans.\r\nâ\u0080¢ Describe the Target Nowâ\u0084¢ assay.\r\nâ\u0080¢ Present a case study where Target Nowâ\u0084¢ was instrumental in the patientâ\u0080\u0099s treatment plan."
All of the above appears consistent, given that `\u0080` is the Euro symbol according to:
http://www.fileformat.info/info/unicode/char/0080/index.htm
under "Java Data"
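As a rough check on where the `â\u0080¢` sequence comes from, here is a minimal sketch assuming the original character is the bullet U+2022 and that its UTF-8 bytes get read one byte per character as ISO-8859-1. Under that assumption the `\u0080` in the output is just byte 0x80 carried over as a code point (a control code point in Unicode; 0x80 is the Euro sign only in Windows-1252). The variable names are illustrative only:

```ruby
# Assumption: the source character is the bullet U+2022, stored as UTF-8.
bullet = "\u2022"

# Its UTF-8 encoding is three bytes.
utf8_bytes = bullet.bytes

# Relabel those bytes as ISO-8859-1 (one byte per character), then
# transcode to UTF-8, which is the suspected double-encoding step.
misread = bullet.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

# Each original byte is now its own code point.
codepoints = misread.codepoints.map { |cp| format("U+%04X", cp) }

p utf8_bytes   # [226, 128, 162]
p codepoints   # ["U+00E2", "U+0080", "U+00A2"]
```

U+00E2 is `â` and U+00A2 is `¢`, which matches the three-character pattern seen from the console.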
@mjgiarlo @jcoyne This has me stumped. The problem is the character data is coming out of Rubydora looking this way and FedoraMigrate appears to be faithfully replicating it... warts and all.
It looks like something may be forcing an ISO-8859-1 encoding and then encoding back to UTF-8. Here's the original string:
irb(main):036:0> original_string
=> "Objectives:\r\n• Explain the role of a new genomic assay (Target Now™) in guiding oncology treatment plans.\r\n• Describe the Target Now™ assay.\r\n• Present a case study where Target Now™ was instrumental in the patient’s treatment plan."
irb(main):037:0> puts original_string
Objectives:
• Explain the role of a new genomic assay (Target Now™) in guiding oncology treatment plans.
• Describe the Target Now™ assay.
• Present a case study where Target Now™ was instrumental in the patient’s treatment plan.
=> nil
irb(main):038:0> original_string.encoding
=> #<Encoding:UTF-8>
irb(main):039:0> original_string.bytes
=> [79, 98, 106, 101, 99, 116, 105, 118, 101, 115, 58, 13, 10, 226, 128, 162, 32, 32, 69, 120, 112, 108, 97, 105, 110, 32, 116, 104, 101, 32, 114, 111, 108, 101, 32, 111, 102, 32, 97, 32, 110, 101, 119, 32, 103, 101, 110, 111, 109, 105, 99, 32, 97, 115, 115, 97, 121, 32, 40, 84, 97, 114, 103, 101, 116, 32, 78, 111, 119, 226, 132, 162, 41, 32, 105, 110, 32, 103, 117, 105, 100, 105, 110, 103, 32, 111, 110, 99, 111, 108, 111, 103, 121, 32, 116, 114, 101, 97, 116, 109, 101, 110, 116, 32, 112, 108, 97, 110, 115, 46, 13, 10, 226, 128, 162, 32, 32, 68, 101, 115, 99, 114, 105, 98, 101, 32, 116, 104, 101, 32, 84, 97, 114, 103, 101, 116, 32, 78, 111, 119, 226, 132, 162, 32, 97, 115, 115, 97, 121, 46, 13, 10, 226, 128, 162, 32, 32, 80, 114, 101, 115, 101, 110, 116, 32, 97, 32, 99, 97, 115, 101, 32, 115, 116, 117, 100, 121, 32, 119, 104, 101, 114, 101, 32, 84, 97, 114, 103, 101, 116, 32, 78, 111, 119, 226, 132, 162, 32, 119, 97, 115, 32, 105, 110, 115, 116, 114, 117, 109, 101, 110, 116, 97, 108, 32, 105, 110, 32, 116, 104, 101, 32, 112, 97, 116, 105, 101, 110, 116, 226, 128, 153, 115, 32, 116, 114, 101, 97, 116, 109, 101, 110, 116, 32, 112, 108, 97, 110, 46]
And here's how things look when I force ISO-8859-1 and then encode as UTF-8:
irb(main):044:0> forced = original_string.force_encoding(Encoding::ISO8859_1).encode(Encoding::UTF_8)
=> "Objectives:\r\nâ\u0080¢ Explain the role of a new genomic assay (Target Nowâ\u0084¢) in guiding oncology treatment plans.\r\nâ\u0080¢ Describe the Target Nowâ\u0084¢ assay.\r\nâ\u0080¢ Present a case study where Target Nowâ\u0084¢ was instrumental in the patientâ\u0080\u0099s treatment plan."
irb(main):045:0> puts forced
Objectives:
⢠Explain the role of a new genomic assay (Target Nowâ¢) in guiding oncology treatment plans.
⢠Describe the Target Now⢠assay.
⢠Present a case study where Target Now⢠was instrumental in the patientâs treatment plan.
=> nil
irb(main):046:0> forced.encoding
=> #<Encoding:UTF-8>
irb(main):047:0> forced.bytes
=> [79, 98, 106, 101, 99, 116, 105, 118, 101, 115, 58, 13, 10, 195, 162, 194, 128, 194, 162, 32, 32, 69, 120, 112, 108, 97, 105, 110, 32, 116, 104, 101, 32, 114, 111, 108, 101, 32, 111, 102, 32, 97, 32, 110, 101, 119, 32, 103, 101, 110, 111, 109, 105, 99, 32, 97, 115, 115, 97, 121, 32, 40, 84, 97, 114, 103, 101, 116, 32, 78, 111, 119, 195, 162, 194, 132, 194, 162, 41, 32, 105, 110, 32, 103, 117, 105, 100, 105, 110, 103, 32, 111, 110, 99, 111, 108, 111, 103, 121, 32, 116, 114, 101, 97, 116, 109, 101, 110, 116, 32, 112, 108, 97, 110, 115, 46, 13, 10, 195, 162, 194, 128, 194, 162, 32, 32, 68, 101, 115, 99, 114, 105, 98, 101, 32, 116, 104, 101, 32, 84, 97, 114, 103, 101, 116, 32, 78, 111, 119, 195, 162, 194, 132, 194, 162, 32, 97, 115, 115, 97, 121, 46, 13, 10, 195, 162, 194, 128, 194, 162, 32, 32, 80, 114, 101, 115, 101, 110, 116, 32, 97, 32, 99, 97, 115, 101, 32, 115, 116, 117, 100, 121, 32, 119, 104, 101, 114, 101, 32, 84, 97, 114, 103, 101, 116, 32, 78, 111, 119, 195, 162, 194, 132, 194, 162, 32, 119, 97, 115, 32, 105, 110, 115, 116, 114, 117, 109, 101, 110, 116, 97, 108, 32, 105, 110, 32, 116, 104, 101, 32, 112, 97, 116, 105, 101, 110, 116, 195, 162, 194, 128, 194, 153, 115, 32, 116, 114, 101, 97, 116, 109, 101, 110, 116, 32, 112, 108, 97, 110, 46]
This is what we're doing here, right? https://github.com/projecthydra-labs/fedora-migrate/blob/master/lib/fedora_migrate/rdf_datastream_mover.rb#L33
@mjgiarlo yes, this is what @jcoyne did to correct for objects that were throwing RDF errors. See #9
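If the force-then-encode step really is the cause, the damage should be reversible by going the other way. The following is only a sketch under that assumption; `repair_mojibake` is a hypothetical helper, not part of FedoraMigrate or Rubydora, and it falls back to the input whenever the round trip does not produce valid UTF-8:

```ruby
# Hypothetical helper: tries to undo "UTF-8 bytes read as ISO-8859-1,
# then transcoded to UTF-8". Returns the input unchanged if it cannot.
def repair_mojibake(text)
  # Transcoding to ISO-8859-1 turns each code point back into one byte;
  # relabelling those bytes as UTF-8 recovers the original characters.
  repaired = text.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)
  repaired.valid_encoding? ? repaired : text
rescue Encoding::UndefinedConversionError
  # The text holds characters outside ISO-8859-1 (for example a real
  # trademark sign), so it was probably not double-encoded.
  text
end

# Simulate the damage on a short sample, then reverse it.
clean   = "Target Now\u2122"
mangled = clean.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts repair_mojibake(mangled) == clean
puts repair_mojibake(clean) == clean
```

One caveat: a string that legitimately contains a sequence such as `Ã©` would also be "repaired", so this is better suited to inspecting individual records than to a blanket pass over all migrated data.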
This item in Fedora3 has this description: https://scholarsphere.psu.edu/files/7d279232g#.VMfeyHDF_iE
When converted, it displays as:
https://scholarsphere-qa.dlt.psu.edu/files/7d279232g#.VMfetXDF_iE
![7d279232g_migrated](https://cloud.githubusercontent.com/assets/312085/5924787/89a2fde0-a62c-11e4-82c0-88e58436fb70.png)