ucsdlib / damsmanager

DAMS Manager

Resolve encoding/display errors on 3 Catalhoyuk Digipres staging objects #250

Closed remerjohnson closed 5 years ago

remerjohnson commented 5 years ago

Descriptive summary

Despite a successful test ingest log (damslog-2099381761.txt) of 13 objects, only 10 objects are visible in the Catalhoyuk Digipres Test Collection.

@lsitu reported that:

I found there is an encoding issue with the special character ETX. There is an error com.hp.hpl.jena.shared.CannotEncodeCharacterException: cannot encode (char) ETX in context XML while retrieving the RDF XML.

The source OLR metadata: OLR_catalhoyuk_digital_preservation.xlsx

Expected behavior

All 13 objects should appear

Actual behavior

Only 10 objects appear

I am labeling this high priority as per @hjsyoo discussions. Thanks

hjsyoo commented 5 years ago

By any chance, would saving the XLSX file as CSV, then saving again as XLSX remove all encoded characters? We would then need to replace all instances of Catalhoyuk with the properly MS-formatted version of the word. Not that this would address the cause of the original problem, but is this even a possible quick fix?

lsitu commented 5 years ago

@hjsyoo / @remerjohnson I am not sure whether converting the XLSX file to CSV will fix the issue, but while testing and inspecting the original Excel source that Ryan attached, I did find the character sequence 1\3 in the source metadata. It appears in the Note:description column of the three records in question, in rows 159, 253, and 308. For example:

Northern wall of space 229. Plastered wall which had 2 posts attached to it, as shown in the plan, one about 1\3 of the way along from the w corner the other in the e corner. The wall and plaster of the eastern end were truncated by a late E-W burial. The plaster surfaces are at least 20mmm thick and thus represent many plastering episodes. Visible in some areas are traces of black paint, present in different layers of the plaster.

I think \3 is being read as the code for the control character ETX (END OF TEXT), which triggers the error com.hp.hpl.jena.shared.CannotEncodeCharacterException: cannot encode (char) ETX in context XML while retrieving the RDF/XML from the triplestore.
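To locate characters like this before ingest, a check along the following lines might help. It is only a sketch: the class and method names (XmlCharCheck, isXml10Char, findInvalid) are hypothetical and not part of damsmanager or Jena. It relies on the XML 1.0 rule that, below U+0020, only tab, line feed, and carriage return are legal characters, which is why ETX (U+0003) cannot be written into RDF/XML.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of damsmanager or Jena.
public class XmlCharCheck {

    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    public static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the char indexes of code points that XML 1.0 cannot represent. */
    public static List<Integer> findInvalid(String text) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXml10Char(cp)) {
                positions.add(i);
            }
            // Advance by one or two chars depending on the code point width.
            i += Character.charCount(cp);
        }
        return positions;
    }
}
```

Running findInvalid over each cell value of the Note:description column would flag the affected rows without having to read the spreadsheet by eye.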

So I suggest correcting 1\3 to 1/3 as a quick fix for now. Moving forward, would you like the Excel Import tool to ignore those control characters? Or should we just ignore or replace them while generating the RDF, to avoid the com.hp.hpl.jena.shared.CannotEncodeCharacterException error above?
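For discussion, the two options (ignore vs. replace) could look roughly like the sketch below. The class ControlCharSanitizer and its methods are hypothetical and do not exist in the Excel Import tool; where such a filter would be called (at import time or at RDF generation time) is left open here.

```java
// Hypothetical sketch, not existing damsmanager code.
public class ControlCharSanitizer {

    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every code point XML 1.0 cannot represent with the given string. */
    public static String replace(String text, String replacement) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacement);
            }
        });
        return out.toString();
    }

    /** "Ignore" option: drops the offending code points entirely. */
    public static String strip(String text) {
        return replace(text, "");
    }
}
```

Note that neither option restores the intended text: stripping turns the 1\3 value into "1 of the way" rather than "1/3 of the way", so logging the affected record alongside the filter would probably still be worthwhile.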

remerjohnson commented 5 years ago

@lsitu Ah, that's great, Longshou. I found some suspect line breaks, but those weren't the real issue, and I couldn't find any other suspects. I can see how a slash like that would not be kosher in Excel. I'll try that fix and report back on whether it resolves the issue.

I could see this happening again if someone were entering Windows-style file paths for some reason, but realistically I don't anticipate that. Still, it may be good to ignore such characters going forward. What do you think, @hjsyoo @mcritchlow?

hjsyoo commented 5 years ago

Thanks for catching that so quickly, @lsitu. I tentatively agree with Ryan that ignoring the chars going forward might be best. It's hard to anticipate every data file coming in, but ignoring sounds safer than replacing.

mcritchlow commented 5 years ago

Nice catch @lsitu! I'm not as familiar with the Jena parsing library as Longshou is, but I agree that ideally we would ignore or otherwise escape characters that can't be encoded.

remerjohnson commented 5 years ago

Going to close this ticket out for now since it's been resolved. Do we need another ticket for ignoring invalid characters like these? @lsitu @mcritchlow

lsitu commented 5 years ago

@remerjohnson I am fine either way: working on this ticket or opening a new one. Thanks.

lsitu commented 5 years ago

@remerjohnson Will you open a new ticket for handling those control characters, or reopen this ticket for me to work on? Thanks.