peterwebster / henson

Master data store for the Hensley Henson Journals project, and issue tracker. The application code is kept elsewhere.

Encoding #89

Closed nomoregrapes closed 6 years ago

nomoregrapes commented 6 years ago

Capturing this for the record (almost fixed, then I'll close the ticket).

Special characters were displaying oddly. This is due to how different programs "encode" the files. This is like me verbally telling you the name of a local town but not knowing I'm using the local dialect. If you assume I encode in Southern you will write/display it as "Prudda"; if you know I encode in Northern, then you'll understand and write down "Prudhoe". Files don't actually say how they're encoded; a guess has to be made based on any unusual characters found.

The TSV files on OneDrive are encoded in windows-1252 (the default of Microsoft Excel); the ingest now assumes this and converts them to UTF-8 as it reads each file.
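For the record, the conversion can be done in one step when reading the file. This is a minimal Ruby sketch (the ingest is a Rails app), not the actual ingest code; the method name is made up:

```ruby
# Sketch: read a TSV saved by Excel (windows-1252) and hand back UTF-8.
# The "Windows-1252:UTF-8" encoding pair means: the file's bytes are
# windows-1252, transcode them to UTF-8 as they are read.
def read_tsv_as_utf8(path)
  File.read(path, encoding: "Windows-1252:UTF-8")
end
```

Anything the ingest does downstream then only ever sees UTF-8 strings.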

peterwebster commented 6 years ago

@nomoregrapes many thanks for getting onto this one. If the ingest encountered a file that was already encoded in UTF-8, what would then happen? I suppose I'm asking how tied we are to windows-1252 from now on.

nomoregrapes commented 6 years ago

I think encoding on the TSV annotation data is sorted. I can't seem to figure out the XML files though.

@peterwebster I think your scripts need to explicitly encode the XML files as UTF-8, by starting with the following lines:

```bash
#!/bin/bash
export LC_CTYPE=en_GB.UTF-8  # exported, so child programs also treat text as UTF-8
# all other code here onwards....
```

(source)

peterwebster commented 6 years ago

Hi @nomoregrapes: thanks for this. Two questions, in order of importance, because this could be a real headache in relation to the XML.

  1. Have you got any examples of visible encoding errors that are coming from the XML? I ask because the only ones I've seen were cases where entities were simply not defined correctly (the old e-acutes), which is a different issue.

  2. If there is more to the issue than that, then I can work on setting the scripts to include that line to force the output to UTF-8. However, I really don't know how to retrospectively deal with the XML we already have. Is there absolutely no way of dealing with the issue as part of the ingest?

peterwebster commented 6 years ago

PS @nomoregrapes is the issue arising because the files themselves don't declare what encoding they have used? And, if so, if you knew what they were in, would that help? I ask because I think there are various tools around for detecting this kind of property, through which the existing XML could be run, but I'll have to ask around a bit.

peterwebster commented 6 years ago

The last version of the XML for v15 before Katie and Hilary started to edit it looked like this. I don't really understand why some should be ASCII and some UTF-8 when they were made by the same script.

[screenshot: file listing showing a mix of ASCII and UTF-8 encodings]
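(Noting for the record: the mix isn't necessarily the script's fault. Detectors such as `file` report "ASCII" whenever every byte in the file is below 0x80, and only say "UTF-8" once some entry happens to contain a non-ASCII character; a pure-ASCII file is also perfectly valid UTF-8. An illustrative Ruby check, not project code:)

```ruby
# Sketch: why the same script can produce files labelled ASCII or UTF-8.
# ASCII is a strict subset of UTF-8, so the label depends entirely on
# whether any byte >= 0x80 ever appears in the output.
def looks_ascii?(bytes)
  bytes.ascii_only?
end
```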

peterwebster commented 6 years ago

@nomoregrapes and this is the same volume as it currently stands (though all the filenames have changed, and they're not quite in the same order). A similar sort of mix.

[screenshot: file listing for the current version, showing a similar mix of encodings]

peterwebster commented 6 years ago

See also the various possible options here: https://stackoverflow.com/questions/3710374/get-encoding-of-a-file-in-windows @nomoregrapes. Is there scope for building a check like this into the ingest process somehow, if it isn't too much work?

nomoregrapes commented 6 years ago

The main place I notice it is in tables with curly double quotes, but I think there were other things like certain dashes. 00-11-04 is an example. If my system is strict about reading it as UTF-8, then the bytes get read as `“` instead of `“`.

The encoding detectors aren't clever enough to figure out what encoding the bytes are in, because there aren't enough unusual characters, or probably because there's not a closing quote. I've tried a few different encodings, such as windows-1252, which works for reading the TSVs.
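What I tried boils down to something like this (a Ruby sketch, not the ingest code; the helper name is made up). UTF-8 has to be tried first, because windows-1252 assigns a character to almost every byte and so "validates" nearly anything:

```ruby
# Hypothetical helper: try candidate encodings until one validates.
# UTF-8 goes first: its multi-byte structure makes invalid input easy
# to spot, whereas almost any byte sequence is "valid" windows-1252.
def guess_encoding(bytes, candidates = ["UTF-8", "Windows-1252"])
  candidates.find { |enc| bytes.dup.force_encoding(enc).valid_encoding? }
end
```

The catch, as above, is that a file with too few unusual bytes validates under several encodings, so this can only ever be a guess.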

I've not had anything suggest it was ASCII-encoded, so I'll set the ingest system up to read it as ASCII and see if that works.

peterwebster commented 6 years ago

Hi @nomoregrapes just to note that there shouldn't be any curly double quotes in any case; the rule is that they should be straight.

@KPalmerHeathman @DurHHHI could you possibly take a look at volume 15, 00-11-04, and let me have a screenshot of how it looks in the GUI? To save you reading the whole ticket, we're looking for characters that display strangely.

KPalmerHeathman commented 6 years ago

Sorry, you seem to have both screens - probably a breach of GDPR or something....

peterwebster commented 6 years ago

Thanks @KPalmerHeathman : I've got what I needed from that, and have deleted the image (though I'm not sure we had to worry all that much.)

peterwebster commented 6 years ago

Have you guys heard the GDPR joke that was doing the rounds, BTW?

"Do you know a good GDPR consultant you could recommend?" "Yes." "Great, could I have their email address?" "No."

@KPalmerHeathman

peterwebster commented 6 years ago

In this particular case, the problem is curly quotes in the table @KPalmerHeathman @DurHHHI

[screenshot: curly quotes displaying oddly in the table]

I'm still on the lookout for other characters that appear oddly in the XML, rather than in the annotations

nomoregrapes commented 6 years ago

I think there might be a Rails bug that is limiting my ability to convert the encoding from Windows-1252 to UTF-8. If that bug is a problem, it's too deep in the system for me to fix.

Viewing the XML files that are encoded differently: if your program detects how the file has been encoded (as Windows-1252 or ASCII) then it will display the quotes correctly. Text editors and the command line aren't smart enough (they aren't XML readers) to see `<?xml version="1.0" encoding="utf-8"?>` within the file. They just read the bytes (0s and 1s) and perhaps go "oh, those bytes look like a Windows-1252 quote mark, I guess this is Windows-1252 text rather than UTF-8". Encoding = how text gets written as 0s and 1s.
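To make the guessing concrete, here's what happens to one curly quote under the two readings (a Ruby sketch; the three bytes shown are the UTF-8 encoding of `“`):

```ruby
# E2 80 9C are the UTF-8 bytes of a left curly quote.
bytes = "\xE2\x80\x9C".b

# Read as UTF-8: the three bytes form one character.
as_utf8 = bytes.dup.force_encoding("UTF-8")

# Read as windows-1252: each byte is its own character,
# which is the familiar mojibake.
as_1252 = bytes.dup.force_encoding("Windows-1252").encode("UTF-8")
```

The same bytes, two different readings: one gives `“`, the other gives `“`.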

Regardless of anything I (can) do, the best thing is if we tell all our programs explicitly to save/encode as UTF-8. For TextPad I think this is:

Configure Menu --> Preferences --> Document Classes --> Default --> Default encoding --> UTF-8

I'm not noticing as many encoding issues in volume 15. It might be they were last edited by a program that is set to UTF-8.

peterwebster commented 6 years ago

@DurHHHI some time this week, could you spend a few minutes clicking through some XML files in 14 and 15 in the interface, and list where you have characters displaying strangely in the text (not the annotations, that's a separate issue.) I need the filename and a description of which character is showing incorrectly.

peterwebster commented 6 years ago

@nomoregrapes I shall certainly try to set up all scripts and applications to save as UTF-8 from now on. With 14 and 15, I think I just need to figure out which characters are the problem and fix them directly in the XML.

peterwebster commented 6 years ago

@nomoregrapes I'm assigning this to you in case there is anything else you still have to do; if there isn't, pass it back to me.

peterwebster commented 6 years ago

Assigning this to @DurHHHI to pursue the comment three above this one.

See also continuation of another part of this at #113

peterwebster commented 6 years ago

To pursue this once 14 and 15 are ingested in w/c 26 June

peterwebster commented 6 years ago

While in and out of the text in the last weeks of August and then this week, I've not seen any character issues in the XML, so I'm closing this ticket; stray characters can just be fixed in the GUI as and when anyone sees them.