hidelab / genometranslationcommons

Materiel for the genometranslationcommons.org website.

non-ASCII text in ISA-Tab can become corrupted #14

Open drj11 opened 7 years ago

drj11 commented 7 years ago

For at least one of our ISA-Tab files, text from it is corrupted when displayed.

E.g. (2017-05-09): https://beta.genometranslationcommons.org/#/preview/e08ed47c-7a9a-4b55-9e06-e5b7e7afd91c

Example in the Study Summary display:

[screenshot: corrupt1]

Example in the Protocols display:

[screenshot: corrupt2]

drj11 commented 7 years ago

Culprits are evidently non-ASCII characters. So it's some sort of encoding issue.

drj11 commented 7 years ago

By inspecting the zip file, it looks like some of the text is encoded in Windows-1252. 0x96 (which Windows-1252 maps to U+2013 EN DASH) is used for the dash (it should be a Unicode U+002D HYPHEN-MINUS or U+2212 MINUS SIGN), and 0x91 and 0x92 (U+2018 and U+2019) are used for "smart quotes".

In the screenies above, the Unicode U+FFFD REPLACEMENT CHARACTER (question mark in diamond) appears in those cases.

However, some of the text is irreversibly corrupt in the ISA-Tab itself. Instead of -80 we see ?80, and in this case the question mark is a literal ASCII character (0x3F), so the original character cannot be recovered from the file.
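
Both failure modes can be reproduced in a few lines. This is a minimal Python sketch, not the site's actual code: the sample bytes are made up for illustration, and it assumes the display layer decodes the text as UTF-8 and substitutes U+FFFD for invalid bytes.

```python
# Illustrative sample only: a field as it might be stored in Windows-1252.
# 0x96 is the dash, 0x91 and 0x92 are the "smart quotes".
raw = b"stored at " + b"\x96" + b"80 C, " + b"\x91" + b"as is" + b"\x92"

# Decoding as UTF-8 with replacement reproduces the symptom:
# each stray byte becomes U+FFFD REPLACEMENT CHARACTER.
as_utf8 = raw.decode("utf-8", errors="replace")

# Decoding as Windows-1252 recovers the characters the author typed.
as_cp1252 = raw.decode("cp1252")

# If the dash was already written out as an ASCII question mark (0x3F),
# every decoding gives the same "?" and nothing can be recovered.
already_lost = b"?80"

print(ascii(as_utf8))
print(ascii(as_cp1252))
print(already_lost.decode("utf-8"), already_lost.decode("cp1252"))
```

The first two lines of output use `ascii()` so that the replacement characters and the recovered code points are visible as escapes regardless of the terminal's own encoding.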

drj11 commented 7 years ago

(From a quick look at the ISA-Tab documentation.) It seems that ISA-Tab does not declare the character encoding.

However, there is a strong recommendation that ISA-Tab files be in UTF-8 encoding (http://isa-specs.readthedocs.io/en/latest/isatab.html#format):

Files SHOULD be encoded using UTF-8.

drj11 commented 7 years ago

ISA-Tab files should be encoded in UTF-8.
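
For a file that turns out to be Windows-1252, as the one in this issue appears to be, re-encoding is one possible repair. The following is a minimal Python sketch under that assumption; the function name `to_utf8` is hypothetical, and text that was already replaced by a literal `?` stays lost.

```python
def to_utf8(data: bytes) -> bytes:
    """Return data as UTF-8 bytes.

    Assumption: anything that is not already valid UTF-8 is Windows-1252.
    """
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8 (this includes pure ASCII)
    except UnicodeDecodeError:
        # Raises UnicodeDecodeError for the few bytes that Windows-1252
        # leaves undefined (for example 0x81), which signals that the
        # assumption above is wrong for this file.
        return data.decode("cp1252").encode("utf-8")


fixed = to_utf8(b"stored at " + b"\x96" + b"80 C")
print(ascii(fixed.decode("utf-8")))
```

Checking for valid UTF-8 first avoids double-encoding files that are already correct.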

drj11 commented 7 years ago

Diagnosis:

If you have a command line, you can inspect the file encoding using unzip and file.

Example on ISA-Tab known to be okay (from SCC):

$ unzip -p isa_9905_733878.zip 'i_*.txt' | file -
/dev/stdin: ASCII text, with very long lines

Example on ISA-Tab that I modified to include UTF-8:

$ unzip -p isa-unicode.zip 'i_*.txt' | file -
/dev/stdin: UTF-8 Unicode text, with very long lines

Example in this issue, that displays incorrectly on GTC:

$ unzip -p john_archive_3_CmphI1W.zip 'i_*.txt' | file -
/dev/stdin: Non-ISO extended-ASCII text, with very long lines, with CRLF, LF line terminators
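
Where unzip and file are not available, roughly the same check can be sketched in Python. This only distinguishes ASCII, valid UTF-8, and everything else; it does not replicate the heuristics of file. The function names, the choice to match the pattern against the base name, and the in-memory demo archive are all assumptions made for illustration.

```python
import fnmatch
import io
import zipfile


def classify(data: bytes) -> str:
    """Very rough stand-in for file(1): ASCII, UTF-8, or neither."""
    if all(byte < 0x80 for byte in data):
        return "ASCII"
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        return "not UTF-8"


def check_zip(zip_source, pattern="i_*.txt"):
    """Classify every archive member whose base name matches pattern."""
    results = {}
    with zipfile.ZipFile(zip_source) as zf:
        for name in zf.namelist():
            base = name.rsplit("/", 1)[-1]
            if fnmatch.fnmatchcase(base, pattern):
                results[name] = classify(zf.read(name))
    return results


# Demo with a made-up archive built in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("i_ascii.txt", b"Study Title\tplain text\n")
    zf.writestr("i_utf8.txt", "stored at \u221280 C\n".encode("utf-8"))
    zf.writestr("i_cp1252.txt", b"stored at " + b"\x96" + b"80 C\n")
    zf.writestr("s_study.txt", b"not matched by the pattern\n")
buf.seek(0)

results = check_zip(buf)
for name, verdict in results.items():
    print(f"{name}: {verdict}")
```

To check a real archive, pass its path instead of the in-memory buffer, e.g. `check_zip("isa-unicode.zip")`.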

drj11 commented 7 years ago

It may help to start ISAcreator with file.encoding set to utf-8.

You will need to modify this command line, but the important bit is -Dfile.encoding=utf-8:

java -jar -Dfile.encoding=utf-8 /Applications/ISAcreator-1.7/ISAcreator.app/Contents/Resources/Java/ISAcreator.jar

(that suggestion was from https://groups.google.com/forum/#!topic/isaforum/03P91ZQ1mj0)