lutaml / expressir

Ruby parser for the ISO EXPRESS language

Ensure all text is processed as UTF-8 #44

Closed opoudjis closed 3 years ago

opoudjis commented 3 years ago

I am assuming this is an issue with expressir, but from my distant vantage point in Metanorma, it is hard for me to tell.

The document at https://github.com/metanorma/annotated-express/blob/master/data/resources/action_schema/action_schema.exp is parsed by expressir, and the parse result is then passed on by lutaml to Metanorma.

Metanorma assumes all files it is processing are in UTF-8.

Lutaml, I am assured by @w00lf, processes all its files in UTF-8.

The action_schema.exp file contains the following remark line:

This definition includes the activity’s objectives and effects.

By the time this gets to Metanorma, it is:

This definition includes the activity\xE2\x80\x99s objectives and effects

That is, these are the raw UTF-8 bytes of the smart apostrophe, but the string is being treated as 8-bit ASCII (Ruby's ASCII-8BIT) rather than UTF-8, so Metanorma cannot read it:

incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)
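
For reference, the failure can be reproduced in plain Ruby without any of the stack involved. This is a minimal sketch, assuming the remark reaches Metanorma as a binary-tagged string; the variable names are illustrative and none of this is expressir or Metanorma code.

```ruby
# The remark's bytes, tagged as ASCII-8BIT (binary) via String#b.
# Assumption: this mimics what the parser output looked like before the fix.
remark = "This definition includes the activity\xE2\x80\x99s objectives and effects.".b

# A regexp containing a non-ASCII character is fixed to UTF-8.
pattern = /\u2019/

error = nil
begin
  remark =~ pattern
rescue Encoding::CompatibilityError => e
  # Ruby refuses to match a UTF-8 regexp against a binary string
  # that contains non-ASCII bytes.
  error = e
end

puts error.class
puts error.message
```

A binary string that contains only ASCII bytes would match without error, which is likely why the problem only shows up on remarks with characters such as the smart apostrophe.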

expressir, as with the rest of our stack, must ensure that all files are processed as UTF-8, and that all output is in UTF-8.

ronaldtse commented 3 years ago

Agreed. I think this can be easily fixed by reading the file as UTF-8 and applying “normalize_unicode” to ensure the characters are properly normalized.
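
A rough sketch of that suggestion, assuming “normalize_unicode” refers to Ruby's core `String#unicode_normalize`. The helper name `read_utf8_normalized` is hypothetical and is not part of expressir; the choice of NFC as the normalization form is also an assumption.

```ruby
require "tempfile"

# Hypothetical helper (not expressir API): read a file as UTF-8 and
# return its text in NFC-normalized form.
def read_utf8_normalized(path)
  text = File.read(path, encoding: "UTF-8")
  # Fail early on byte sequences that are not valid UTF-8.
  raise ArgumentError, "#{path} is not valid UTF-8" unless text.valid_encoding?

  text.unicode_normalize(:nfc)
end

# Usage: write the raw bytes from the remark to a temporary file,
# then read them back as UTF-8 text.
Tempfile.create(["remark", ".exp"]) do |file|
  file.binmode
  file.write("the activity\xE2\x80\x99s objectives".b)
  file.flush

  text = read_utf8_normalized(file.path)
  puts text.encoding # UTF-8
end
```

Passing `encoding: "UTF-8"` explicitly avoids depending on `Encoding.default_external`, which varies with the environment's locale settings.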

zakjan commented 3 years ago

I added .force_encoding('UTF-8') to the string output from the parser (string literals and remarks) in #46 and updated the tests. Does it help?
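
To illustrate what `String#force_encoding` does here, the following sketch relabels a binary string as UTF-8. It does not reproduce the code in #46; the sample string and variable names are assumptions for demonstration only.

```ruby
# Assumed parser output before the fix: correct bytes, binary encoding tag.
raw = "This definition includes the activity\xE2\x80\x99s objectives and effects.".b

# force_encoding mutates its receiver, so work on a copy.
fixed = raw.dup.force_encoding("UTF-8")

puts raw.encoding           # ASCII-8BIT on older Rubies; the label differs on newer ones
puts fixed.encoding         # UTF-8
puts fixed.valid_encoding?  # true
puts fixed.bytes == raw.bytes # true: only the tag changed, not the bytes
puts fixed.match?(/\u2019/) # true: a UTF-8 regexp can now match
```

Because `force_encoding` only changes the encoding tag and performs no transcoding, it is appropriate only when the bytes are already UTF-8, which is the situation described in this issue.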

ronaldtse commented 3 years ago

@opoudjis can you help confirm the fix? Thanks!

zakjan commented 3 years ago

I think @opoudjis is very far from this library, so we probably need to merge and release before he can verify it. Merging.

opoudjis commented 3 years ago

I've confirmed it. I am far from the library, but lutaml passes the text through, and it's no longer crashing when I restore the smart apostrophe in the Express source.

zakjan commented 3 years ago

Interesting, it's not released yet :)

ronaldtse commented 3 years ago

@zakjan I believe @opoudjis is using master 😉

zakjan commented 3 years ago

Ok :)