janvainer closed 4 years ago
Those are not "broken character sequences"; they are escape sequences. For example, "\u017e" corresponds to "ž" (Latin Small Letter Z with Caron, https://www.compart.com/en/unicode/U+017E).
The JSON RFC (https://www.ietf.org/rfc/rfc4627.txt) section 2.5 states:
Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A though F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".
So the output of the "JSON" option of aeneas is valid JSON. It is the code that consumes that JSON file that needs to accept it and, if you wish, unescape it into a Unicode code point, and from there into a rendered glyph like "ž".
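To illustrate the consuming side, here is a minimal sketch, assuming the consumer is written in Python and uses the standard json module. The key name "text" is made up for the example and is not the real layout of the aeneas output. Any conforming JSON parser should undo the escaping on its own:

```python
import json

# Escaped text as it would appear on disk. The raw string keeps the
# backslashes literal, so this is exactly what a JSON parser would read.
# The "text" key is hypothetical, chosen only for this example.
raw = r'{"text": "Kdy\u017e se to narodilo, bylo to jenom takov\u00e9 b\u00edl\u00e9"}'

# json.loads turns each \uXXXX escape back into the real character.
decoded = json.loads(raw)["text"]

# Compare against the expected Czech sentence instead of printing it,
# since printing depends on the console encoding.
matches = decoded == "Když se to narodilo, bylo to jenom takové bílé"
print(matches)
```

No separate unescape step is needed here: the parsed string already holds the characters ě, š, č, ř, ž and so on.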
That said, aeneas simply uses the default Python json module, so there is nothing I can do about their implementation choice: they decided to escape all non-ASCII characters by default.
If you really really want, you can change this line:
https://github.com/readbeyond/aeneas/blob/master/aeneas/syncmap/__init__.py#L273
from
json.dumps({"fragments": output_fragments}, indent=1, sort_keys=True)
to
json.dumps({"fragments": output_fragments}, indent=1, sort_keys=True, ensure_ascii=False)
that is, add the optional argument ensure_ascii=False.
Mind: depending on the encoding of your console, it might or might not yield the results you expect, or even error out!
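A small sketch of what the flag changes, assuming Python 3. The dictionary below is a made-up stand-in for output_fragments, not the real aeneas data structure:

```python
import json

# Hypothetical stand-in for the sync map data (not the real aeneas structure).
data = {"fragments": [{"id": "f000001", "lines": ["Když se to narodilo"]}]}

# Default behaviour: every non-ASCII character is written as a \uXXXX escape.
escaped = json.dumps(data, indent=1, sort_keys=True)

# With ensure_ascii=False the characters are kept as they are.
readable = json.dumps(data, indent=1, sort_keys=True, ensure_ascii=False)

# Both strings are valid JSON and parse back to the same data.
same = json.loads(escaped) == json.loads(readable)

# When writing the readable form to a file, pick the encoding explicitly
# rather than relying on the console or locale default.
payload = readable.encode("utf-8")
```

Encoding to UTF-8 explicitly (or opening the output file with encoding="utf-8") avoids the console-dependent behaviour mentioned above.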
Alright, thank you very much. I get it now :) We can close the issue.
I work with texts in Czech, which means characters such as ěščřž... I encode my source text file as UTF-8 and run the following.
Unfortunately, the map.json file is encoded as ASCII and contains broken character sequences. For example, "Kdy\u017e se to narodilo, bylo to jenom takov\u00e9 b\u00edl\u00e9" should be "Když se to narodilo, bylo to jenom takové bílé". How can I set the output encoding?