readbeyond / aeneas

aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)
http://www.readbeyond.it/aeneas/
GNU Affero General Public License v3.0

How to get utf-8 on output? #244

Closed janvainer closed 4 years ago

janvainer commented 4 years ago

I work with texts in Czech, which means characters such as ěščřž... I encode my source text file as utf-8 and run the following.

python -m aeneas.tools.execute_task audio.mp3 \
   text.txt \
   "task_language=cs|os_task_file_format=json|is_text_type=plain" \
   map.json

Unfortunately, the resulting map.json file is encoded as ASCII and contains what look like broken character sequences. For example, "Kdy\u017e se to narodilo, bylo to jenom takov\u00e9 b\u00edl\u00e9" should be "Když se to narodilo, bylo to jenom takové bílé". How can I set the output encoding?

pettarin commented 4 years ago

Those are not "broken character sequences". They are escape sequences: for example, "\u017e" corresponds to "ž" (Latin Small Letter Z with Caron, https://www.compart.com/en/unicode/U+017E ).

The JSON RFC (https://www.ietf.org/rfc/rfc4627.txt) section 2.5 states:

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A though F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".

So the output of the "JSON" option of aeneas is valid JSON. It is the code that consumes that JSON file that needs to accept it and, if you so wish, unescape each sequence into a Unicode code point, and from there into a rendered glyph like "ž".
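To illustrate the point about the consumer doing the unescaping, here is a minimal sketch in Python. It assumes the consumer uses the standard json module; the "text" key is a made-up name for this example and is not the actual structure of an aeneas sync map.

```python
import json

# The escaped string from the question, wrapped in a tiny JSON object.
# The key name "text" is hypothetical, chosen only for this illustration.
raw = r'{"text": "Kdy\u017e se to narodilo, bylo to jenom takov\u00e9 b\u00edl\u00e9"}'

# json.loads turns every \uXXXX escape back into the corresponding character.
decoded = json.loads(raw)

# Compare against the expected Czech sentence instead of printing it,
# so the check does not depend on the encoding of the console.
print(decoded["text"] == "Když se to narodilo, bylo to jenom takové bílé")
```

Any conforming JSON parser should behave the same way, so the escaped file carries exactly the same text as an unescaped one.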

That said, aeneas simply uses the default Python json module, so there is nothing I can do about its implementation choice: by default it escapes all non-ASCII characters.

pettarin commented 4 years ago

If you really really want, you can change this line:

https://github.com/readbeyond/aeneas/blob/master/aeneas/syncmap/__init__.py#L273

from

json.dumps({"fragments": output_fragments}, indent=1, sort_keys=True)

to

json.dumps({"fragments": output_fragments}, indent=1, sort_keys=True, ensure_ascii=False)

that is, add the optional argument ensure_ascii=False.

Mind: depending on the encoding of your console, it might or might not yield the results you expect, or even error out!
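As a rough sketch of what that argument changes, and of an alternative that leaves the aeneas source untouched, the snippet below compares the two json.dumps calls and then re-saves an existing file as unescaped UTF-8. It assumes Python 3; the payload and the helper name rewrite_as_utf8 are invented for this example and are not part of aeneas.

```python
import json

# Hypothetical payload, loosely shaped like a sync map; not real aeneas output.
payload = {"fragments": [{"lines": ["Když se to narodilo"]}]}

# Default behaviour: every non-ASCII character is written as a \uXXXX escape.
escaped = json.dumps(payload, indent=1, sort_keys=True)

# With ensure_ascii=False the characters are kept as they are.
unescaped = json.dumps(payload, indent=1, sort_keys=True, ensure_ascii=False)


def rewrite_as_utf8(src_path, dst_path):
    """Load a JSON file and save it again without \\uXXXX escapes (hypothetical helper)."""
    with open(src_path, "r", encoding="utf-8") as src:
        data = json.load(src)
    # Writing to a file opened with an explicit encoding avoids relying on
    # the console or locale encoding mentioned above.
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(data, dst, indent=1, sort_keys=True, ensure_ascii=False)
```

For instance, rewrite_as_utf8("map.json", "map.utf8.json") could be run after execute_task; both files decode to the same data, only the on-disk representation differs.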

janvainer commented 4 years ago

Alright, thank you very much. I get it now :) We can close the issue.