Notification script input encoding issue

jtdaugherty commented 2 years ago

Originally reported elsewhere by @cbay:

Here's what the notification script receives:

{"from":"foo","mention":false,"message":"Ã§a va ?","version":2}

You can see that the encoding is wrong, it should be:

{"from":"foo","mention":false,"message":"ça va ?","version":2}

My LANG is set to en_US.UTF-8, which I do have:

$ localectl list-locales
C.UTF-8
en_GB.UTF-8
en_US.UTF-8
fr_FR.UTF-8

I have also tried setting LANG=C.UTF-8, no difference.

jtdaugherty commented 2 years ago

@cbay I investigated this a bit, and what I found is that the first JSON you posted is UTF-8 encoded, which is what Matterhorn always sends to the notification script, regardless of locale. (That isn't documented, but I'll fix that.) You can see this for yourself by pasting your first JSON example into the "UTF-8-decoded" field at https://mothereff.in/utf-8

The second text you provided is the Unicode after decoding, and that matches what the UTF-8 decoder will produce.

What I would recommend is using a notification handler that is Unicode-aware, i.e., one that can take UTF-8 encoded stdin data and decode it as UTF-8 to get Unicode. The built-in shell script example we provide probably won't be adequate for that. A Python implementation would be able to handle that very well, though.
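As a minimal sketch of what such a Unicode-aware handler could look like in Python: the field names `from` and `message` come from the JSON shown above, while the function name `format_notification` and the output format are hypothetical, chosen only for illustration.

```python
import json


def format_notification(raw: bytes) -> str:
    """Decode UTF-8 bytes, parse the JSON payload, and build a display line."""
    # Decode explicitly as UTF-8 so the result does not depend on the locale.
    payload = json.loads(raw.decode("utf-8"))
    sender = payload.get("from", "")
    message = payload.get("message", "")
    return f"{sender}: {message}"


# In an actual handler script, the bytes would come from standard input:
#   import sys
#   print(format_notification(sys.stdin.buffer.read()))
```

Reading from `sys.stdin.buffer` rather than `sys.stdin` keeps the data as raw bytes until the explicit decode step.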

cbay commented 2 years ago

Thanks for having a look. However, I think you've misdiagnosed the issue. Let me try to explain it with more details.

First, here's my simple notify script to capture what Matterhorn really sends:

#!/bin/sh
cat > /tmp/message

And here's what was captured:

$ cat /tmp/message
{"from":"foo","mention":false,"message":"Ã§a va ?","version":2}

You're absolutely right, that file is UTF-8 encoded:

$ chardetect /tmp/message
/tmp/message: utf-8 with confidence 0.7525

But... that's the whole point. Let's dive into that file by showing the bytes:

$ hexdump -C /tmp/message
00000000  7b 22 66 72 6f 6d 22 3a  22 66 6f 6f 22 2c 22 6d  |{"from":"foo","m|
00000010  65 6e 74 69 6f 6e 22 3a  66 61 6c 73 65 2c 22 6d  |ention":false,"m|
00000020  65 73 73 61 67 65 22 3a  22 c3 83 c2 a7 61 20 76  |essage":"....a v|
00000030  61 20 3f 22 2c 22 76 65  72 73 69 6f 6e 22 3a 32  |a ?","version":2|
00000040  7d 0a                                             |}.|
00000042

As you can see, Ã§ is encoded as c3 83 c2 a7. Those are indeed two UTF-8 encoded characters: c3 83 is Ã and c2 a7 is §.

But the original message, as shown in Matterhorn, doesn't contain Ã§, it contains ç, which in UTF-8 is encoded as c3 a7 (note how that's the first byte of Ã and the last byte of §).

That's a "common" issue of UTF-8 being mixed with ISO-8859-1 (latin1) somehow. You can read more details here and there, for instance.
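The byte-level reading of the hexdump can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are made up for the example.

```python
# The four bytes found in the hexdump where a single "ç" was expected.
captured = bytes.fromhex("c383c2a7")

# The UTF-8 encoding of the character that was actually typed (U+00E7).
expected = "\u00e7".encode("utf-8")

# Decoding the captured bytes as UTF-8 succeeds, but yields two characters.
decoded = captured.decode("utf-8")
code_points = [f"U+{ord(ch):04X}" for ch in decoded]

print(expected.hex(" "))   # bytes of the expected character
print(captured.hex(" "))   # bytes actually received
print(code_points)         # code points obtained from the received bytes
```

The captured bytes are valid UTF-8, which is why `chardetect` reports UTF-8, yet they decode to U+00C3 and U+00A7 rather than to the single code point U+00E7.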

Note that I have double-checked and, as far as I know, I have nothing set to ISO-8859-1 on my system, only UTF-8. I'm French and use non-ASCII characters all the time, and no other tool has those symptoms.

Thanks again!

kawas44 commented 2 years ago

Hello,

I use notification version 2 and a notify script very similar to the one available in the repo. There is indeed a string encoding issue in Matterhorn's output.

If I test the notification script from the command line, there is no issue:

# ok => characters are well written in notification bubble
echo '{"version": 2, "from": "a-test", "message": "ça va ?", "mention": false}' | mm-notify-v2

A similar message sent through Matterhorn is displayed as if its UTF-8 bytes had been read as ISO-8859-1. We can simulate the bug on the command line using iconv:

# ko => we tell iconv to read the UTF-8 bytes as if they were ISO-8859-1 encoded
echo '{"version": 2, "from": "a-test", "message": "ça va ?", "mention": false}' | iconv -f iso8859-1 | mm-notify-v2

Possible Workaround

At the moment, I can work around this bug by using iconv to decode the message as UTF-8 and re-encode it as ISO-8859-1, which brings back the original message bytes.

message=$(echo "$json" | jq -Mr .message | iconv -f utf8 -t iso8859-1)
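For a handler written in Python, the same round trip could be sketched as follows. The function name `repair_double_encoding` is hypothetical, and the fallback behaviour is an assumption about what a handler would want, not something taken from Matterhorn or the repo's scripts.

```python
def repair_double_encoding(text: str) -> str:
    """Undo one layer of UTF-8-read-as-Latin-1 mojibake, if present."""
    try:
        # Map each character back to a single byte, then decode as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not double-encoded (or cannot be repaired this way),
        # so it is returned unchanged.
        return text
```

Unlike the plain iconv pipeline, this version falls back to the input when the round trip fails, so a correctly encoded message containing non-ASCII characters would normally pass through unchanged. It remains a heuristic: a message that legitimately contains a sequence such as Ã§ would be altered.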

Hope it helps some :wink:

choucavalier commented 1 year ago

I have the exact same issue. I use dunst as a notification daemon, and it correctly handles UTF-8 for every application other than Matterhorn. As others have pointed out, I think the issue comes from Matterhorn's output.

jtdaugherty commented 1 year ago

But the original message, as shown in Matterhorn, doesn't contain Ã§, it contains ç, which in UTF-8 is encoded as c3 a7 (note how that's the first byte of Ã and the last byte of §).

Looking into this again, it appears that the problem is that somewhere along the line, two UTF-8 encodings are happening rather than one. I don't know how that has anything to do with latin1 per se. That is, encodeUtf8(encodeUtf8("ç")) = "Ã§". So I'm investigating to see how that might be taking place in Matterhorn's JSON message to the notification script.
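To illustrate how a double encoding relates to the latin1 observation, here is a small Python sketch. It is not Matterhorn's code (Matterhorn is written in Haskell); the function `double_encode` is hypothetical and only models the general mechanism, assuming the intermediate step treats each byte as one character.

```python
def double_encode(text: str) -> bytes:
    """Model a UTF-8 encoding applied twice by mistake."""
    once = text.encode("utf-8")
    # The misstep: every byte is treated as a character with the same
    # numeric value, turning the byte string back into text.
    as_text = "".join(chr(b) for b in once)
    # The second encoding then expands each byte >= 0x80 into two bytes.
    return as_text.encode("utf-8")


result = double_encode("\u00e7")
print(result.hex(" "))
```

Mapping each byte to the code point with the same value is exactly what decoding as ISO-8859-1 does, so "encoded as UTF-8 twice" and "UTF-8 bytes read as latin1" describe the same transformation.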

jtdaugherty commented 1 year ago

The problem did indeed turn out to be a double encoding. I've now fixed it in develop and it will go out in the next release. Thanks, everyone, for your help and investigations - it helped me track this down!

matterhorn-chat / matterhorn

Notification script input encoding issue #781