Closed jtdaugherty closed 1 year ago
@cbay I investigated this a bit and what I found is that the first JSON you posted is UTF-8 encoded, which is what Matterhorn will always send to the notification script regardless of locale. (That isn't documented, but I'll fix that.) You can see this for yourself by pasting your first JSON example into the "UTF-8-decoded" field at https://mothereff.in/utf-8
The second text you provided is the Unicode after decoding, and that matches what the UTF-8 decoder will produce.
What I would recommend is using a notification handler that is Unicode-aware, i.e., one that can take UTF-8 encoded stdin data and decode it as UTF-8 to get Unicode. The built-in shell script example we provide probably won't be adequate for that. A Python implementation would be able to handle that very well, though.
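A minimal sketch of such a Unicode-aware handler in Python, assuming the version 2 payload is a single JSON object with the `from`, `mention`, `message`, and `version` fields seen in this thread. The names `parse_notification` and `format_summary` are hypothetical, and the final `print` stands in for whatever notification command is actually used.

```python
import json


def parse_notification(raw: bytes) -> dict:
    """Decode the raw bytes explicitly as UTF-8, then parse the JSON payload."""
    return json.loads(raw.decode("utf-8"))


def format_summary(payload: dict) -> str:
    """Build a one-line summary from the fields seen in this thread."""
    prefix = "[mention] " if payload.get("mention") else ""
    return f"{prefix}{payload['from']}: {payload['message']}"


# Demo with a correctly encoded payload. In a real handler, pass
# sys.stdin.buffer.read() instead of this literal, so that decoding
# does not depend on the locale.
sample = '{"from":"foo","mention":false,"message":"ça va ?","version":2}'.encode("utf-8")
print(format_summary(parse_notification(sample)))
```

Reading bytes and decoding them explicitly keeps the handler independent of `LANG`. It does not by itself repair input that was encoded twice before it reached the script.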
Thanks for having a look. However, I think you've misdiagnosed the issue. Let me try to explain it in more detail.
First, here's my simple notify script to capture what Matterhorn really sends:
#!/bin/sh
cat > /tmp/message
And here's what was captured:
$ cat /tmp/message
{"from":"foo","mention":false,"message":"Ã§a va ?","version":2}
You're absolutely right, that file is UTF-8 encoded:
$ chardetect /tmp/message
/tmp/message: utf-8 with confidence 0.7525
But... that's the whole point. Let's dive into that file by showing the bytes:
$ hexdump -C /tmp/message
00000000 7b 22 66 72 6f 6d 22 3a 22 66 6f 6f 22 2c 22 6d |{"from":"foo","m|
00000010 65 6e 74 69 6f 6e 22 3a 66 61 6c 73 65 2c 22 6d |ention":false,"m|
00000020 65 73 73 61 67 65 22 3a 22 c3 83 c2 a7 61 20 76 |essage":"....a v|
00000030 61 20 3f 22 2c 22 76 65 72 73 69 6f 6e 22 3a 32 |a ?","version":2|
00000040 7d 0a |}.|
00000042
As you can see, Ã§ is encoded as c3 83 c2 a7. Those are indeed two UTF-8 characters: c3 83 is Ã and c2 a7 is §.
But the original message, as shown in Matterhorn, doesn't contain Ã§, it contains ç, which in UTF-8 is encoded as c3 a7 (note how that's the first byte from Ã and the last byte from §).
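The byte relationship described here can be checked in a few lines of Python. This only illustrates the observation from the hexdump; it makes no claim about how Matterhorn produces the bytes.

```python
expected = "ç".encode("utf-8")        # what the message should contain
received = bytes.fromhex("c383c2a7")  # the four bytes shown in the hexdump

print(expected.hex())            # c3a7
print(received.decode("utf-8"))  # Ã§

# The expected bytes are the first byte of "Ã" followed by the last byte of "§".
a_tilde = "Ã".encode("utf-8")  # c3 83
section = "§".encode("utf-8")  # c2 a7
assert expected == a_tilde[:1] + section[-1:]
```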
That's a "common" issue of UTF-8 being mixed with ISO-8859-1 (latin1) somehow. You can read more details here and there, for instance.
Note that I have double-checked and, as far as I know, I have nothing set to ISO-8859-1 on my system, only UTF-8. I'm French and use non-ASCII characters all the time, and no other tool has those symptoms.
Thanks again!
Hello,
I use notification version 2 and a notify script very similar to the one available in the repo. There is indeed a string encoding issue in Matterhorn's output.
If I test the notification script from the command line, there is no issue:
# ok => characters are well written in notification bubble
echo '{"version": 2, "from": "a-test", "message": "ça va ?", "mention": false}' | mm-notify-v2
A similar message sent through Matterhorn is displayed as if its UTF-8 bytes had been read as ISO-8859-1. We can simulate the bug on the command line using iconv:
# ko => we tell iconv to read utf8 bytes as if it was iso8859 encoding
echo '{"version": 2, "from": "a-test", "message": "ça va ?", "mention": false}' | iconv -f iso8859-1 | mm-notify-v2
Possible Workaround
At the moment, I can work around this bug by using iconv to convert the message from UTF-8 to ISO-8859-1, which undoes the extra encoding step and gives back the original characters.
message=$(echo "$json" | jq -Mr .message | iconv -f utf8 -t iso8859-1)
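For a handler written in Python, the same workaround might look like the sketch below. `undo_double_encoding` is a hypothetical helper name; the sketch assumes the message has already been decoded once from UTF-8 into a `str`.

```python
def undo_double_encoding(text: str) -> str:
    """Reverse one extra UTF-8 encoding step; return text unchanged if it does not apply."""
    try:
        # Same idea as `iconv -f utf8 -t iso8859-1`: turn each character
        # back into a single byte, then decode those bytes as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Characters outside Latin-1, or bytes that are not valid UTF-8:
        # the text was probably not double-encoded, so leave it alone.
        return text


print(undo_double_encoding("Ã§a va ?"))
```

The fallback matters once the underlying bug is fixed: a correctly encoded "ça va ?" fails the UTF-8 decode and is returned as is. Text that legitimately contains a sequence such as "Ã§" would still be altered, so this is a stopgap rather than a general solution.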
Hope it helps some :wink:
I have the exact same issue. I use dunst as a notification daemon, and it correctly handles UTF-8 for all applications other than matterhorn. As others have pointed out, I think the issue comes from the output of matterhorn.
But the original message, as shown in Matterhorn, doesn't contain Ã§, it contains ç, which in UTF-8 is encoded as c3 a7 (note how that's the first byte from Ã and the last byte from §).
Looking into this again, it appears that the problem is that somewhere along the line, two UTF-8 encodings are happening rather than one. I don't know how that has anything to do with latin1 per se. That is, encodeUtf8(encodeUtf8("ç")) = "Ã§". So I'm investigating to see how that might be taking place in Matterhorn's JSON message to the notification script.
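One way to see where latin1 enters the picture: applying a UTF-8 encoder to data that is already UTF-8 bytes means treating each byte as a code point between U+0000 and U+00FF, which is exactly what decoding as ISO-8859-1 does. A sketch in Python, where `encode_utf8_twice` is a hypothetical helper that models the symptom and is not Matterhorn's actual code:

```python
def encode_utf8_twice(text: str) -> bytes:
    """Model a double encoding: bytes from the first pass are re-read as code points."""
    once = text.encode("utf-8")
    # Map each byte 0x00-0xFF to the code point with the same value...
    as_code_points = "".join(chr(b) for b in once)
    # ...and encode those code points as UTF-8 a second time.
    return as_code_points.encode("utf-8")


twice = encode_utf8_twice("ç")
print(twice.hex())  # c383c2a7

# The byte-to-code-point mapping is the definition of an ISO-8859-1 decode.
once = "ç".encode("utf-8")
assert "".join(chr(b) for b in once) == once.decode("latin-1")
```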
The problem did indeed turn out to be a double encoding. I've now fixed it in develop and it will go out in the next release. Thanks, everyone, for your help and investigations - it helped me track this down!
Originally reported elsewhere by @cbay:
Here's what the notification script receives:
You can see that the encoding is wrong, it should be:
My LANG is set to en_US.UTF-8, which I do have:
I have also tried setting LANG=C.UTF-8, no difference.