psi-im / iris

XMPP network library
GNU Lesser General Public License v2.1
50 stars 25 forks source link

Message::toStanza damage Unicode chars above U+FFFF #42

Closed pali closed 8 years ago

pali commented 8 years ago

When I send jabber message which contains "😀" (U+1F600 GRINNING FACE) from Kopete then XMPP server immediately disconnect me.

I started debug it and problem looks to be in function Message::toStanza. I added small patch to provide debug info:

diff --git a/src/xmpp/xmpp-im/types.cpp b/src/xmpp/xmpp-im/types.cpp
index 4cbd0cc..4244e87 100644
--- a/src/xmpp/xmpp-im/types.cpp
+++ b/src/xmpp/xmpp-im/types.cpp
@@ -1500,11 +1500,18 @@ Stanza Message::toStanza(Stream *stream) const
    }
    for (it = d->body.constBegin(); it != d->body.constEnd(); ++it) {
        const QString &str = (*it);
+       qDebug() << "Message::toStanza" << "body:" << str;
        if(!str.isEmpty()) {
            QDomElement e = s.createTextElement(s.baseNS(), "body", str);
            if(!it.key().isEmpty())
                e.setAttributeNS(NS_XML, "xml:lang", it.key());
            s.appendChild(e);
+           QString buf;
+           QTextStream stream(&buf, QIODevice::WriteOnly);
+           e.save(stream, 2);
+           QString buf2;
+           for (int i=0; i<buf.size(); ++i) buf2 += QString("0x") + QString::number(buf[i].unicode(), 16) + QString(" ");
+           qDebug() << "Message::toStanza" << "xml:" << buf << "hex:" << buf2;
        }
    }

And with this patch I see debug log:

Message::toStanza body: "😀" 
Message::toStanza xml: "  <body xmlns="jabber:client">#xde00;</body>
" hex: "0x20 0x20 0x3c 0x62 0x6f 0x64 0x79 0x20 0x78 0x6d 0x6c 0x6e 0x73 0x3d 0x22 0x6a 0x61 0x62 0x62 0x65 0x72 0x3a 0x63 0x6c 0x69 0x65 0x6e 0x74 0x22 0x3e 0xd83d 0x26 0x23 0x78 0x64 0x65 0x30 0x30 0x3b 0x3c 0x2f 0x62 0x6f 0x64 0x79 0x3e 0xa "

Looks like that something (maybe s.createTextElement??) replace low surrogate part of U+1F600 to XML notation (&#de00;) (sequence 0x26 0x23 0x78 0x64 0x65 0x30 0x30 0x3b) but high surrogate part stay in wide string (0xd83d).

So for sure this is not problem in Kopete but in some libiris or Qt part.

Any idea why low surrogate part is converted to XML?

Btw &#de00; sequence is invalid in XML, w3c say that valid are just:

x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

I'm using Qt version 4.8.2 and libiris from commit 07353b5ff5317516b27daa9211f7e49cebed9817

pali commented 8 years ago

Fix to prevent disconnect: https://github.com/psi-im/iris/pull/43

pali commented 8 years ago

When I set stream.setCodec("UTF-16"); then I get debug log:

Message::toStanza body: "😀" 
Message::toStanza xml: "  <body xmlns="jabber:client">😀</body>
" hex: "0x20 0x20 0x3c 0x62 0x6f 0x64 0x79 0x20 0x78 0x6d 0x6c 0x6e 0x73 0x3d 0x22 0x6a 0x61 0x62 0x62 0x65 0x72 0x3a 0x63 0x6c 0x69 0x65 0x6e 0x74 0x22 0x3e 0xd83d 0xde00 0x3c 0x2f 0x62 0x6f 0x64 0x79 0x3e 0xa "

Which is correct now!

So looks like it is truth what I wrote in comment https://github.com/psi-im/iris/pull/43#issuecomment-239681381

pali commented 8 years ago

@Ri0n can you change whole QDom conversion to QString with codec UTF-16? QString is internally stored as UTF-16 so this should not break anything... And should fix this bug.

Ri0n commented 8 years ago

As far as I know internal UTF-16 is only in Windows, isn't it?

pali commented 8 years ago

QString's internal encoding is UTF-16. If you access character at some position (operator [] or method at()) then it returns object QChar. And that QChar represent 16bit character. So either full Unicode codepoint or high part of surrogate pair or low part of surrogate pair.

pali commented 8 years ago

Note that QString still represent Unicoded string, so when you want to use it, you first need to convert it to some 8bit encoding (e.g. UTF-8 or UTF-16 or whatever).

pali commented 8 years ago

Applying similar fix as in https://github.com/psi-im/iris/pull/44 to Kopete history plugin fixes also support for storing messages with Unicode characters above 0xFFFF (Kopete store chat history into XML by QtXML too).

Ri0n commented 8 years ago

So it's not reproducible with Qt5 but reproducible with 4.8.7 (tested with code at https://bugreports.qt.io/browse/QTBUG-25291). But I think your last workaround is good enough to apply it for both versions nevertheless.

pali commented 8 years ago

Look at documentation of QTextStream: https://doc.qt.io/qt-4.8/qtextstream.html

Internally, QTextStream uses a Unicode based buffer, and QTextCodec is used by QTextStream to automatically support different character sets. By default, QTextCodec::codecForLocale() is used for reading and writing, but you can also set the codec by calling setCodec().

It is really fragile as by default is can use "random" encoding. User change language in system and different encoding will be chosen by Qt! So I would suggest to manually set codec for all QTextStream to prevent such problems.