psi-im / iris

XMPP network library
GNU Lesser General Public License v2.1
50 stars 25 forks source link

Drop invalid XML sequences in sanitizeForStream #43

Closed pali closed 5 months ago

pali commented 8 years ago

Make sure that we send only valid XML sequence to XMPP server. Otherwise server can disconnect us.

Ri0n commented 8 years ago

Have you found a real source of the problem? I faced something like this with older Qt versions but never with 4.8.7 if I remember correctly.

pali commented 8 years ago

See reported issue https://github.com/psi-im/iris/issues/42 with my investigation

Ri0n commented 8 years ago

Well this fix for me looks like a workaround. Maybe well-implemented workaround but still workaround.

pali commented 8 years ago

I bet here is problem: http://code.qt.io/cgit/qt/qt.git/tree/src/xml/dom/qdom.cpp#n4217

whole while loop is iterating over UTF-16 codepoints. But Unicode characters above U+FFFF are represented by two surrogate pairs in UTF-16 and above code try to encode them separately.

Such thing can work only for UTF-16, not for UTF-8 (what is I believe used in iris). I can image that first check for canEncode in UTF-8 can return true (as expect UTF-16 surrogate pairs), but second call return false and so we have that low surrogate encoded in XML &#x????; representation...

Ri0n commented 8 years ago

It's still interesting if it's reproducible with Qt-4.8.7. I understand the link above points to recent sources but anyway.

pali commented 8 years ago

This is just prevention to not send any "malformed" or "incorrect" data via network to XMPP server.

pali commented 8 years ago

I do not have 4.8.7 (just 4.8.2), but you can try it yourself with character 😀 (U+1F600 GRINNING FACE). On 4.8.2 it is definitely problem.

pali commented 8 years ago

That XML bug is reported in Qt bugtracker: https://bugreports.qt.io/browse/QTBUG-25291

Volker Götz in first comment said exactly what I understand from that Qt source code.

pali commented 8 years ago

So... do you think that it is useful to have this check for valid XML in sanitize function?

Ri0n commented 8 years ago

I'm not sure. I'd avoid any overhead where possible. And if it's not that CPU consuming it still takes time to support this code.

Probably it's better to keep the code unchanged to understand where real problems happen.

Ri0n commented 5 months ago

@pali thank you very much for the investigation :) your second fix worked for years in Psi without any issue. Btw, for some reason with Qt6 it seems to be not needed. Maybe they changed their defaults.

In the docs I see

 By default, UTF-8 is used for reading and writing

but in same doc something about locale-aware. But we anyway disabled the setCodec() trick for Qt6+