realXtend / tundra

realXtend Tundra SDK, a 3D virtual world application platform.
www.realxtend.org
Apache License 2.0
84 stars 70 forks source link

Scene::CreateContentFromXml QString overload unicode problems #540

Open jonnenauha opened 11 years ago

jonnenauha commented 11 years ago

Scene::CreateContentFromXml has two overloads with QString and QDomDocument. My finding is that if you have a QByteArray that has UTF-8 data and you call the function, an QString is automatically constructed with that QByteArray and the QString overload is used. This is expected.

But the scene data gets corrupted and string values that had unicode chars have garbage in them. Using the following fixes the issue:

CreateContentFromXml(QString::fromUtf8(sceneData.data(), sceneData.size()), ...);

Also as many of our other methods in Scene do when you load a txml from *file', you can use a QTextStream with explicit UTF-8 coded, and calling the QDomDocument overload.

QTextStream txmlStream(sceneData);
txmlStream.setCodec("UTF-8");
txmlStream.setAutoDetectUnicode(true);

QDomDocument sceneDocument("Scene");
QString errorMsg; int errorLine, errorColumn;
if (!sceneDocument.setContent(txmlStream.readAll(), &errorMsg, &errorLine, &errorColumn))
{
    LogError(QString("Parsing scene XML failed: %1 at line %2 column %3.").arg(errorMsg).arg(errorLine).arg(errorColumn));
    return;
}
activeScene->CreateContentFromXml(sceneDocument, ...);

This bug would also haunt our --file my.txml startup loading if it would use the IAssets QByteArray data to load, but instead it uses the transfers disk source to load the file, which has the code for QTextStream with UTF-8.

The question remains

Should we have consistent assumption that all txml data is UTF-8 for both saving and loading. Currently saving uses GetXml that returns a QByteArray with automatic encoding. That can be stored to QFile and everything works, then you can load it with the disk file and everything works.

Should we change some part of the behaviour to assume UTF-8 or is it the callers responsibility to do this before calling the QString overload? This was a very elusive but and took me a while to hunt down. It seems that nothing will use this "broken" QString overload whatever you do with startup parameters or use the Tundra UI to save/import/export txml files, those work as they assume UTF-8 and use QFile to write them properly. The problem arises when a script or c++ code wants to load in memory QString/QByteArray directly to the scene.

Another place that is taking in QByteArray but reading it as UTF-8 via QTextStream is Scene::CreateSceneDescFromXml(QByteArray &data, SceneDesc &sceneDesc) which again differs from the assumption we make in the CreateContentFromXml(QString, ...) overload.

What to do? Or should we do nothing? Is this even an issue? Just want some input here.

Additionally should we write <?xml version=”1.0” encoding=”UTF-8”?> into the start of each .txml file we write?

P.S. The QByteArray is fetched via normal QNetworkAccessManager::get() function and the content is not manipulated in any way before pushing it to the QString overload. Additionally the content-type of the web file is text/xml. If these things are relevant. Downloading the file with a browser etc. shows the unicode chars just fine, its not broken in the source.

juj commented 11 years ago

I wish Qt did not have an implicit constructor in QString to take in a QByteArray. That is a seriously poor design decision. They're modelling that each byte blob array is equivalent to an ascii-encoded string. It is akin to e.g. allowing implicit conversions from void * to string or similar.

There is no bug here, except that we want to disable all kinds of implicit conversions from QByteArray to anything (except we can't, without modifying Qt code). A QByteArray is just a raw byte array that is not associated with any metadata that would identify what the byte array contains exactly. Therefore the information needs to explicitly come from whoever provides data in QByteArray form.

As a fix to avoid users from calling CreateContentFromXml(someQByteArray), I recommend declaring a function 'void CreateContentFromXml(const QByteArray &, bool, AttributeChange::Type);' but not defining it, which will give a compiler error if anyone calls CreateContentFromXml(someQByteArray), and have the documentation hint that explicit conversion is required. Qt QString(QByteArray) ctor has assumed the QByteArray must contain ASCII, but there is no reason we should assume something different (e.g. UTF-8) and pretend it'd be anything more correct.

.txml data is xml, and the encoding is specified in the beginning of the file (and secondarily using BOM), according to http://www.opentag.com/xfaq_enc.htm . Users are free to encode xml with whatever encoding they desire, I assume Qt should be able to handle all of them. Specifying encoding explicitly should not be necessary for UTF-8, as it is assumed if neither "encoding"-attribute and BOM is not present, but I don't mind even if it's explicitly specified.