tfussell / xlnt

:bar_chart: Cross-platform user-friendly xlsx library for C++11+

Issues reading text from cells with non ASCII characters #202

Closed carterbc closed 7 years ago

carterbc commented 7 years ago

I'm having an issue with trying to read text from a cell that contains accents, Cyrillic, and/or Chinese characters. I have no problems when writing the text, but when reading the text I get a lot of unexpected characters which hints to me it has something to do with std::string and not supporting unicode?

I feel like I'm probably being stupid and there should be some way to handle this... but I can't figure it out. Is there a way to return a wstring using the cell's value method? If not, is there an alternative approach? How hard would it be to implement a version of value that returns wstring? Any advice would be greatly appreciated.

Here is what I'm trying (I'm using the library within Qt -- QString supports unicode): QString newFrenchText = QString::fromStdString(ws.cell(3, currRow).value<std::string>());

The special characters that were in the cell turn into random chars in newFrenchText... Standard ASCII characters stay the same.

Thank you

paulharris commented 7 years ago

QString supports "unicode", but that just means the storage format (in RAM) is UTF-16, i.e. the same as Java, JavaScript engines, Windows NTFS, etc.

UTF-8 is everywhere else, including the web. Hopefully UTF-16 will die one day; UTF-8 is just easier.

If you use toStdString() or similar, it may not convert to anything except Latin-1 (this depends on the Qt version), i.e. it butchers your text.

If you want a wstring, then that is based on wchar_t, which is either 2 bytes (on Windows) or 4 bytes (on Linux), i.e. Windows = UTF-16 and Linux = UTF-32.

You can use toStdWString() in Qt, but that may not convert from UTF-16 to UTF-32 as you'd expect.

Instead, you probably want to convert UTF-16 to UTF-8. I think Qt has a function for that, probably toUtf8() or something similar. I think it used to dump an array of chars that you can then copy into a string.

Or use utfcpp, which is a library that the xlnt guys have started using. Your input is the raw UTF-16 chars within the QString; one of its methods gives you the pointer, which is all you really need (no extra processing by the Qt code).

For more info on what encoding you want in the end, google "utf8-everywhere", but the short answer is you probably want UTF-8 everywhere and then convert to UTF-16 when you need to talk to Windows / Java APIs.

E.g. if you are printing to the command line in Linux, or writing to a text file, then convert UTF-16 to UTF-8 and print that out.
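
To make the Latin-1 problem concrete, here is a minimal sketch in plain C++ (no Qt or xlnt involved). The names `decode_latin1`, `decode_utf8` and `demo_check` are made up for illustration, and the UTF-8 decoder assumes well-formed input and does no validation. It only shows why the same bytes give different text depending on which encoding the reader assumes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode bytes as Latin-1: every byte becomes exactly one code point.
// This is the wrong assumption when the bytes are really UTF-8.
std::u32string decode_latin1(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes) out.push_back(static_cast<char32_t>(b));
    return out;
}

// Decode bytes as UTF-8. Minimal sketch: assumes well-formed input.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        int extra;  // number of continuation bytes that follow the lead byte
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        ++i;
        for (int k = 0; k < extra && i < bytes.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Usage: "é" is the two bytes 0xC3 0xA9 in UTF-8.
inline void demo_check() {
    const std::string e_acute = "\xC3\xA9";
    assert(decode_utf8(e_acute).size() == 1);    // one character, U+00E9
    assert(decode_latin1(e_acute).size() == 2);  // two characters, U+00C3 U+00A9
}
```

Pure ASCII text decodes identically both ways, which matches the symptom in the original report: only the accented, Cyrillic and Chinese characters come out wrong.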

carterbc commented 7 years ago

Thank you for the great explanation. It definitely helped!

I got it working. There might be a cleaner way, but for anyone who sees this in the future with a similar problem, here's what I did:

std::string newFrenchText = ws.cell(3, currRow).value<std::string>();
const char* cNewFrenchText = newFrenchText.c_str();
QString qNewFrenchText = QString::fromUtf8(cNewFrenchText);

paulharris commented 7 years ago

That looks good to me :)

tfussell commented 7 years ago

Thanks for helping out with info @paulharris. I'll be considering whether it would make sense to create a std::wstring overload for cell::value<>() on Windows platforms since there are already wstring overloads for reading and writing to the file system.

paulharris commented 7 years ago

Yes might be handy, convert from utf8 to utf16 and dump to/from a wstring. Much easier than the end library user trying to learn about unicode and which function to use.

I would also add methods like this...

class Cell { // whatever it is
    ...
    std::string value_utf8() const { return this->value<std::string>(); }
    std::wstring value_utf16() const { return this->value<std::wstring>(); }
};

and the appropriate set methods, however you like to name it.

The point is to make it explicit that if you call value_utf8() you will get a utf8 encoded string, and ditto for utf16.
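
As a rough illustration of the explicit-encoding idea, here is a self-contained sketch in standard C++. `DemoCell` and `utf8_to_utf16` are hypothetical names, not part of xlnt, and the sketch assumes the cell text is stored as well-formed UTF-8. It returns `std::u16string` rather than `std::wstring` so the result is UTF-16 on every platform, whatever the size of `wchar_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Convert UTF-8 to UTF-16. Minimal sketch: assumes well-formed input.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        int extra;  // number of continuation bytes that follow the lead byte
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        ++i;
        for (int k = 0; k < extra && i < in.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Hypothetical stand-in for a cell that stores its text as UTF-8.
class DemoCell {
public:
    explicit DemoCell(std::string utf8) : text_(std::move(utf8)) {}
    std::string value_utf8() const { return text_; }
    std::u16string value_utf16() const { return utf8_to_utf16(text_); }
private:
    std::string text_;
};

// Usage: the Cyrillic letter U+0416 is the bytes 0xD0 0x96 in UTF-8.
inline void demo_cell_check() {
    DemoCell cell("\xD0\x96");
    assert(cell.value_utf8().size() == 2);   // two UTF-8 bytes
    assert(cell.value_utf16().size() == 1);  // one UTF-16 code unit
}
```

With names like these, the caller never has to guess which encoding a returned string is in.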

You could also add some free-function style overloads for Qt (without depending on the Qt library; if the user #includes the header then they get that support).

ie

// xlnt_qt.hpp

#pragma once

#include <QString>

namespace blah blah {

inline QString fromCell( Cell const& cell ) { return QString::fromStdWString( cell.value_utf16() ); }

inline void setCell( Cell & cell, QString const& s ) { blah blah }

}

Then the call is simply

QString newFrenchText = xlntns::fromCell(ws.cell(3, currRow));

no need to learn about unicode.
