Closed wonder-sk closed 2 years ago
What do you do internally? Windows doesn't normally do UTF-8. I think std::filesystem::path uses the encoding of the target system.
When calling untwine, we convert the input filename to utf-8 (we could convert the filename to some other encoding, but utf-8 is probably most universal).
In QGIS we use unicode strings (QString) everywhere, and everything related to paths is handled by Qt classes (such as QFile, QDir, QFileInfo) that work with QString filenames. I have not studied how Qt deals with unicode paths.
I had brief experience with unicode path issues on windws with c++ in geodiff project, where we ended up using std::wstring
variants of functions/constructors to make things work, e.g. here https://github.com/lutraconsulting/geodiff/blob/bfcae12b3427c05ee82596510ada3fe3dd7167ad/geodiff/src/geodiffutils.cpp#L407
I'm confused on this. I don't know that the pdal info
issue is related. AFAICT, the default codepage (the one on my machine, at least) is 437, which has no representation for č
. I'm not even sure how it's displayed when I paste the filename into cmd.exe. But when pdal info
gets the filename, the little thing on top of the c
is gone, which makes sense, because how would it be represented?
But Untwine is passing the filename as a string, which is a std::stringč
in it successfully when that name is, say, read from a pipeline file -- on windows we convert the UTF-8 name to UTF-16 before we open the file.
Anyway, I'll try some more stuff, but I'm hoping someone can tell me the magic that I'm missing here. It could have something to do with the encoding of the characters of the filename as they're passed to the WIN32 API to start Untwine.
I wish I could provide some good advice, but I don't have one much knowledge to offer here. Just some thoughts:
cmd
as it may add an extra layer of encoding/decoding of unicode strings - for testsing I would probably hardcode filename (making sure the source code editor interprets the characters using the same encoding as the compiler - ideally utf-8)c
gone - maybe a font issue as many fonts only support a limited range of characters and don't display various Unicode characters correctlyOn linux things are somehow easier, as it is nowadays common to use utf-8 everywhere for encoding of unicode strings, but windows has its history will all the win32 api extensions to handle unicode strings...
I'm going to need more information if I'm going to fix this. Perhaps having the error handling in will allow you to figure out what's wrong. I created a las file called <something>\č\č.las
and used the test program that's with the API to run Untwine successfully on Windows 10 on that as a source. Perhaps you could look at the test program and/or rerun with the new error handling to help diagnose the issue. Maybe at least verify that the test program works in your environment?
Note that I'm expecting the file specification to be in UTF-8. I think that's what you said above. PDAL converts to UTF-16 on Windows.
Okay, thanks for the update. Right now the plan is that after QGIS 3.24 release in mid-February we will switch to newer PDAL + laz-perf + untwine in order to get COPC support and the other untwine fixes, and then we will follow up on this...
Verified this is still a problem on Windows using untwine from qgis
branch.
To reproduce - take a file with unicode characters and try to convert it with untwine.exe or check with "pdal info". I have used OSGeo4W.
It looks like the actual problem is in pdal - whatever error untwine reports is also reported by "pdal info". It seems that PDAL is trying to do some kind of conversion, and it either corrupts the filename, or the conversion fails.
Various results I got from testing:
čučoriedky.laz
, PDAL tries to open cucoriedky.laz
(the č
gets converted to c
but it shouldn't). Similarly, with ťuťo.laz
PDAL tries to open tuto.laz
(if I actually have tuto.laz
in the same folder, pdal will open it successfully)šušo.laz
the conversion just failed with "bad conversion" error message - from a quick google search it looks like the error is coming from some Windows API libs☺.laz
PDAL tries to open ?.laz
You can see the untwine error is the same as "pdal info" error.
In OSGeo4W, just install "qgis-dev" package - you will get pdal (directly available in path of osgeo4w shell) and untwine (in c:\osgeo4w\apps\qgis-dev\untwine.exe).
Does this same failure occur with the test program that comes with the API? I'm really not sure what you're doing on the command line, but when I tested I saw that you can cut and paste what appear to be non-ascii characters, but because the terminal codepage doesn't support those characters, they're somehow modified to ascii characters before being passed to the program.
You should test with the API in order to confirm the issue and you need to make sure that the data is encoded as UTF-8 when passed through the API.
Does this same failure occur with the test program that comes with the API?
I don't know, I don't have dev environment on Windows to try to compile and run the test program, sorry...
but because the terminal codepage doesn't support those characters, they're somehow modified to ascii characters before being passed to the program.
The terminal does not seem to be the issue. If I try to open any of my unicode-named files with notepad from the same terminal where I run "pdal info", it opens fine, so I assume the filename is passed correctly, just PDAL does some bad conversion:
(the file here is using "smiley face" unicode character in case you'd like to try it yourself: U+263A)
I probably misspoke about the codepage, but PDAL doesn't use WinMain, it uses main(). Here is an example program that shows the same issue that has nothing to do with PDAL or conversion. It's simply that main() doesn't do UTF-16 characters. We'd have to do another entry point thing to deal with that. But this is beside the point. That PDAL doesn't accept UTF-16 at the command line isn't the issue. What we're talking about is sending UTF-8 through the Untwine API and having it properly converted to UTF-16 when the file is opened. We have tests for this in PDAL and it appears to work correctly.
#include <iostream>
#include <fstream>
int main(int argc, char *argv[])
{
std::cerr << "Number of args = " << argc << "!\n";
std::cerr << "Argv[1] = " << argv[1] << "!\n";
std::cerr << "Strlen(argv[1]) = " << strlen(argv[1]) << "!\n";
return 0;
}
Output:
acbell@DESKTOP-ONOFO8B:/mnt/c/Users/andre/c++$ ./test2.exe čučoriedky.las
Number of args = 2!
Argv[1] = cucoriedky.las!
Strlen(argv[1]) = 14!
On QGIS side everything seems to be correct: here in QGIS code is where the untwine input file gets converted https://github.com/qgis/QGIS/blob/master/src/providers/pdal/qgspdaleptgenerationtask.cpp#L85 - from QString which is unicode string representation, using QString::toStdString() which converts it to utf-8 encoded std::string (https://doc.qt.io/qt-5/qstring.html#toStdString).
Here is the test program I'm using:
#ifndef _WIN32
#include <unistd.h>
#endif
#include <iostream>
#include "QgisUntwine.hpp"
int main()
{
untwine::QgisUntwine::StringList files;
untwine::QgisUntwine::Options options;
std::string exe = "C:\\Users\\andre\\untwine\\build\\untwine.exe";
untwine::QgisUntwine api(exe);
std::vector<unsigned char> funnycVec = { 0xc4, 0x8d };
std::string funnyc(funnycVec.begin(), funnycVec.end());
std::string v8string { "C:\\Users\\andre\\untwine\\build\\" + funnyc + "\\" + funnyc + ".las" };
files.push_back(v8string);
bool ok = api.start(files, "./out", options);
if (! ok)
{
std::cerr << "Couldn't start '" << exe << "!\n";
exit(-1);
}
while (true)
{
#ifdef _WIN32
Sleep(1000);
#else
::sleep(1);
#endif
int percent = api.progressPercent();
std::string s = api.progressMessage();
std::cerr << "Percent/Msg = " << percent << " / " << s << "!\n";
if (!api.running())
break;
}
std::cerr << "Error = " << api.errorMessage() << "\n";
}
The file I'm using is "\<whatever>/č/č.las". You'll have to change the paths in the program above to suit your environment. Anyway, this test works fine for me and generates the expected EPT output.
@abellgithub I think I finally understand what's going wrong. Thanks for your test program, it helped.
I have verified the input filename gets correctly passed as utf-8 to the child untwine process. The actual issue is with the output directory. In your test program it's just ./out
, but in QGIS we use basename of the original file, so for hello.laz
we would create output directory ept_hello
.
If you modify your test program where it calls start(), you should be able to replicate the bug:
std::string outDir{ "./out_" + funnyc };
bool ok = api.start(files, outDir, options);
What happens is that untwine starts, it creates the output directory and sub-directories (ept-data, ept-hierarchy, temp), and it seems to start reading the input, but then fails to write the temporary files and aborts. My guess is that the problem could be in Writer::run()
where it uses std::ofstream
with the utf-8 path, but probably the path is interpreted as being encoded in the active code page? (which is cp1252 also for me)
A linked issue is that when this happens, there's no error message passed to the API client, which made the debugging quite time consuming, so it would be good if the error reporting could be fixed as well for this case...
Thanks. There are a couple of issues WRT UTF-16 that need to be addressed for temp files.
@wonder-sk : Any chance you could test this: https://github.com/hobuinc/untwine/tree/issue-56
@abellgithub unfortunately I am unable to test it as I do not have full dev environment on windows (and it would take quite long to get it working). But I have looked at the code, and the changes look sensible. So I am happy if you merge all that to qgis
branch, from where I would pull untwine source code to QGIS repo, so we get new windows builds with the fix sometime next week...
Thanks a lot for your work on the fix!
I will close this as the original issue has been fixed, but there are more bugs related to unicode paths: #119 and #120
On Windows, trying to index a non-ascii input file path e.g.
c:\Users\martin\Downloads\čučoriedky\test.laz
will fail and no EPT output is created (only empty ept-data, ept-hierarchy, temp directories). This is the case when indexing is started from QGIS throughQgisUntwine
API or directly from command line. QGIS converts file names to UTF-8 before passing them to Untwine. Not sure what encoding windows command line uses.On Linux, things work fine.
It could be PDAL bug - running
pdal info c:\Users\martin\Downloads\čučoriedky\test.laz
incmd
returns this:(for non-unicode filename
pdal info
worked fine)Downstream bug report in QGIS: https://github.com/qgis/QGIS/issues/41833