hobuinc / untwine

GNU General Public License v3.0
52 stars 22 forks source link

Untwine on Windows fails with non-ascii paths #56

Closed wonder-sk closed 2 years ago

wonder-sk commented 3 years ago

On Windows, trying to index a non-ascii input file path e.g. c:\Users\martin\Downloads\čučoriedky\test.laz will fail and no EPT output is created (only empty ept-data, ept-hierarchy, temp directories). This is the case when indexing is started from QGIS through QgisUntwine API or directly from command line. QGIS converts file names to UTF-8 before passing them to Untwine. Not sure what encoding windows command line uses.

On Linux, things work fine.

It could be PDAL bug - running pdal info c:\Users\martin\Downloads\čučoriedky\test.laz in cmd returns this:

PDAL: bad conversion

(for non-unicode filename pdal info worked fine)

Downstream bug report in QGIS: https://github.com/qgis/QGIS/issues/41833

abellgithub commented 2 years ago

What do you do internally? Windows doesn't normally do UTF-8. I think std::filesystem::path uses the encoding of the target system.

wonder-sk commented 2 years ago

When calling untwine, we convert the input filename to utf-8 (we could convert the filename to some other encoding, but utf-8 is probably most universal).

In QGIS we use unicode strings (QString) everywhere, and everything related to paths is handled by Qt classes (such as QFile, QDir, QFileInfo) that work with QString filenames. I have not studied how Qt deals with unicode paths.

I had brief experience with unicode path issues on windws with c++ in geodiff project, where we ended up using std::wstring variants of functions/constructors to make things work, e.g. here https://github.com/lutraconsulting/geodiff/blob/bfcae12b3427c05ee82596510ada3fe3dd7167ad/geodiff/src/geodiffutils.cpp#L407

abellgithub commented 2 years ago

I'm confused on this. I don't know that the pdal info issue is related. AFAICT, the default codepage (the one on my machine, at least) is 437, which has no representation for č. I'm not even sure how it's displayed when I paste the filename into cmd.exe. But when pdal info gets the filename, the little thing on top of the c is gone, which makes sense, because how would it be represented?

But Untwine is passing the filename as a string, which is a std::string, which should pass the UTF-8 name through, and I've checked that we open a file with the č in it successfully when that name is, say, read from a pipeline file -- on windows we convert the UTF-8 name to UTF-16 before we open the file.

Anyway, I'll try some more stuff, but I'm hoping someone can tell me the magic that I'm missing here. It could have something to do with the encoding of the characters of the filename as they're passed to the WIN32 API to start Untwine.

wonder-sk commented 2 years ago

I wish I could provide some good advice, but I don't have one much knowledge to offer here. Just some thoughts:

On linux things are somehow easier, as it is nowadays common to use utf-8 everywhere for encoding of unicode strings, but windows has its history will all the win32 api extensions to handle unicode strings...

abellgithub commented 2 years ago

I'm going to need more information if I'm going to fix this. Perhaps having the error handling in will allow you to figure out what's wrong. I created a las file called <something>\č\č.las and used the test program that's with the API to run Untwine successfully on Windows 10 on that as a source. Perhaps you could look at the test program and/or rerun with the new error handling to help diagnose the issue. Maybe at least verify that the test program works in your environment?

Note that I'm expecting the file specification to be in UTF-8. I think that's what you said above. PDAL converts to UTF-16 on Windows.

wonder-sk commented 2 years ago

Okay, thanks for the update. Right now the plan is that after QGIS 3.24 release in mid-February we will switch to newer PDAL + laz-perf + untwine in order to get COPC support and the other untwine fixes, and then we will follow up on this...

wonder-sk commented 2 years ago

Verified this is still a problem on Windows using untwine from qgis branch.

To reproduce - take a file with unicode characters and try to convert it with untwine.exe or check with "pdal info". I have used OSGeo4W.

It looks like the actual problem is in pdal - whatever error untwine reports is also reported by "pdal info". It seems that PDAL is trying to do some kind of conversion, and it either corrupts the filename, or the conversion fails.

Various results I got from testing:

pdal-unicode-chars

image

You can see the untwine error is the same as "pdal info" error.

In OSGeo4W, just install "qgis-dev" package - you will get pdal (directly available in path of osgeo4w shell) and untwine (in c:\osgeo4w\apps\qgis-dev\untwine.exe).

abellgithub commented 2 years ago

Does this same failure occur with the test program that comes with the API? I'm really not sure what you're doing on the command line, but when I tested I saw that you can cut and paste what appear to be non-ascii characters, but because the terminal codepage doesn't support those characters, they're somehow modified to ascii characters before being passed to the program.

You should test with the API in order to confirm the issue and you need to make sure that the data is encoded as UTF-8 when passed through the API.

wonder-sk commented 2 years ago

Does this same failure occur with the test program that comes with the API?

I don't know, I don't have dev environment on Windows to try to compile and run the test program, sorry...

but because the terminal codepage doesn't support those characters, they're somehow modified to ascii characters before being passed to the program.

The terminal does not seem to be the issue. If I try to open any of my unicode-named files with notepad from the same terminal where I run "pdal info", it opens fine, so I assume the filename is passed correctly, just PDAL does some bad conversion:

image

(the file here is using "smiley face" unicode character in case you'd like to try it yourself: U+263A)

abellgithub commented 2 years ago

I probably misspoke about the codepage, but PDAL doesn't use WinMain, it uses main(). Here is an example program that shows the same issue that has nothing to do with PDAL or conversion. It's simply that main() doesn't do UTF-16 characters. We'd have to do another entry point thing to deal with that. But this is beside the point. That PDAL doesn't accept UTF-16 at the command line isn't the issue. What we're talking about is sending UTF-8 through the Untwine API and having it properly converted to UTF-16 when the file is opened. We have tests for this in PDAL and it appears to work correctly.

#include <iostream>
#include <fstream>

int main(int argc, char *argv[])
{
    std::cerr << "Number of args = " << argc << "!\n";
    std::cerr << "Argv[1] = " << argv[1] << "!\n";
    std::cerr << "Strlen(argv[1]) = " << strlen(argv[1]) << "!\n";

    return 0;
}

Output:

acbell@DESKTOP-ONOFO8B:/mnt/c/Users/andre/c++$ ./test2.exe čučoriedky.las
Number of args = 2!
Argv[1] = cucoriedky.las!
Strlen(argv[1]) = 14!
wonder-sk commented 2 years ago

On QGIS side everything seems to be correct: here in QGIS code is where the untwine input file gets converted https://github.com/qgis/QGIS/blob/master/src/providers/pdal/qgspdaleptgenerationtask.cpp#L85 - from QString which is unicode string representation, using QString::toStdString() which converts it to utf-8 encoded std::string (https://doc.qt.io/qt-5/qstring.html#toStdString).

abellgithub commented 2 years ago

Here is the test program I'm using:

#ifndef _WIN32
#include <unistd.h>
#endif

#include <iostream>

#include "QgisUntwine.hpp"

int main()
{
    untwine::QgisUntwine::StringList files;
    untwine::QgisUntwine::Options options;
    std::string exe = "C:\\Users\\andre\\untwine\\build\\untwine.exe";

    untwine::QgisUntwine api(exe);

    std::vector<unsigned char> funnycVec = { 0xc4, 0x8d };
    std::string funnyc(funnycVec.begin(), funnycVec.end());
    std::string v8string { "C:\\Users\\andre\\untwine\\build\\" + funnyc + "\\" + funnyc + ".las" };
    files.push_back(v8string);
    bool ok = api.start(files, "./out", options);
    if (! ok)
    {
        std::cerr << "Couldn't start '" << exe << "!\n";
        exit(-1);
    }

    while (true)
    {
#ifdef _WIN32
        Sleep(1000);
#else
        ::sleep(1);
#endif
        int percent = api.progressPercent();
        std::string s = api.progressMessage();
        std::cerr << "Percent/Msg = " << percent << " / " << s << "!\n";
        if (!api.running())
            break;
    }
    std::cerr << "Error = " << api.errorMessage() << "\n";
}

The file I'm using is "\<whatever>/č/č.las". You'll have to change the paths in the program above to suit your environment. Anyway, this test works fine for me and generates the expected EPT output.

wonder-sk commented 2 years ago

@abellgithub I think I finally understand what's going wrong. Thanks for your test program, it helped.

I have verified the input filename gets correctly passed as utf-8 to the child untwine process. The actual issue is with the output directory. In your test program it's just ./out, but in QGIS we use basename of the original file, so for hello.laz we would create output directory ept_hello.

If you modify your test program where it calls start(), you should be able to replicate the bug:

std::string outDir{ "./out_" + funnyc };
bool ok = api.start(files, outDir, options);

What happens is that untwine starts, it creates the output directory and sub-directories (ept-data, ept-hierarchy, temp), and it seems to start reading the input, but then fails to write the temporary files and aborts. My guess is that the problem could be in Writer::run() where it uses std::ofstream with the utf-8 path, but probably the path is interpreted as being encoded in the active code page? (which is cp1252 also for me)

A linked issue is that when this happens, there's no error message passed to the API client, which made the debugging quite time consuming, so it would be good if the error reporting could be fixed as well for this case...

abellgithub commented 2 years ago

Thanks. There are a couple of issues WRT UTF-16 that need to be addressed for temp files.

abellgithub commented 2 years ago

@wonder-sk : Any chance you could test this: https://github.com/hobuinc/untwine/tree/issue-56

wonder-sk commented 2 years ago

@abellgithub unfortunately I am unable to test it as I do not have full dev environment on windows (and it would take quite long to get it working). But I have looked at the code, and the changes look sensible. So I am happy if you merge all that to qgis branch, from where I would pull untwine source code to QGIS repo, so we get new windows builds with the fix sometime next week...

Thanks a lot for your work on the fix!

wonder-sk commented 2 years ago

I will close this as the original issue has been fixed, but there are more bugs related to unicode paths: #119 and #120