Windows: "PDF error: Couldn't open file" with some unicode filenames

jwilk-archive / pdf2djvu

PDF to DjVu converter

GNU General Public License v2.0


Windows: "PDF error: Couldn't open file" with some unicode filenames #111

Open jwilk opened 9 years ago

jwilk commented 9 years ago

Issue reported by 40a at Bitbucket:

I'm using pdf2djvu.exe on Windows 8.1. I have noticed that for all PDF files that contain the "ی" character (U+06CC) in their names I get the following error:

>>>F:\Software\Media\PDF\pdf2djvu-0.8.2/pdf2djvu.exe --output="E:\out.djvu" "E:\ی.pdf"
PDF error: Couldn't open file 'E:\ÙŠ.pdf': No such file or directory.
Unable to load document

>>>"E:\ی.pdf"

>>>

When running the filepath directly ("E:\ی.pdf") it works fine and causes the file to be opened in Adobe Reader. So I suspect that the issue is caused by the way pdf2djvu decodes its arguments.

I have already tried using the chcp 65001 command to change cmd's codepage to UTF-8, but I still get the same error; only the shape of the mojibake in the error message changes.

Currently I have found no way around this but to rename the file to something else and then do the conversion.

jwilk commented 9 years ago

Thanks for the bug report.

pdf2djvu doesn't itself perform any conversions on the arguments. The C runtime does convert from the Unicode command line to byte-based argv[], using the ANSI codepage as the encoding. If it does it wrong, as seems to be the case here, there's not much we can do about it.

chcp doesn't help, because it only changes console codepage, not the ANSI codepage.

Anyway, I wrote a small test program that should show exactly what's going on here. Could you run it with "E:\ی.pdf" as the argument, and paste the output?

Attachment: testencoding.zip

jwilk commented 9 years ago

Source of the test program:

#include <errno.h>   /* errno */
#include <stdio.h>
#include <string.h>  /* strerror() */
#include <sys/stat.h>
#include <windows.h>

int main(int argc, char **argv)
{
    struct stat st;
    int rc;
    int i;
    printf("GetACP() = %d\n", GetACP());
    printf("GetConsoleOutputCP() = %d\n", GetConsoleOutputCP());
    for (i = 1; i < argc; i++) {
        printf("argv[%d] = \"", i);
        const char *p = argv[i];
        while (*p)
            printf("\\x%02X", (unsigned char)*p++);
        printf("\"\n");
        rc = stat(argv[i], &st);
        printf("stat(argv[%d]) = %d", i, rc);
        if (rc != 0)
            printf(" (%s)", strerror(errno));
        printf("\n");
    }
    wchar_t **argvw;
    int argcw;
    argvw = CommandLineToArgvW(GetCommandLineW(), &argcw);
    if (argvw == NULL) {
        fprintf(stderr, "CommandLineToArgvW() failed\n");
        return 1;
    }
    for (i = 1; i < argcw; i++) {
        printf("argvw[%d] = L\"", i);
        const wchar_t *p = argvw[i];
        while (*p)
            printf("\\u%04X", *p++);
        printf("\"\n");
        rc = wstat(argvw[i], &st);
        printf("wstat(argvw[%d]) = %d", i, rc);
        if (rc != 0)
            printf(" (%s)", strerror(errno));
        printf("\n");
    }
    return 0;
}

/* vim:set ts=4 sts=4 sw=4 et:*/

jwilk commented 9 years ago

Comment submitted by 40a at Bitbucket:

Thank you. I see. AFAIK none of the Microsoft-defined codepages contains the character "ی".

Here is the output:

F:\Downloads>testencoding.exe "E:\ی.pdf"
GetACP() = 1256
GetConsoleOutputCP() = 720
argv[1] = "\x45\x3A\x5C\xED\x2E\x70\x64\x66"
stat(argv[1]) = -1 (No such file or directory)
argvw[1] = L"\u0045\u003A\u005C\u06CC\u002E\u0070\u0064\u0066"
wstat(argvw[1]) = 0

jwilk commented 9 years ago

U+06CC (ARABIC LETTER FARSI YEH) cannot be represented in CP1256, which is your ANSI codepage. Apparently the C runtime converts the character to 0xED, which is U+064A (ARABIC LETTER YEH).
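
To make the failure mode concrete, here is a minimal, portable sketch of the round trip described above. It is not the C runtime's actual conversion code: the table is a two-entry hand-copied excerpt of CP1256, and all function names (cp1256_decode, cp1256_encode_strict, cp1256_encode_lossy, survives_round_trip) are hypothetical. The lossy U+06CC -> 0xED substitution is hard-coded only to mirror the test output in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Tiny excerpt of the CP1256 table; everything else is omitted in this sketch. */
struct cp1256_entry { uint8_t byte; uint32_t codepoint; };

static const struct cp1256_entry cp1256_excerpt[] = {
    { 0xEC, 0x0649 },  /* ARABIC LETTER ALEF MAKSURA */
    { 0xED, 0x064A },  /* ARABIC LETTER YEH */
};
static const size_t cp1256_excerpt_len =
    sizeof(cp1256_excerpt) / sizeof(cp1256_excerpt[0]);

/* Decode one byte: ASCII maps to itself; returns 0 for bytes outside the excerpt. */
uint32_t cp1256_decode(uint8_t byte)
{
    size_t i;
    if (byte < 0x80)
        return byte;
    for (i = 0; i < cp1256_excerpt_len; i++)
        if (cp1256_excerpt[i].byte == byte)
            return cp1256_excerpt[i].codepoint;
    return 0;
}

/* Strict encode: returns the byte, or -1 if the excerpt has no exact mapping. */
int cp1256_encode_strict(uint32_t codepoint)
{
    size_t i;
    if (codepoint < 0x80)
        return (int)codepoint;
    for (i = 0; i < cp1256_excerpt_len; i++)
        if (cp1256_excerpt[i].codepoint == codepoint)
            return cp1256_excerpt[i].byte;
    return -1;
}

/* Assumption: stand-in for the substitution observed in the thread,
   where U+06CC (FARSI YEH) ended up as byte 0xED. */
int cp1256_encode_lossy(uint32_t codepoint)
{
    if (codepoint == 0x06CC)
        return 0xED;
    return cp1256_encode_strict(codepoint);
}

/* Returns 1 if the code point is unchanged after encode + decode, else 0. */
int survives_round_trip(uint32_t codepoint)
{
    int byte = cp1256_encode_lossy(codepoint);
    return byte >= 0 && cp1256_decode((uint8_t)byte) == codepoint;
}
```

With these helpers, survives_round_trip(0x064A) is 1 but survives_round_trip(0x06CC) is 0: the byte 0xED decodes back to U+064A, so the byte-based stat() call looks for a file name that does not exist, while the wide-character wstat() call sees the original U+06CC and succeeds.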

That's going to be tough to fix. :-\

But I'll at least try to improve the error message.