Open jwilk opened 9 years ago
Thanks for the bug report.
pdf2djvu doesn't itself perform any conversions on the arguments.
The C runtime does convert from the Unicode command line to the byte-based argv[], using the ANSI codepage as the encoding.
If it does it wrong, as seems to be the case here, there's not much we can do about it.
chcp doesn't help, because it only changes the console codepage, not the ANSI codepage.
Anyway, I wrote a small test program that should show exactly what's going on here. Could you run it with "E:\ی.pdf" as the argument, and paste the output?
Attachment: testencoding.zip
Source of the test program:
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <windows.h>

int main(int argc, char **argv)
{
    struct stat st;
    int rc;
    int i;
    /* Report the ANSI and console codepages currently in effect. */
    printf("GetACP() = %u\n", GetACP());
    printf("GetConsoleOutputCP() = %u\n", GetConsoleOutputCP());
    /* Dump the byte-based arguments as prepared by the C runtime. */
    for (i = 1; i < argc; i++) {
        printf("argv[%d] = \"", i);
        const char *p = argv[i];
        while (*p)
            printf("\\x%02X", (unsigned char)*p++);
        printf("\"\n");
        rc = stat(argv[i], &st);
        printf("stat(argv[%d]) = %d", i, rc);
        if (rc != 0)
            printf(" (%s)", strerror(errno));
        printf("\n");
    }
    /* Dump the original UTF-16 arguments for comparison. */
    wchar_t **argvw;
    int argcw;
    argvw = CommandLineToArgvW(GetCommandLineW(), &argcw);
    if (argvw == NULL) {
        fprintf(stderr, "CommandLineToArgvW() failed\n");
        return 1;
    }
    for (i = 1; i < argcw; i++) {
        printf("argvw[%d] = L\"", i);
        const wchar_t *p = argvw[i];
        while (*p)
            printf("\\u%04X", (unsigned int)*p++);
        printf("\"\n");
        rc = wstat(argvw[i], &st);
        printf("wstat(argvw[%d]) = %d", i, rc);
        if (rc != 0)
            printf(" (%s)", strerror(errno));
        printf("\n");
    }
    return 0;
}

/* vim:set ts=4 sts=4 sw=4 et:*/
Comment submitted by 40a at Bitbucket:
Thank you. I see. AFAIK none of the Microsoft-defined codepages contain the character "ی".
Here is the output:
F:\Downloads>testencoding.exe "E:\ی.pdf"
GetACP() = 1256
GetConsoleOutputCP() = 720
argv[1] = "\x45\x3A\x5C\xED\x2E\x70\x64\x66"
stat(argv[1]) = -1 (No such file or directory)
argvw[1] = L"\u0045\u003A\u005C\u06CC\u002E\u0070\u0064\u0066"
wstat(argvw[1]) = 0
U+06CC (ARABIC LETTER FARSI YEH) cannot be represented in CP1256, which is your ANSI codepage. Apparently the C runtime converts the character to 0xED, which is U+064A (ARABIC LETTER YEH).
That's going to be tough to fix. :-\
But I'll at least try to improve the error message.
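The mismatch between argv[1] and argvw[1] in the output above can also be checked mechanically. What follows is only a minimal sketch, not what pdf2djvu does: the helper names cp1256_decode and first_lossy_index are hypothetical, and the decoder knows only ASCII plus the one CP1256 byte discussed in this thread (0xED as U+064A). It decodes the byte-based argument and compares it with the UTF-16 one, code unit by code unit.

```c
#include <stddef.h>
#include <stdint.h>

/* Partial CP1256 decoder (assumption: illustration only). It handles
 * ASCII plus the single byte discussed in this thread, 0xED -> U+064A.
 * Any other byte yields 0xFFFF, meaning "not covered by this sketch";
 * a real check would need the full codepage table. */
static uint16_t cp1256_decode(unsigned char b)
{
    if (b < 0x80)
        return b;
    if (b == 0xED)
        return 0x064A; /* ARABIC LETTER YEH */
    return 0xFFFF;
}

/* Hypothetical helper: return the index of the first position where the
 * byte-based argument no longer matches the wide one, or -1 if the two
 * agree. Assumes one byte per character (true for CP1256) and no
 * surrogate pairs in the wide string. */
int first_lossy_index(const unsigned char *narrow, const uint16_t *wide)
{
    size_t i;
    for (i = 0; narrow[i] != 0 && wide[i] != 0; i++) {
        if (cp1256_decode(narrow[i]) != wide[i])
            return (int)i;
    }
    /* A length mismatch also counts as a difference. */
    if (narrow[i] != 0 || wide[i] != 0)
        return (int)i;
    return -1;
}
```

Fed the two dumps from the output above, the helper stops at the fourth character, where argv holds 0xED (U+064A) but argvw holds U+06CC.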
Issue reported by 40a at Bitbucket:
I'm using pdf2djvu.exe on Windows 8.1. I have noticed that for all PDF files that contain the "ی" character (U+06CC) in their names I get the following error:
When I run the file path directly ("E:\ی.pdf"), it works fine and the file opens in Adobe Reader. So I suspect that the issue is caused by the way pdf2djvu decodes its arguments.
I have already tried using the chcp 65001 command to change cmd's codepage to UTF-8, but I still get the same error; only the shape of the mojibake in the error message changes.
Currently I have found no way around this other than renaming the file to something else and then doing the conversion.