libjxl / libjxl

JPEG XL image format reference implementation
BSD 3-Clause "New" or "Revised" License
2.6k stars 251 forks source link

Windows cjxl.exe cannot handle Japenes/Chinese/Russian unicode filenames #683

Open Joachim-Otahal opened 2 years ago

Joachim-Otahal commented 2 years ago

I have a log of pictures with russian, japanese and chinese filenames. Examples: 3d-art-красивые-картинки-art-пираты-6497178.jpeg or фото-Природа-озеро-горы-6946340.jpeg When I rename them using the CMD shell: ren фото-Природа-озеро-горы-6946340.jpeg 00000temp.jpeg I can convert them and rename it back afterwards. See attached screenshot. It does not happen für typical WESTERN unicode filenames with German Umlaut like ÖÄÜ.

JPEG-XL-cjxl exe-unicode-issue

That exact source picture on the internet: http://joyreactor.cc/post/4943227

Cannot test that for *nix, don't have a linux at home any more (last was about 2015). But since the normale cmd copy / xcopy / robocopy and all other windows command line tools can handle it it should be possible for the windows version of CJXL too.

kind regards,

Joachim Otahal

jonsneyers commented 2 years ago

I assume Windows is doing some legacy thing here that makes fopen() not work on the strings that end up in argv if those contain certain kinds of non-ascii characters. I consider that a bug in Windows more than anything else, but if there's a workaround we can/should implement it. Any dev familiar with Windows peculiarities who can help with that?

Joachim-Otahal commented 2 years ago

For comparison - following command lines work fine in other well known programs: Opens the picture: "c:\Program Files\XnViewMP\xnviewmp.exe" X:\pics\joyreactor\3d-art-красивые-картинки-art-пираты-6497178.jpeg Creating a tiny video: c:\prog\ffmpeg\bin\ffmpeg.exe -i X:\pics\joyreactor\3d-art-красивые-картинки-art-пираты-6497178.jpeg -c:v libx265 c:\tmp\3d-art-красивые-картинки-art-пираты-6497178.mkv Converts successfully: "c:\Program Files\ImageMagick\magick.exe" convert "X:\pics\joyreactor\3d-art-красивые-картинки-art-пираты-6497178.jpeg" -resize 25% c:\tmp\3d-art-красивые-картинки-art-пираты-6497178.jpeg

So the solution must be out there already implemented countless times.

eddiezato commented 2 years ago

Hmm, for me the Russian names work fine. Maybe it has something to do with the OS.

Joachim-Otahal commented 2 years ago

Imagemagick works too, but I cannot set "effort" ? And does "Quality 100" really mean lossless? Example / documentation missing, or my google-fu is broken. And it may be behind when it comes to your tool regarding bugs.

Screenshot 2021-10-04 134028

Joachim-Otahal commented 2 years ago

Hmm, for me the Russian names work fine. Maybe it has something to do with the OS.

You seem to have russian locale installed in your Windows. I only have German/English, whereas imagemagic, ffmpeg and so on can candle it.

eddiezato commented 2 years ago

You seem to have russian locale installed in your Windows

This is true. My OS installed from the Eng ISO, but after that WU automatically installed the Russian language pack.

eddiezato commented 2 years ago

I haven't Japanese locale and cmd can't display the file name correctly, but cjxl works fine. The second screenshot is from Windows Terminal.

Joachim-Otahal commented 2 years ago

I haven't Japanese locale and cmd can't display the file name correctly, but cjxl works fine. The second screenshot is from Windows Terminal.

Your left screenshot is just a font problem. Select, for example, Lucida Console or Liberation Mono as font, and your CMD will have Japanese characters. Could you test it with a Russian filename? See the Joyreactor link in the first post as example. Maybe it is a Windows 11 problem on my machine here.... But then: The same is on Server 2022....

eddiezato commented 2 years ago

is just a font problem

I use Cascadia Code, with Consolas the Japanese text is displayed the same way.

Could you test it with a Russian filename?

I have a Russian locale installed. https://github.com/libjxl/libjxl/issues/683#issuecomment-933402119

Joachim-Otahal commented 2 years ago

To rule out an Server 2022 or Windows 11 issue I installed a plain German Server 2019 test version in Hyper-V (+ run windows updates) to check on that. Server 2019 is equal to Windows 10 1809. It is still the same problem: Screenshot 2021-10-05 010727

The virtual machine is configuration version 8.0, so any Windows 10 1607 (aka Anniversary) or later and any Server 2016 or later will be able to import it. It is not activated, will run the usual 180 days eval time. The Adminpassword is empty, so it will log on automatically. Download size is ~10 GB. https://joumxyzptlk.de/tmp/S2019-JPEGXL-CJXL.7z Tell me when I can remove it from my site :).

TheDecryptor commented 2 years ago

What happens if you run chcp 65001 in the terminal first before cjxl/djxl?

Joachim-Otahal commented 2 years ago

No change. Whether Windows 11 or the virtual plain Server 2019: image

IMHO it should be possible to solve, and has been solved by others. Imagmagick, ffmpeg, xcopy, xnview and so on have no problems. This comment also means: I can remove that virtual test machine from my webserver since no one bothered to check it there.

vtorri commented 2 years ago

use "UNICODE" versions of functions on Windows :

Joachim-Otahal commented 2 years ago

Oh, just that simple? Then I hope this will be incorporated in their own .EXE releases soon!

use "UNICODE" versions of functions on Windows :

Oh, just that simple? Then I hope this will be incorporated in their own .EXE releases soon!

jonsneyers commented 2 years ago

Is there a way to do it just with compiler flags (i.e. so filenames are passed to main() and interpreted in fopen() as UTF-8), or do you always have to use the UTF-16 wmain/_wfopen variants to make unicode filenames work in Windows? If it's the latter, it's not that simple, since it means we have to implement UTF-16 wchar variants of the option parsing in cjxl/djxl...

TheDecryptor commented 2 years ago

According to Microsoft you can opt into UTF-8 by default by either specifying it via an application manifest or setting the minimum target version to Windows 10 1903, but that won't help users on an earlier version.

Users can also force UTF-8 system wide, which is a big hammer but does solve the issue (Unlike my earlier chcp suggestion)

Joachim-Otahal commented 2 years ago

Users can also force UTF-8 system wide, which is a big hammer but does solve the issue (Unlike my earlier chcp suggestion)

Neither is necessary on the Windows side since Windows Vista. You don't need to change your region, regionale, or codepage - I can simply stay in my german OS and use Chinese, Russian, Korean whatever filenames within the .CMD window, within the same filename - including Farsi with its right-to-left writing style. This part of the code in CJXL seems to be over a decade behind. Though I personally would say: This specific issue has less priority over other issues since there are easy to workarounds in .CMD or .PS1 scripts. Examples of higher priority: Many pictures require that we have to play with -P -g and a huge range of -I values to get better compression than the input .PNG / .GIF instead of cjxl being able to do this automatically with a high effort setting. Or the huge amount of .GIF files, for example from older dilbert strips, which cjxl cannot compress better and blow up by 20% or more, no matter what options you try.

mo271 commented 2 years ago

TODO: check if this is can be reproduced with cjxl_ng.

Joachim-Otahal commented 2 years ago

TODO: check if this is can be reproduced with cjxl_ng.

Got a binary link? https://github.com/libjxl/libjxl/releases does not have a cjxl_ng.exe. Trying the latest from https://artifacts.lucaversari.it/libjxl/libjxl/latest/, which contains a cjxl_ng.exe, which throws this error: image

mo271 commented 2 years ago

Thanks for trying this!

Joachim-Otahal commented 2 years ago

Thanks for trying this!

If course I try. My current workaround in my powershell script is to create a hardlink in the filesysten with non-unicode name, encode that, and rename the result like the original filename (with .jxl of course).

Joachim-Otahal commented 1 year ago

Just for information: I've been using my powershell script to convert several 100'000 pictures to .JXL, including a workaround for this unicode problem. If there is interest I'll clean it up a bit and put it in my github repo.

Dogway commented 1 year ago

I don't recommend you to promote the use of JPEG XL while it isn't stable yet, people is degrading their "lossless" transcodes by signaling degraded dequantization coefficients. See here. Work is underway I think. 0c60630d16fcee56c6d8b306cd95cd4ecf0edc7d

jendalinda commented 1 year ago

I've already transcoded my entire collection of photos. There's no degradation, it's completely reversible. JXL is very convenient JPG packer.

Joachim-Otahal commented 1 year ago

@Dogway From the link you provide I see the problem with your source images. In my tests I never had such an issue, bit-comparison showed no difference when opened in either XnVviewMP or Affinity photo with subtracting the original from the .jxl. Your issue might be dependent on other factors . The way the original .jpg are encoded in your case(s), small library differences depending on the underlying OS and so on. Though that issue has nothing to do with the unicode-filename problem discussed here.

Dogway commented 1 year ago

I was only replying to: Just for information: I've been using my powershell script to convert several 100'000 pictures to .JXL, (...). If there is interest I'll clean it up a bit and put it in my github repo. And advising against promoting a software in beta that could damage people photos, if you add a warning, that's fine.

Joachim-Otahal commented 1 year ago

Thanks for trying this!

Any news on this, or a new build to test? Today's build from https://artifacts.lucaversari.it/libjxl/libjxl/latest/ still cannot handle unicode.

a84r7a3rga76fg commented 1 year ago

Just for information: I've been using my powershell script to convert several 100'000 pictures to .JXL, including a workaround for this unicode problem. If there is interest I'll clean it up a bit and put it in my github repo.

@Joachim-Otahal Yes please, I'd appreciate this. Been looking for a script like this.

Joachim-Otahal commented 1 year ago

Just for information: I've been using my powershell script to convert several 100'000 pictures to .JXL, including a workaround for this unicode problem. If there is interest I'll clean it up a bit and put it in my github repo.

@Joachim-Otahal Yes please, I'd appreciate this. Been looking for a script like this.

Working on it, making it "version 0.1" release ready. Currently adding, as v 0.2, a multi-thread variant with a maximum given number of threads. The multi-thread version is less verbose by design as the single thread version, but faster since everyone has more than one core. But I need a few days of testing before I put that online :D.

Joachim-Otahal commented 1 year ago

Here are two scripts, meant to drag and drop one or more directories on it, and it will start the conversion. They will DELETE the source if the ..JXL result is smaller and the conversion was without errors. Set $WhatIf to $true in the script if you don't trust my code. They include a workaround for the unicode issue, so mass converting Chinese/Korean/Persian/Russian/Japanese etc file names are not a problem any more. Default setting is "lossless with effort 8". Be so kind to report errors, but please do a good error report :D. https://github.com/Joachim-Otahal/PowerShell/tree/main/Jpeg-XL

scurest commented 6 months ago

This issue got a little sidetracked, but to return to what TheDecryptor said, here's a summary of the problem

Here's a tentative patch that adds a manifest. I have tested the {c,d}jxl-x64-windows-static binaries from building with this, and can confirm they fix this issue for me.

diff --git a/cmake/windows_utf8.manifest b/cmake/windows_utf8.manifest
new file mode 100644
index 00000000..dab929e1
--- /dev/null
+++ b/cmake/windows_utf8.manifest
@@ -0,0 +1,8 @@
+<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
+<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
+  <application>
+    <windowsSettings>
+      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
+    </windowsSettings>
+  </application>
+</assembly>
diff --git a/tools/CMakeLists.txt b/tools/CMakeLists.txt
index be1352cc..dd7d4ee6 100644
--- a/tools/CMakeLists.txt
+++ b/tools/CMakeLists.txt
@@ -332,6 +332,11 @@ foreach(BINARY IN LISTS TOOL_BINARIES)
     endif()
     set_target_properties(${BINARY} PROPERTIES LINK_FLAGS "${JXL_WASM_TOOLS_LINK_FLAGS}")
   endif()
+
+  # Attach manifest that tells Windows to use UTF-8 for eg. fopen
+  if(MSVC)
+    target_sources(${BINARY} PRIVATE "${CMAKE_CURRENT_LIST_DIR}/../cmake/windows_utf8.manifest")
+  endif()
 endforeach()

 install(TARGETS ${TOOL_BINARIES} RUNTIME DESTINATION "${CMAKE_INSTALL_BINDIR}")

It adds a cmake/windows_utf8.manifest file (taken from the Microsoft docs linked above), then uses cmake's target_sources to add that .manifest to all the tool binaries. Cmake knows how "to honor *.manifest source files with MSVC tools. Manifest files named as sources of .exe and .dll targets will be merged with linker-generated manifests and embedded in the binary."

Caveats are:

edit: Here's a build artifact if anyone else wants to test it.

jendalinda commented 6 months ago

Here's another workaround, just using a short 8.3 file name. Short file names contain only MS-DOS compatible characters. This works also with other programs that are not compatible with Windows unicode.

cjxl.exe %~s1 temp.jxl %2 %3 %4 %5 %6 %7 %8 %9
move temp.jxl "%~dpn1.jxl"

For more info about these macros see for /?

This trick doesn't work on exFAT filesystems though. exFAT seems not to support short file names anymore.

vtorri commented 6 months ago

imho, you should use UTF-16 to handle non alphabetic languages on Windows

note that for static string, L"foo" will be a wchar_t *, so just define L to nothing on unix

kmilos commented 6 months ago

Here's a tentative patch that adds a manifest. I have tested the {c,d}jxl-x64-windows-static binaries from building with this, and can confirm they fix this issue for me.

This is definitely the way to go forward, but needs to be further reworked a little bit - using .manifest file directly as source only works for MSVC toolchain as you already pointed out, while the MinGW toolchain does not understand it. Instead, one just needs to wrap this .manifest in a .rc file, and then both MSVC and MinGW can deal with it.

See e.g. how libavif did it recently, inspired by RStudio.

imho, you should use UTF-16 to handle non alphabetic languages on Windows

This was the "old" way of doing it, UTF-8 should be preferred now as it is fully cross-platform and less code.

scurest commented 6 months ago

@kmilos Oh, great! So that gives a patch like this?

diff --git a/tools/CMakeLists.txt b/tools/CMakeLists.txt
index be1352cc..3e3df67a 100644
--- a/tools/CMakeLists.txt
+++ b/tools/CMakeLists.txt
@@ -332,6 +332,23 @@ foreach(BINARY IN LISTS TOOL_BINARIES)
     endif()
     set_target_properties(${BINARY} PROPERTIES LINK_FLAGS "${JXL_WASM_TOOLS_LINK_FLAGS}")
   endif()
+
+  # Attach manifest that tells Windows to use UTF-8 for eg. fopen
+  if(WIN32)
+    # MinGW doesn't have a manifest tool (mt.exe), so we need to wrap
+    # utf8.manifest in a resource-definition script (.rc file).
+    target_sources(${BINARY} PRIVATE windows_utf8.rc)
+
+    # The MSVC linker (link.exe) fails with the following error if we
+    # don't use the /MANIFEST:NO linker option:
+    #   /MANIFEST:EMBED,ID=1" failed (exit code 1123) with the following
+    #   output:
+    #   CVTRES : fatal error CVT1100: duplicate resource.  type:MANIFEST,
+    #   name:1, language:0x0409
+    if(NOT MINGW)
+      target_link_options(${BINARY} PRIVATE /MANIFEST:NO)
+    endif()
+  endif()
 endforeach()

 install(TARGETS ${TOOL_BINARIES} RUNTIME DESTINATION "${CMAKE_INSTALL_BINDIR}")
diff --git a/tools/windows_utf8.manifest b/tools/windows_utf8.manifest
new file mode 100644
index 00000000..dab929e1
--- /dev/null
+++ b/tools/windows_utf8.manifest
@@ -0,0 +1,8 @@
+<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
+<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
+  <application>
+    <windowsSettings>
+      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
+    </windowsSettings>
+  </application>
+</assembly>
diff --git a/tools/windows_utf8.rc b/tools/windows_utf8.rc
new file mode 100644
index 00000000..411efb7e
--- /dev/null
+++ b/tools/windows_utf8.rc
@@ -0,0 +1,3 @@
+#include <winuser.h>
+
+CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "windows_utf8.manifest"

Can confirm this still works when building with MSVC. Can anybody easily test the MSYS2 binaries?

edit: Should it be if(NOT MINGW) or if(MSVC)?

kmilos commented 6 months ago

Something like that, great! No time to check on MSYS2 yet though...

Should it be if(NOT MINGW) or if(MSVC)?

Since you're inside the if(WIN32) I don't think it matters, but maybe if(MSVC) is clearer - note that MSVC is also defined when compiling w/ clang-cl.

Ukhryuk-Hai commented 6 months ago
  • The user can set this globally for the whole system with the "Beta: Use Unicode UTF-8 for worldwide language support" setting in the Control Panel.

I tested this method and it is unacceptable for practical use, because then in localized programs the font is replaced with strange characters.

kmilos commented 6 months ago

I tested this method and it is unacceptable for practical use

Luckily it's not needed to change the system-wide locale w/ the above patch.

scurest commented 6 months ago

Okay, that won't build with mingw. I get

[399/443] Building RC object tools/CMakeFiles/jxlinfo.dir/windows_utf8.rc.obj
FAILED: tools/CMakeFiles/jxlinfo.dir/windows_utf8.rc.obj 
D:\a\_temp\msys64\mingw64\bin\windres.exe
  -O coff
  -DFJXL_ENABLE_AVX512=0
  -DHWY_DISABLED_TARGETS="(HWY_SSSE3|HWY_AVX3|HWY_AVX3_SPR|HWY_AVX3_ZEN4|HWY_SVE|HWY_SVE2|HWY_SVE_256|HWY_SVE2_128|HWY_RVV)"
  -D__DATE__=\"redacted\"
  -D__TIMESTAMP__=\"redacted\"
  -D__TIME__=\"redacted\"
  -I D:/a/libjxl/libjxl
  -I D:/a/libjxl/libjxl/third_party/highway
  -I D:/a/libjxl/libjxl/build/lib/include
  D:/a/libjxl/libjxl/tools/windows_utf8.rc
  tools/CMakeFiles/jxlinfo.dir/windows_utf8.rc.obj
'HWY_AVX3' is not recognized as an internal or external command, operable program or batch file.

https://github.com/scurest/libjxl/actions/runs/8358808655/job/22880878058?pr=1#step:7:421

My best guess is the quotes in -DHWY_DISABLED_TARGETS=" should be escaped, like for -D__DATE__, etc. Not sure why they aren't. Possibly related: https://stackoverflow.com/questions/45248034/cmake-generated-mingw-makefile-has-quoting-errors

kmilos commented 6 months ago

that won't build with mingw

Confirmed the same error locally - MSYS2 UCRT64 bash shell, Ninja generator...

I guess windres is more fussy than other frontends...

Theo1996 commented 6 months ago

I am using W7 jap locale and it can handle japanese but not chinese,russian or korean or the characters " ! " and " ' ",havent tried others, just sharing my case.

jendalinda commented 6 months ago

Try the simple trick with short file names as I mentionem above, this should work in Windows 7.

Theo1996 commented 6 months ago

@jendalinda It changes the outputs file's name, so it doesnt really fix. I just delete the characters that cause the error if I really want to convert it. And without putting short names in the output name path expansion it just cant write to the file.

jendalinda commented 6 months ago

@Theo1996 The file must be renamed after the conversions, taht's what move does.

Theo1996 commented 6 months ago

@jendalinda I understand that.

kmilos commented 6 months ago

@scurest Have you tried also including

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 9b74537..e8e19dd 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -300,7 +300,7 @@ else ()
     if (NOT JPEGXL_ENABLE_SIZELESS_VECTORS)
       set(HWY_DISABLED_TARGETS "${HWY_DISABLED_TARGETS}|HWY_SVE|HWY_SVE2|HWY_SVE_256|HWY_SVE2_128|HWY_RVV")
     endif()
-    add_definitions(-DHWY_DISABLED_TARGETS=\(${HWY_DISABLED_TARGETS}\))
+    add_definitions(-DHWY_DISABLED_TARGETS="\(${HWY_DISABLED_TARGETS}\)")
   endif()

   # Machine flags.

Edit: Even better, drop the deprecated add_definitions and use add_compile_options w/ a generator that skips it for RC language, something like

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 9b74537..1eee58e 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -300,7 +300,7 @@ else ()
     if (NOT JPEGXL_ENABLE_SIZELESS_VECTORS)
       set(HWY_DISABLED_TARGETS "${HWY_DISABLED_TARGETS}|HWY_SVE|HWY_SVE2|HWY_SVE_256|HWY_SVE2_128|HWY_RVV")
     endif()
-    add_definitions(-DHWY_DISABLED_TARGETS=\(${HWY_DISABLED_TARGETS}\))
+    add_compile_options($<$<NOT:$<COMPILE_LANGUAGE:RC>>:-DHWY_DISABLED_TARGETS=\(${HWY_DISABLED_TARGETS}\)>)
   endif()

   # Machine flags.
scurest commented 6 months ago

@kmilos It builds, I don't know if it works. Would be great if someone who uses MinGW or Windows can test it.

kmilos commented 5 months ago

@scurest That second approach w/ generator confirmed building and working in MSYS2 as well:

C:\msys64\opt\libjxl\bin>cjxl фото-Природа-озеро-горы-6946340.jpeg фото-Природа-озеро-горы-6946340.jxl -d 0 -e 8
JPEG XL encoder v0.10.2 aacecd24 [AVX2,SSE4,SSE2]
Note: Implicit-default for JPEG is lossless-transcoding. To silence this message, set --lossless_jpeg=(1|0).
Encoding [JPEG, lossless transcode, effort: 8]
Compressed to 2544.0 kB including container
Joachim-Otahal commented 5 months ago

Wohoo, can't wait until the next release, so I can kick my ugly workaround...

mo271 commented 4 days ago

@scurest Thanks for looking into this! Could you make a PR if you have something that fixes the problem?