pkolaczk / fclones

Efficient Duplicate File Finder
MIT License

File name encoding error on Windows #110

Closed SuzukiHonoka closed 2 years ago

SuzukiHonoka commented 2 years ago

File names that contain non-English characters (Chinese, Japanese, etc.) are not displayed correctly on Windows. I haven't tested on Linux yet. Please take a look, thanks.

37bcef91be35e03883d75bb3d905704f, 65690105 B (65.7 MB) * 2:
    D:\\Music\\RADWIMPS - 鍓嶅墠鍓嶄笘.flac

SuzukiHonoka commented 2 years ago

Tested in a Linux environment; it works fine.

pkolaczk commented 2 years ago

Not sure I understand. The output file should be valid UTF-8. Are you viewing the file with a UTF-8 capable text file viewer?

What you pasted looks like some Chinese characters, but I don't know Chinese so I can't tell what's wrong with them.

Can you attach the file you're getting?

SuzukiHonoka commented 2 years ago

The text viewer is definitely UTF-8 capable. I guess the problem is related to the default Windows encoding: it may use "GBK" rather than UTF-8 for encoding file names. The following is the normal output.

37bcef91be35e03883d75bb3d905704f, 65690105 B (65.7 MB) * 2:
    D:\\Music\\RADWIMPS - 前前前世.flac
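
The GBK guess above can be checked with a small sketch. This is a hypothetical reproduction in Python, not code from fclones: it assumes the name was written as UTF-8 and then read back by something that decoded the bytes as GBK. The variable names are illustrative only.

```python
# Hypothetical reproduction of the garbling; not taken from fclones.
original = "前前前世"

# Encode the name as UTF-8, the encoding the report is expected to use.
utf8_bytes = original.encode("utf-8")

# Misread those bytes as GBK, as a tool relying on a Chinese Windows
# code page might do.
garbled = utf8_bytes.decode("gbk")
print(garbled)

# For this name every byte pair maps to a GBK character, so the damage
# can be undone by reversing the two steps.
recovered = garbled.encode("gbk").decode("utf-8")
print(recovered)
```

If the garbled string printed here matches the garbled name in the first report, that points to a UTF-8 report being decoded as GBK somewhere, rather than to fclones writing GBK.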

pkolaczk commented 2 years ago

Can you attach the file? I need the original file, not interpreted by GitHub post formatter.

Anyway, fclones is not supposed to produce a GBK or any other language-specific local encoding in its reports, because that would kill portability of the reports.

It uses UTF-8 (actually STFU-8, which is UTF-8 with additional escaping of non-UTF characters), so it should be able to represent any character representable in Windows, regardless of your locale settings.

SuzukiHonoka commented 2 years ago

original file uploaded

https://mega.nz/file/GwEniA4C#ZqUB37GXSkm4au47qIxl7rSi9rbA_u2pU6OxX4pcxMo

pkolaczk commented 2 years ago

No, I didn't mean the flac file, but the report file produced by fclones. fclones group -o send_me_this_file.txt ...

SuzukiHonoka commented 2 years ago

diff.txt Here you are. The normal file name is 當山みれい (当山真玲) - わけあって.mp3.

pkolaczk commented 2 years ago

The report looks like UTF-16, which doesn't seem right. Ok, when I have a chance to grab a Windows machine, I'll give it a try. In the meantime, can you check if the json output is ok?

fclones group -f json -o diff.txt ...

pkolaczk commented 2 years ago

I cannot reproduce this.

I tested it with a Windows 10 machine, with NTFS filesystem, with non-ascii file names from Latin-2 charset (Polish) and the exact Japanese file name you reported, and the report looks ok.

Both -o report.txt and stdout redirect >report.txt produced a valid UTF-8 encoded report. All non-ASCII characters displayed correctly in Notepad, and the ones from the Latin-2 charset also displayed correctly in CMD (however, the Japanese ones did not, because the font doesn't support them). I also managed to transfer the file to my Linux workstation and it opened correctly there, with the Japanese characters looking right.

Your report seems to have totally messed up encoding, not just of the file names: everything in the report is encoded weirdly, as if converted to UTF-16 (which Windows uses internally).

Can you provide more details on:

Also, please try to generate the report with the -o diff.txt option; don't use any stdout redirect to other programs. I can't see that option mentioned in your diff.txt file. I guess something external messed up the encoding after fclones produced it, by trying to transcode it to UTF-16 incorrectly (probably by assuming the wrong input encoding).
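
One way to check whether a report is UTF-8 or was transcoded to UTF-16 is to inspect its raw bytes. The following is a rough sketch in Python, not part of fclones; the function name `guess_report_encoding` and the NUL-byte threshold are assumptions made for illustration, and the heuristic does not distinguish UTF-32.

```python
import codecs


def guess_report_encoding(data: bytes) -> str:
    """Return a rough label for the encoding of a report file's bytes."""
    # A byte order mark is the clearest hint of UTF-16.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # UTF-16 without a BOM holds many NUL bytes when the text is mostly ASCII.
    if data.count(0) > len(data) // 4:
        return "utf-16 (no BOM?)"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"


sample = "D:\\Music\\RADWIMPS - 前前前世.flac"
print(guess_report_encoding(sample.encode("utf-8")))
print(guess_report_encoding(sample.encode("utf-16")))
```

To apply it to a real report, read the file in binary mode, for example `open("diff.txt", "rb").read()`, and pass the bytes to the function.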

SuzukiHonoka commented 8 months ago

Sorry for the late reply, somehow I missed your messages. The latest fclones version works well on Windows 11. Thanks again for developing this great tool! Best regards.