pkolaczk / fclones

Efficient Duplicate File Finder
MIT License

Unicode filename not recognized well on Windows + PowerShell #174

Open phu54321 opened 2 years ago

phu54321 commented 2 years ago
(base) D:\_BMS> fclones group --cache . | fclones link
[2022-11-11 01:52:29.708] fclones.exe:  info: Started grouping
[2022-11-11 01:59:50.710] fclones.exe:  info: Scanned 6646173 file entries
[2022-11-11 01:59:50.742] fclones.exe:  info: Found 6615777 (712.2 GB) files matching selection criteria
[2022-11-11 01:59:54.307] fclones.exe:  info: Found 6301756 (312.5 GB) candidates after grouping by size
[2022-11-11 01:59:54.724] fclones.exe:  info: Found 6301756 (312.5 GB) candidates after grouping by paths
[2022-11-11 02:33:01.393] fclones.exe:  info: Found 1626299 (120.8 GB) candidates after grouping by prefix
[2022-11-11 02:33:51.369] fclones.exe:  info: Found 1616049 (116.1 GB) candidates after grouping by suffix
[2022-11-11 02:44:27.566] fclones.exe:  info: Found 1430180 (104.8 GB) redundant files
[2022-11-11 02:45:38.262] fclones.exe:  info: Started deduplicating
[2022-11-11 02:45:38.267] fclones.exe: warn: Failed to read metadata of 'D:\_BMS\_etc\ultimate\[???] DistorteD MoonlighT\0.BGA.mpg': Failed to read metadata of 'D:\_BMS\_etc\ultimate\[???] DistorteD MoonlighT\0.BGA.mpg': 파일 이름, 디렉터리 이름 또는 볼륨 레이블 구문이 잘못되었습니다. (os error 123)
[2022-11-11 02:45:38.267] fclones.exe: warn: Failed to read metadata of 'D:\_BMS\BMS OF FIGHTERS\[2012] BOF2012\To Be Coontinued\[???] DistorteD MoonlighT\0.BGA.mpg': 파일 이름, 디렉터리 이름 또는 볼륨 레이블 구문이 잘못되었습니다. (os error 123)
[2022-11-11 02:45:38.267] fclones.exe: warn: Could not determine files to drop in group with hash 4d4f338df94fd9a1c7a1c481c05ac489 and len 187994116: Metadata of some files could not be obtained
[2022-11-11 02:45:38.272] fclones.exe:  info: Processed 3 files and reclaimed 676.6 MB space
[2022-11-11 02:45:38.272] fclones.exe: error: Failed to read file list: Invalid path     D:\\_BMS\\BMS OF FIGHTERS\\[2018] G2R2018\\overground\\Schizophrenicpatients\\[?????????????????????????] ??????????????\movie.mp4
: 120 when decoding DecodeError { kind: UnescapedSlash, index: 120, mat: "\\" } [index=\]

The actual directory names are [π/3] DistorteD MoonlighT and [縺輔°縺阪??縺帙▽縲?縺セ縺医□] 豁サ縺ォ縺溘縺ェ縺. (Yes, that really is a filename.) It seems fclones couldn't recognize the Unicode names here.

Thanks

pkolaczk commented 2 years ago

Can you please attach the problematic report file with duplicates?

c22 commented 1 year ago

Here is some additional info for this issue (still present in 0.29.3)

Report file:

# Report by fclones 0.29.3
# Timestamp: 2023-02-09 14:04:44.822 +1100
# Command: 'C:\Users\c22\.cargo\bin\fclones.exe' group .
# Base dir: C:\\Users\\c22\\Desktop\\DupeTest
# Total: 12 B (12 B) in 3 files in 1 groups
# Redundant: 8 B (8 B) in 2 files
# Missing: 0 B (0 B) in 0 files
718ac45146ab06cd8f7d7c20c1ea6d66, 4 B (4 B) * 3:
    C:\\Users\\c22\\Desktop\\DupeTest\\🍔🍔🍔.txt
    C:\\Users\\c22\\Desktop\\DupeTest\\😊😊😊.txt
    C:\\Users\\c22\\Desktop\\DupeTest\\🤗🤗🤗.txt

Attempt to dedupe:

PS C:\Users\c22\Desktop\DupeTest> fclones.exe group . | fclones link
[2023-02-09 06:05:03.785] fclones.exe:  info: Started grouping
[2023-02-09 06:05:03.788] fclones.exe:  info: Scanned 4 file entries
[2023-02-09 06:05:03.789] fclones.exe:  info: Found 3 (12 B) files matching selection criteria
[2023-02-09 06:05:03.789] fclones.exe:  info: Found 2 (8 B) candidates after grouping by size
[2023-02-09 06:05:03.789] fclones.exe:  info: Found 2 (8 B) candidates after grouping by paths
[2023-02-09 06:05:03.790] fclones.exe:  info: Found 2 (8 B) candidates after grouping by prefix
[2023-02-09 06:05:03.791] fclones.exe:  info: Found 2 (8 B) candidates after grouping by suffix
[2023-02-09 06:05:03.791] fclones.exe:  info: Found 2 (8 B) redundant files
[2023-02-09 06:05:03.810] fclones.exe:  info: Started deduplicating
[2023-02-09 06:05:03.813] fclones.exe: warn: Failed to read metadata of 'C:\Users\c22\Desktop\DupeTest\????????????.txt': Failed to read metadata of 'C:\Users\c22\Desktop\DupeTest\????????????.txt': The filename, directory name, or volume label syntax is incorrect. (os error 123)
[2023-02-09 06:05:03.813] fclones.exe: warn: Failed to read metadata of 'C:\Users\c22\Desktop\DupeTest\????????????.txt': Failed to read metadata of 'C:\Users\c22\Desktop\DupeTest\????????????.txt': The filename, directory name, or volume label syntax is incorrect. (os error 123)
[2023-02-09 06:05:03.813] fclones.exe: warn: Failed to read metadata of 'C:\Users\c22\Desktop\DupeTest\????????????.txt': Failed to read metadata of 'C:\Users\c22\Desktop\DupeTest\????????????.txt': The filename, directory name, or volume label syntax is incorrect. (os error 123)
[2023-02-09 06:05:03.813] fclones.exe: warn: Could not determine files to drop in group with hash 718ac45146ab06cd8f7d7c20c1ea6d66 and len 4: Metadata of some files could not be obtained
[2023-02-09 06:05:03.813] fclones.exe:  info: Processed 0 files and reclaimed 0 B space

Result is that no files are de-duplicated.

I can possibly take a stab at a fix for this if I get some time.

pkolaczk commented 1 year ago

I tested on Windows in CMD as well as in Wine, and fclones handles the "hamburger" emojis just fine in both. However, the one thing the reports above have in common is PowerShell.

https://github.com/PowerShell/PowerShell/issues/15871

It looks like PowerShell additionally re-encodes content when it is piped between two programs, so fclones link doesn't receive the same bytes that fclones group wrote.
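To make the suspected failure mode concrete, here is a minimal Python sketch of a shell that decodes a program's UTF-8 output with one encoding and re-encodes it with another before passing it on. The function name `simulate_pipe` is made up for illustration, and the encodings used (code page 437 for reading, ASCII for writing) are assumptions about a typical English-locale setup, not something confirmed in this thread; a Korean-locale system would use a different code page and lose data differently.

```python
def simulate_pipe(text: str, console_encoding: str, output_encoding: str) -> str:
    """Model a shell that re-encodes text flowing between two native programs.

    Assumption: the producer writes UTF-8, the shell decodes it with
    `console_encoding`, then re-encodes it with `output_encoding` for the consumer.
    """
    raw = text.encode("utf-8")  # bytes written by the producer
    # The shell interprets those bytes with its own idea of the encoding.
    decoded = raw.decode(console_encoding, errors="replace")
    # The shell re-encodes the text for the consumer; unrepresentable
    # characters are replaced with '?'.
    piped = decoded.encode(output_encoding, errors="replace")
    # The consumer reads the piped bytes as UTF-8.
    return piped.decode("utf-8", errors="replace")


if __name__ == "__main__":
    name = "🍔🍔🍔.txt"
    # Each emoji is 4 UTF-8 bytes; read as 12 single-byte characters and
    # re-encoded as ASCII, they become 12 question marks.
    print(simulate_pipe(name, "cp437", "ascii"))  # ????????????.txt
    # When both sides agree on UTF-8, the name survives unchanged.
    assert simulate_pipe(name, "utf-8", "utf-8") == name
```

Under these assumptions the three-emoji name turns into twelve question marks, which matches the shape of the paths in the warnings above.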

phu54321 commented 1 year ago

Weird issue. I'm okay with using cmd, so

pkolaczk commented 1 year ago

I'm not saying there is nothing to do here. I'm thinking about a workaround. There are a few things I need to try. Maybe adding a BOM on Windows would help. Or I could just escape all non-ASCII characters on Windows (or offer that as an option).
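As a rough illustration of the escaping idea, the Python sketch below turns every non-ASCII character in a path into a `\u{...}` sequence and doubles literal backslashes, so the report line is pure ASCII and should survive any re-encoding by the shell. The escape syntax and the names `escape_non_ascii` and `unescape` are hypothetical; this is not the format fclones actually uses.

```python
def escape_non_ascii(path: str) -> str:
    """Return an ASCII-only form of `path` (hypothetical escape scheme)."""
    out = []
    for ch in path:
        if ch == "\\":
            out.append("\\\\")  # double literal backslashes
        elif ord(ch) < 128:
            out.append(ch)  # plain ASCII passes through
        else:
            out.append("\\u{%x}" % ord(ch))  # e.g. \u{1f354}
    return "".join(out)


def unescape(text: str) -> str:
    """Invert escape_non_ascii."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != "\\":
            out.append(text[i])
            i += 1
        elif text.startswith("\\\\", i):
            out.append("\\")
            i += 2
        elif text.startswith("\\u{", i):
            end = text.index("}", i)  # raises ValueError if unterminated
            out.append(chr(int(text[i + 3:end], 16)))
            i = end + 1
        else:
            raise ValueError(f"invalid escape at index {i}")
    return "".join(out)


if __name__ == "__main__":
    original = "C:\\Users\\c22\\Desktop\\DupeTest\\🍔🍔🍔.txt"
    escaped = escape_non_ascii(original)
    assert escaped.isascii()
    assert unescape(escaped) == original
```

The trade-off is readability: an escaped report is harder for a person to scan, which is one reason to make such behavior optional.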

c22 commented 1 year ago

Good catch @pkolaczk. Turns out the issue was not what I first thought it would be, but your digging has helped me find a workaround that still allows a user to use PowerShell.

Run [Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8 first.

For example:

Without:

PS C:\Users\c22\Desktop\DupeTest> fclones group . | Out-Default
[2023-02-15 11:32:26.973] fclones.exe:  info: Started grouping
[2023-02-15 11:32:26.977] fclones.exe:  info: Scanned 4 file entries
[2023-02-15 11:32:26.977] fclones.exe:  info: Found 3 (12 B) files matching selection criteria
[2023-02-15 11:32:26.978] fclones.exe:  info: Found 2 (8 B) candidates after grouping by size
[2023-02-15 11:32:26.978] fclones.exe:  info: Found 2 (8 B) candidates after grouping by paths
[2023-02-15 11:32:26.988] fclones.exe:  info: Found 2 (8 B) candidates after grouping by prefix
[2023-02-15 11:32:26.989] fclones.exe:  info: Found 2 (8 B) candidates after grouping by suffix
[2023-02-15 11:32:26.990] fclones.exe:  info: Found 2 (8 B) redundant files
# Report by fclones 0.29.3
# Timestamp: 2023-02-15 11:32:26.991 +1100
# Command: 'C:\Users\c22\.cargo\bin\fclones.exe' group .
# Base dir: C:\\Users\\c22\\Desktop\\DupeTest
# Total: 12 B (12 B) in 3 files in 1 groups
# Redundant: 8 B (8 B) in 2 files
# Missing: 0 B (0 B) in 0 files
718ac45146ab06cd8f7d7c20c1ea6d66, 4 B (4 B) * 3:
    C:\\Users\\c22\\Desktop\\DupeTest\\🍔🍔🍔.txt
    C:\\Users\\c22\\Desktop\\DupeTest\\😊😊😊.txt
    C:\\Users\\c22\\Desktop\\DupeTest\\🤗🤗🤗.txt

With:

PS C:\Users\c22\Desktop\DupeTest> [Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8
PS C:\Users\c22\Desktop\DupeTest> fclones group . | Out-Default
[2023-02-15 11:32:37.765] fclones.exe:  info: Started grouping
[2023-02-15 11:32:37.770] fclones.exe:  info: Scanned 4 file entries
[2023-02-15 11:32:37.770] fclones.exe:  info: Found 3 (12 B) files matching selection criteria
[2023-02-15 11:32:37.771] fclones.exe:  info: Found 2 (8 B) candidates after grouping by size
[2023-02-15 11:32:37.771] fclones.exe:  info: Found 2 (8 B) candidates after grouping by paths
[2023-02-15 11:32:37.781] fclones.exe:  info: Found 2 (8 B) candidates after grouping by prefix
[2023-02-15 11:32:37.781] fclones.exe:  info: Found 2 (8 B) candidates after grouping by suffix
[2023-02-15 11:32:37.782] fclones.exe:  info: Found 2 (8 B) redundant files
# Report by fclones 0.29.3
# Timestamp: 2023-02-15 11:32:37.783 +1100
# Command: 'C:\Users\c22\.cargo\bin\fclones.exe' group .
# Base dir: C:\\Users\\c22\\Desktop\\DupeTest
# Total: 12 B (12 B) in 3 files in 1 groups
# Redundant: 8 B (8 B) in 2 files
# Missing: 0 B (0 B) in 0 files
718ac45146ab06cd8f7d7c20c1ea6d66, 4 B (4 B) * 3:
    C:\\Users\\c22\\Desktop\\DupeTest\\🍔🍔🍔.txt
    C:\\Users\\c22\\Desktop\\DupeTest\\😊😊😊.txt
    C:\\Users\\c22\\Desktop\\DupeTest\\🤗🤗🤗.txt

There seems to be a documented way to set your system to always use UTF-8 but it sounds like it could have potential compatibility issues.

I wonder if there is a way that fclones could a) detect it's running in PowerShell and b) set that property temporarily.

That might be asking too much, as this really seems more like a PowerShell issue.

Mikle-Bond commented 1 year ago

Can this theoretically be solved by introducing the -i|--input parameter for link and other commands to specify the file explicitly instead of piping it into stdin?
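A minimal Python sketch of that idea, assuming a hypothetical `-i`/`--input` option (fclones does not necessarily offer one): when a file is given, the report is read as raw bytes and decoded as UTF-8 by the tool itself, so the shell never gets a chance to re-encode it. The names `parse_args` and `read_report` and the program name are made up for illustration.

```python
import argparse
import sys
from typing import Optional, Sequence


def parse_args(argv: Sequence[str]) -> argparse.Namespace:
    # Hypothetical CLI: -i/--input names a report file; stdin is used if omitted.
    parser = argparse.ArgumentParser(prog="link-sketch")
    parser.add_argument("-i", "--input", default=None,
                        help="report file to read instead of stdin")
    return parser.parse_args(argv)


def read_report(path: Optional[str]) -> str:
    """Read the report as bytes and decode it explicitly as UTF-8."""
    if path is None:
        data = sys.stdin.buffer.read()  # piped input, whatever bytes the shell sent
    else:
        with open(path, "rb") as f:  # file input bypasses the pipe entirely
            data = f.read()
    return data.decode("utf-8")


if __name__ == "__main__":
    args = parse_args(sys.argv[1:])
    report = read_report(args.input)
    print(f"read {len(report.splitlines())} report lines")
```

This only helps if the report file itself was written without going through a re-encoding redirect, so the producing command would need a matching way to write its output directly to a file.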

Mikle-Bond commented 1 year ago

Meanwhile, I found that the Use-RawPipeline module helps. For anyone with a similar issue, here's a temporary workaround: https://github.com/GeeLaw/PowerShellThingies/tree/master/modules/Use-RawPipeline

Setting [Console]::OutputEncoding, [Console]::InputEncoding, and $OutputEncoding, as well as changing the code page, didn't help me for some reason.