sreedevk / deduplicator

Filter, Sort & Delete Duplicate Files Recursively
MIT License
281 stars · 15 forks

[Bug] "Error: path contains invalid UTF-8 characters" #58

Open Murmur opened 6 months ago

Murmur commented 6 months ago

Version: deduplicator 0.2.1, compiled from the git master branch.
System: Windows 10
Command line: deduplicator.exe --follow-links --json "c:/" > "c:\temp\deduplicator-report_C.txt"

Scanned the entire C: drive, but after running for 2h30m it gives a UTF-8 error. The error does not say which filename or folder caused it.

[00:21:59] 3232274 paths mapped   
[00:00:08] ###### 2692395/2692395 indexed files sizes  
[02:30:50] ###### 2613147/2613147 indexed files hashes   
Error: path contains invalid UTF-8 characters
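
My guess is the failure point is the path-to-string conversion when the report is built. A minimal sketch, assuming the error comes from a Path::to_str() call somewhere in the hashing/reporting stage (path_to_report_string is a hypothetical helper, not code from the repo); a lossy conversion would let the scan finish instead of aborting:

use std::path::Path;

fn path_to_report_string(path: &Path) -> String {
    // Path::to_str() returns None for names that are not valid UTF-8
    // (possible on NTFS, e.g. unpaired UTF-16 surrogates), and turning
    // that None into an error would produce a message like the one above.
    // to_string_lossy() substitutes U+FFFD for the bad units instead,
    // so the report can still be written.
    path.to_string_lossy().into_owned()
}

fn main() {
    println!("{}", path_to_report_string(Path::new("C:\\temp\\data.txt")));
}
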
sreedevk commented 6 months ago

@Murmur Thank you for testing! I will try to recreate the issue with invalid characters in file names and work on a fix after investigating.
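
For reference, one way to create a test file whose name is not valid UTF-8 (Unix-only sketch; on Windows the equivalent would be an unpaired UTF-16 surrogate built with OsString::from_wide):

use std::ffi::OsStr;
use std::fs::File;
use std::os::unix::ffi::OsStrExt;

fn main() -> std::io::Result<()> {
    // 0xFF can never appear in valid UTF-8, so Path::to_str() on this
    // name returns None.
    let name = OsStr::from_bytes(b"dedup_test_\xFF.txt");
    File::create(name)?;
    Ok(())
}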

Murmur commented 6 months ago

@sreedevk Maybe it's a charset problem with the Windows 10 console output redirection? The app could take an --output "c:/temp/report.json" argument and write a UTF-8 text file directly.
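
Something along these lines, as a rough sketch (the --output flag does not exist yet, and write_report is a hypothetical helper, not the app's actual code):

use std::fs::File;
use std::io::Write;

// Write the report to a file when --output is given, otherwise fall back
// to stdout as today.
fn write_report(json: &str, output: Option<&str>) -> std::io::Result<()> {
    match output {
        // Rust strings are always UTF-8, so the file is written as UTF-8
        // regardless of the console code page.
        Some(path) => File::create(path)?.write_all(json.as_bytes()),
        None => std::io::stdout().write_all(json.as_bytes()),
    }
}

fn main() -> std::io::Result<()> {
    write_report("[]", Some("report.json"))
}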

It would also be helpful to have "modifiedts": <utcMillisTimestamp> and "modified": "yyyy-MM-ddThh:mm:ss" fields; then my own Python script could easily build a filtered report of deletion candidates, sorted by size/time/path/filename and so on.
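
A rough sketch of how those fields could be filled from file metadata (the field names are just my suggestion, modified_millis is a hypothetical helper, and the formatted "modified" string would need a date-time crate such as chrono):

use std::fs;
use std::time::UNIX_EPOCH;

// Modification time as milliseconds since the Unix epoch, matching the
// proposed "modifiedts" field.
fn modified_millis(path: &str) -> std::io::Result<u128> {
    let mtime = fs::metadata(path)?.modified()?;
    Ok(mtime.duration_since(UNIX_EPOCH).unwrap_or_default().as_millis())
}

fn main() -> std::io::Result<()> {
    println!("modifiedts: {}", modified_millis("data.txt")?);
    Ok(())
}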

Current json output:
[
  {
    "path": "\\\\?\\C:\\temp\\data.txt",
    "hash": "10834377068631730967",
    "size": 448
  },
  {
    "path": "\\\\?\\C:\\temp\\data - Copy.txt",
    "hash": "10834377068631730967",
    "size": 448
  }
]