kopia / kopia

Cross-platform backup tool for Windows, macOS & Linux with fast, incremental backups, client-side end-to-end encryption, compression and data deduplication. CLI and GUI included.
https://kopia.io
Apache License 2.0

Non-Unicode filenames are lost #1764

Open yotann opened 2 years ago

yotann commented 2 years ago

When using a UTF-8 locale on Linux, Kopia gets confused by filenames that aren't valid UTF-8. It replaces the non-UTF-8 bytes with the replacement character U+FFFD, making it impossible to recover the original filename. Ideally, Kopia would preserve the original bytes of the filename, but if nothing else it should log an error message.

$ echo $LANG
en_US.UTF-8
$ touch $'\xfe'
$ touch $'\xff'
$ kopia snapshot create .
Snapshotting user@host:/tmp/test ...
 * 0 hashing, 1 hashed (0 B), 0 cached (0 B), uploaded 198 B, estimating...
Created snapshot with root k66fef850bdb93bbb59883c9a3598943c and ID ad11d536043bebc02cf947d46c87929a in 0s
$ kopia content show -j k66fef850bdb93bbb59883c9a3598943c
{
  "stream": "kopia:directory",
  "entries": [
    {
      "name": "\ufffd",
      "type": "f",
      "mode": "0644",
      "mtime": "2022-02-20T22:22:11.374460959-06:00",
      "uid": 1000,
      "gid": 100,
      "obj": "7b5e7b718fdf4d06e0e7e7a8d2c12894"
    },
    {
      "name": "\ufffd",
      "type": "f",
      "mode": "0644",
      "mtime": "2022-02-20T22:20:09.030978745-06:00",
      "uid": 1000,
      "gid": 100,
      "obj": "7b5e7b718fdf4d06e0e7e7a8d2c12894"
    }
  ],
  "summary": {
    "size": 0,
    "files": 2,
    "symlinks": 0,
    "dirs": 1,
    "maxTime": "2022-02-20T22:22:11.374460959-06:00",
    "numFailed": 0
  }
}
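
The collision above can be reproduced outside Kopia. The following is a minimal sketch, assuming the directory listing is serialized with Go's standard `encoding/json` package (the issue does not confirm that this is where the bytes are lost). The `entry` type and `marshalName` helper are hypothetical and are not Kopia's own code. Go's encoder coerces strings to valid UTF-8, so both filenames come out identical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf8"
)

// entry is a hypothetical stand-in for a directory entry; it is not Kopia's type.
type entry struct {
	Name string `json:"name"`
}

// marshalName serializes a filename with encoding/json's default behavior,
// which replaces every invalid UTF-8 byte with the replacement character.
func marshalName(name string) string {
	b, err := json.Marshal(entry{Name: name})
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// The same two single-byte filenames created with touch in the report.
	for _, name := range []string{"\xfe", "\xff"} {
		fmt.Printf("valid UTF-8: %v, JSON: %s\n", utf8.ValidString(name), marshalName(name))
	}
}
```

Both iterations print `valid UTF-8: false, JSON: {"name":"\ufffd"}`, so the two distinct filenames can no longer be told apart after serialization.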

yotann commented 2 years ago

Both Borg and Restic handle this correctly (preserving the bytes). I'm not sure exactly how they store filenames.
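
For illustration, one way to keep arbitrary bytes intact inside a text field is to escape them before serialization. This is only a sketch of that idea, not a description of how Borg or Restic actually store names; `encodeName` and `decodeName` are hypothetical helpers built on Go's `strconv.Quote` and `strconv.Unquote`.

```go
package main

import (
	"fmt"
	"strconv"
)

// encodeName escapes a raw filename into a quoted Go string literal.
// Invalid UTF-8 bytes become \x escapes, so the result is always valid UTF-8.
func encodeName(raw string) string {
	return strconv.Quote(raw)
}

// decodeName reverses encodeName and returns the original bytes.
func decodeName(quoted string) (string, error) {
	return strconv.Unquote(quoted)
}

func main() {
	for _, raw := range []string{"\xfe", "\xff", "plain.txt"} {
		quoted := encodeName(raw)
		back, err := decodeName(quoted)
		if err != nil {
			panic(err)
		}
		fmt.Printf("stored: %s, round-trips: %v\n", quoted, back == raw)
	}
}
```

The trade-off in this sketch is that every stored name carries the quoting, including names that were already valid UTF-8, so existing readers of the field would need to know about the encoding.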

yotann commented 2 years ago

Options that occur to me:

jkowalski commented 2 years ago

Thanks for the report. We should definitely fix that. I'm wondering how this would behave on Windows, which uses Unicode filenames only.