ggilder / bitrot

MIT License

Normalize Unicode combining characters in paths #30

Closed ggilder closed 9 years ago

ggilder commented 9 years ago

Different filesystems may represent the same file name with different sequences of Unicode code points. For instance, on my Linux ext4 system, the name "ö" is stored as the single precomposed character U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS). On macOS, by contrast, it is stored in decomposed form: U+006F (LATIN SMALL LETTER O) followed by U+0308 (COMBINING DIAERESIS).

Without normalization, a path containing such characters is incorrectly reported as a deletion plus an addition when the files are moved to a different filesystem, because the lookup in the manifest map compares the raw bytes of the path rather than a normalized string.
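To make the mismatch concrete, here is a minimal sketch in Python (chosen only because its standard library is enough to show the effect; it is not the project's code). The `manifest` dict and the `raw_lookup` helper are hypothetical stand-ins for the manifest map described above.

```python
# Two spellings of the same visible file name "ö.txt".
composed = "\u00f6.txt"      # U+00F6, the form described for ext4
decomposed = "o\u0308.txt"   # U+006F + U+0308, the form described for macOS

# Hypothetical manifest keyed by the raw path, written on the first system.
manifest = {composed: "checksum-placeholder"}


def raw_lookup(manifest, path):
    """Look a path up exactly as given, with no normalization."""
    return manifest.get(path)


# The strings render identically but differ as code points and as bytes.
same_string = composed == decomposed
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# Looking up the decomposed spelling misses the existing entry, so the
# file would look deleted under one name and added under the other.
found = raw_lookup(manifest, decomposed)
```

Here `same_string` is `False` and `found` is `None`, even though both names look the same on screen.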

Using Unicode NFC, provided by the unicode/norm package, we can always store the normalized form of the file path and avoid these issues.
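As a rough illustration of the idea, the sketch below normalizes every path to NFC both when storing and when looking up. It uses Python's standard `unicodedata` module as a stand-in for the `unicode/norm` package; `normalize_path`, `record`, and `lookup` are hypothetical names, not functions from this repository.

```python
import unicodedata


def normalize_path(path: str) -> str:
    """Return the NFC form of a path so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", path)


def record(manifest: dict, path: str, checksum: str) -> None:
    """Store a checksum under the normalized path (hypothetical helper)."""
    manifest[normalize_path(path)] = checksum


def lookup(manifest: dict, path: str):
    """Look up a checksum by normalized path (hypothetical helper)."""
    return manifest.get(normalize_path(path))


manifest = {}
# Path as it might arrive in decomposed form: "o" + COMBINING DIAERESIS.
record(manifest, "o\u0308.txt", "checksum-placeholder")
# The same name in precomposed form, as it might arrive from another system.
result = lookup(manifest, "\u00f6.txt")
```

With both sides normalized, `result` is the stored checksum, so the file is no longer seen as added or deleted. Normalizing at the boundary (on store and on lookup) keeps the rest of the code unaware of which form the filesystem returned.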

ggilder commented 9 years ago

@faun

faun commented 9 years ago

Seems like a reasonable strategy.