Open BartMassey opened 4 years ago
Thanks for reporting!
Lolcate could gain the ability to deal with non-utf8 pathnames just like Fd did a while ago.
However, I'd be more inclined to report non-utf8 and dangling-symlink issues back to the user for further investigation or treatment instead of trying to index them, since they shouldn't remain unresolved in the first place.
Something like
Found 18 non-UTF8 path names. See /tmp/lolcate.xxx1 for details.
Found 2 dangling symlinks. See /tmp/lolcate.xxx2 for details.
What do you think?
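A minimal sketch of the reporting idea above, using only the standard library. The names `ScanReport`, `scan_paths` and `summary` are hypothetical and are not part of Lolcate; the sketch only counts the offending names and does not write a details file.

```rust
use std::path::PathBuf;

/// Hypothetical result type: paths split by whether their name is valid UTF-8.
pub struct ScanReport {
    /// Paths that converted cleanly to `String` and could be indexed as text.
    pub indexable: Vec<String>,
    /// Paths that are not valid UTF-8, kept in their original form.
    pub non_utf8: Vec<PathBuf>,
}

/// Split the given paths into UTF-8 and non-UTF-8 ones.
pub fn scan_paths<I>(paths: I) -> ScanReport
where
    I: IntoIterator<Item = PathBuf>,
{
    let mut report = ScanReport {
        indexable: Vec::new(),
        non_utf8: Vec::new(),
    };
    for path in paths {
        // `OsString::into_string` returns the original `OsString` on failure,
        // so nothing is lost when the name is not valid UTF-8.
        match path.into_os_string().into_string() {
            Ok(text) => report.indexable.push(text),
            Err(raw) => report.non_utf8.push(PathBuf::from(raw)),
        }
    }
    report
}

/// Build the one-line summary, or `None` if every path was valid UTF-8.
pub fn summary(report: &ScanReport) -> Option<String> {
    if report.non_utf8.is_empty() {
        None
    } else {
        Some(format!("Found {} non-UTF8 path names.", report.non_utf8.len()))
    }
}
```

Keeping the rejected paths as `PathBuf` rather than converting them lossily means they could still be written out for later inspection.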
I have files on my box with ISO-8859-1 names that are older than the Unicode standard. I also have files with names produced by disk errors. There's no reasonable way to "fix" these files: they just need to be indexed.
The main issue, which I haven't looked into yet, is what `regex` does with non-utf8 strings, and whether it can make sense for this use case.
Incidentally, I stumbled upon this gist from @ssokolow.
Unfortunately, the code doesn't seem to be Windows-compatible.
Only because I don't have a Windows machine and have so much else to do that I didn't have time to set up a modern.ie testing VM to make sure I was implementing the same transformation that ntfs-3g does for unpaired surrogates.
Poke me around Christmas when my brother is visiting and I'll plug a USB stick into his PC to generate the requisite test files and test the resulting code.
If someone else wants to implement it more quickly, you need to use `cfg` to switch between `use std::os::unix::ffi::{OsStrExt, OsStringExt};` and `use std::os::windows::ffi::{OsStrExt, OsStringExt};` and to provide an alternative to `as_bytes` and `from_vec` using `encode_wide` and `from_wide`.
It's just the question "What does Linux see when ntfs-3g encounters a filename from Windows containing unpaired surrogates? ...and how should I encode the data to ensure the transformation round-trips between a Linux build and a Windows build of the code?" that I'm blocked on.
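The `cfg` switch described above might look roughly like the sketch below. The helper names `os_to_bytes` and `bytes_to_os` are hypothetical, and storing the Windows UTF-16 code units as little-endian byte pairs is an assumption made for illustration. It does not answer the open question: the byte form produced on Windows is not the one a Linux build would see for the same file through ntfs-3g.

```rust
use std::ffi::{OsStr, OsString};

/// Unix: an `OsStr` is already an arbitrary byte sequence, so copy it out.
#[cfg(unix)]
pub fn os_to_bytes(s: &OsStr) -> Vec<u8> {
    use std::os::unix::ffi::OsStrExt;
    s.as_bytes().to_vec()
}

/// Unix: rebuild the `OsString` from the raw bytes without validation.
#[cfg(unix)]
pub fn bytes_to_os(bytes: Vec<u8>) -> OsString {
    use std::os::unix::ffi::OsStringExt;
    OsString::from_vec(bytes)
}

/// Windows: serialize each UTF-16 code unit (including unpaired surrogates)
/// as two little-endian bytes. ASSUMPTION: this layout is illustrative only.
#[cfg(windows)]
pub fn os_to_bytes(s: &OsStr) -> Vec<u8> {
    use std::os::windows::ffi::OsStrExt;
    s.encode_wide().flat_map(|unit| unit.to_le_bytes()).collect()
}

/// Windows: inverse of `os_to_bytes`; a trailing odd byte is ignored.
#[cfg(windows)]
pub fn bytes_to_os(bytes: Vec<u8>) -> OsString {
    use std::os::windows::ffi::OsStringExt;
    let wide: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    OsString::from_wide(&wide)
}
```

Within a single platform the two helpers round-trip any file name, which is enough to store names losslessly in a per-machine index. A shared encoding across platforms would need the ntfs-3g behaviour pinned down first.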
Thank you very much, @ssokolow! I don't have any Windows system at my disposal, so I'm very likely to take you up on your invitation to poke you around Christmas!
Pathnames containing non-UTF8 characters are not indexed, but instead produce a warning during indexing.
I am investigating a fix, but it looks quite difficult, which is probably why it has not been done previously.