ngirard / lolcate-rs

Lolcate -- A comically fast way of indexing and querying your filesystem. Replaces locate / mlocate / updatedb. Written in Rust.
GNU General Public License v3.0
293 stars 18 forks

Does not handle non-UTF8-encodable pathnames #20

Open BartMassey opened 4 years ago

BartMassey commented 4 years ago

Pathnames containing non-UTF8 characters are not indexed, but instead produce a warning during indexing.

I am investigating a fix, but it looks quite difficult, which is probably why it has not been done previously.

ngirard commented 4 years ago

Thanks for reporting!

Lolcate could gain the ability to deal with non-utf8 pathnames just like Fd did a while ago.

However, I'd be more inclined to report non-UTF8 path names and dangling symlinks back to the user for further investigation or treatment rather than trying to index them, since they shouldn't remain unresolved in the first place.

Something like

Found 18 non-UTF8 path names. See /tmp/lolcate.xxx1 for details.
Found 2 dangling symlinks. See /tmp/lolcate.xxx2 for details.

What do you think?

BartMassey commented 4 years ago

I have files on my box with ISO-8859-1 names that are older than the Unicode standard. I also have files with names produced by disk errors. There's no reasonable way to "fix" these files: they just need to be indexed.
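To make the failure mode concrete, here is a minimal, Unix-only sketch of what happens with an ISO-8859-1 file name. It is not Lolcate's actual code: `index_key` and `latin1_example` are hypothetical names invented for illustration. The point is only that `Path::to_str` returns `None` for such a name, while the raw bytes are still available and could be indexed as-is.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: exposes the raw bytes of an OsStr
use std::path::Path;

/// Hypothetical helper: return the path as `&str` when it is valid UTF-8,
/// otherwise fall back to its raw bytes instead of dropping it.
fn index_key(path: &Path) -> Result<&str, &[u8]> {
    match path.to_str() {
        Some(s) => Ok(s),
        None => Err(path.as_os_str().as_bytes()),
    }
}

/// Hypothetical example name: "café.txt" encoded as ISO-8859-1.
/// The byte 0xE9 followed by '.' is not a valid UTF-8 sequence.
fn latin1_example() -> &'static Path {
    Path::new(OsStr::from_bytes(b"caf\xe9.txt"))
}
```

An indexer built along these lines would store the `Err` bytes alongside the UTF-8 names rather than emitting a warning and skipping the entry; how queries should then match against those bytes is the open question.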

The main issue, which I haven't looked into yet, is what regex does with non-utf8 strings, and whether it can make sense for this use case.

ngirard commented 4 years ago

Incidentally, I stumbled upon this gist from @ssokolow.

Unfortunately, the code doesn't seem to be Windows-compatible.

ssokolow commented 4 years ago

Only because I don't have a Windows machine and have so much else to do that I didn't have time to set up a modern.ie testing VM to make sure I was implementing the same transformation that ntfs-3g does for unpaired surrogates.

Poke me around Christmas when my brother is visiting and I'll plug a USB stick into his PC to generate the requisite test files and test the resulting code.

If someone else wants to implement it more quickly, you need to use cfg to switch between use std::os::unix::ffi::{OsStrExt, OsStringExt}; and use std::os::windows::ffi::{OsStrExt, OsStringExt}; and to provide an alternative to as_bytes and from_vec using encode_wide and from_wide.

It's just the "What does Linux see when ntfs-3g encounters a filename from Windows containing un-paired surrogates? ...and how should I encode the data to ensure the transformation round-trips between a Linux build and a Windows build of the code?" that I'm blocked on.
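For anyone picking this up, the cfg switch described above might be sketched roughly as follows. This is not the code from the gist: `os_to_bytes` and `bytes_to_os` are hypothetical names, and the little-endian UTF-16 byte layout on the Windows side is an assumption chosen only for illustration. In particular, it does not answer the ntfs-3g round-trip question, so bytes written by a Windows build are not guaranteed to mean the same thing to a Linux build.

```rust
use std::ffi::{OsStr, OsString};

/// Hypothetical helper (Unix): an OsStr is already arbitrary bytes.
#[cfg(unix)]
fn os_to_bytes(s: &OsStr) -> Vec<u8> {
    use std::os::unix::ffi::OsStrExt;
    s.as_bytes().to_vec()
}

/// Hypothetical helper (Unix): rebuild the OsString from stored bytes.
#[cfg(unix)]
fn bytes_to_os(b: Vec<u8>) -> OsString {
    use std::os::unix::ffi::OsStringExt;
    OsString::from_vec(b)
}

/// Hypothetical helper (Windows): store each UTF-16 code unit, including
/// unpaired surrogates, as two little-endian bytes (an assumed layout).
#[cfg(windows)]
fn os_to_bytes(s: &OsStr) -> Vec<u8> {
    use std::os::windows::ffi::OsStrExt;
    s.encode_wide().flat_map(|unit| unit.to_le_bytes()).collect()
}

/// Hypothetical helper (Windows): inverse of the layout above.
/// A trailing odd byte, if any, is ignored by `chunks_exact`.
#[cfg(windows)]
fn bytes_to_os(b: Vec<u8>) -> OsString {
    use std::os::windows::ffi::OsStringExt;
    let wide: Vec<u16> = b
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    OsString::from_wide(&wide)
}
```

Each build round-trips its own file names losslessly with this pair of functions; making the stored form portable between the two builds still depends on knowing what ntfs-3g does with unpaired surrogates.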

ngirard commented 4 years ago

Thank you very much, @ssokolow! I don't have any Windows system at my disposal, so I'm very likely to take you up on your invitation to poke you around Christmas!

ngirard commented 4 years ago

For reference, here is a relevant discussion on r/rust.