olson-sean-k / wax

Opinionated and portable globs that can be matched against paths and directory trees.
https://glob.guide
MIT License
115 stars 10 forks source link

Support non-utf8 path? #26

Closed WindSoilder closed 2 years ago

WindSoilder commented 2 years ago

Refer to: https://github.com/nushell/nushell/issues/2987 If we create a file from the code:

  use std::ffi::OsString;
  use std::fs;
  use std::io;
  use std::os::unix::ffi::OsStringExt;

  fn main() -> io::Result<()> {
      let chars: Vec<u8> = (1..=46)
          .into_iter()
          .chain(48..=70)
          .chain(150..200)
          .collect();

      fs::create_dir(OsString::from_vec(chars.clone()))
  }

And run glob walk with the following code:

use wax::Glob as WaxGlob;

fn main() {
    let glob = WaxGlob::new("*").unwrap();
    let glob_results: Vec<_> = glob.walk(".").collect();
}

It makes the program panic, with the message says that: unexpected encoding: Utf8Error

olson-sean-k commented 2 years ago

Thanks for reporting this!

This is definitely a bug in Wax. I believe the issue is the code here, which uses from_os_str_lossy from the bstr crate. That function lossily converts byte sequences into UTF-8 text by replacing invalid bytes with Unicode replacement characters, but only on non-Unix platforms. Given that the example code you've shared is using the std::os::unix module, I'm guessing that the bytes are passed through unmodified to from_utf8.

I plan to remove the dependency on bstr, as it is hardly exercised at all. Rather than using from_os_str_lossy, the likely fix will be to (unconditionally) use OsStr::to_string_lossy instead.

Please note that Wax does not support paths that cannot be interpreted as UTF-8 text. However, this lack of support should only lead to two primary limitations: (1) glob expressions are UTF-8 strings, so non-UTF-8 literals cannot be represented or matched directly and (2) the matched text (captures) of non-literals may differ from the bytes of the matching path. Panics like this should not occur.

olson-sean-k commented 2 years ago

I just took a look at nushell/nushell#2987. Is the ls command in this context using Glob::walk?. Since there is no reported panic, it seems like it is not, but I'm not sure. If it is, I'd recommend using the walkdir crate instead unless the ls command also accepts or otherwise uses glob expressions other than * and ** (or possibly if nushell is already using Wax elsewhere; Glob::walk is currently implemented atop walkdir).

WindSoilder commented 2 years ago

Please note that Wax does not support paths that cannot be interpreted as UTF-8 text. However, this lack of support should only lead to two primary limitations: (1) glob expressions are UTF-8 strings, so non-UTF-8 literals cannot be represented or matched directly and (2) the matched text (captures) of non-literals may differ from the bytes of the matching path. Panics like this should not occur.

I'm fine with that, if * can match non utf-8 path, everything is fine

I just took a look at https://github.com/nushell/nushell/issues/2987. Is the ls command in this context using Glob::walk?

Yeah, that issue doesn't mention about Glob::walk. But nushell has a glob command which make use of Glob::walk.

olson-sean-k commented 2 years ago

This has been fixed in 5263d5a. That change removes the bstr dependency and unconditionally uses OsStr::to_string_lossy when constructing a CandidatePath from an OsStr or Path.