rust-lang / glob

Support for matching file paths against Unix shell style patterns.
http://doc.rust-lang.org/glob
Apache License 2.0
468 stars 76 forks source link

glob() should take a Path not str #78

Open remram44 opened 5 years ago

remram44 commented 5 years ago

Paths on UNIX are bytes, not unicode. Using str means that all of them can't be represented.

In particular, on systems set with a different locale, or when using disks that were created in a different locale, or simply because of some glitch somewhere, files using non-UTF-8 byte sequence can totally exist.

Not only is it not that much of a corner-case, but the standard library already dealt with that by providing the Path type. AsRef<Path> can also be used to allow users to still pass in string slices and string literals.

Related to #23 which is not about the interface but a bug in the internals.

BurntSushi commented 5 years ago

I'm not sure there is anyone actively working on this crate, and fixing this bug probably requires non-trivial changes. You might consider using globset instead.

remram44 commented 5 years ago

Oh, sorry about the noise :sweat_smile: It showed up in the "GitHub Explore" mailing-list "based on your public repository contributions".

Thanks for the pointer.

remram44 commented 5 years ago

globset seems to have the same problem, Glob::new() takes a string. This works fine for Americans and on UTF-8 machines but make it impossible to pass valid Path values to it.

[edit: and globset doesn't have a dedicated issue tracker]

BurntSushi commented 5 years ago

No, globset does not suffer from the same problem. globset does require valid UTF-8 for the pattern, but it's matching function can accept any path.

[edit: and globset doesn't have a dedicated issue tracker]

It might some day, but you can just file bugs against ripgrep with globset mentioned in the issue.

If you have a use case for specifying a pattern that isn't valid UTF-8, then I would certainly be happy to have a bug for that, but please provide a real example, because it's not actually clear what the intended semantics should be.

Also, the focus on "Americans" is completely unnecessary. UTF-8 isn't specific to Americans.

remram44 commented 5 years ago

On a ISO-8859-15 system, I might want to match Glob::new(b"d\xE9cembre 2018 - *.jpg"). Or I might want to pass arguments from args_os() (without panicking on non-UTF-8 filenames) or clap.

I might also just want to build a pattern from an existing Path, for example to find rsync-style temp files (.<filename>.randomstring e.g. .cv_rémi.pdf.GV4H3). Path.file_name() correctly returns &OsStr which I can easily manipulate, but I can't feed it back to Glob::new().

BurntSushi commented 5 years ago

Thanks for the example. I filed an issue and turned your example into code: https://github.com/BurntSushi/ripgrep/issues/1250