rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
96.71k stars 12.5k forks source link

Tracking Issue for Path::file_prefix #86319

Open mbhall88 opened 3 years ago

mbhall88 commented 3 years ago

Feature gate: #![feature(path_file_prefix)]

This is a tracking issue for an implementation of the method std::path::Path::file_prefix. It is effectively a "left" variant of the existing file_stem method.

Public API

use std::path::Path;

let path = Path::new("foo.tar.gz");
assert_eq!(path.file_prefix(), Some("foo"));

Steps / History

Unresolved Questions

schwern commented 2 years ago

How about a corresponding file_suffix? That completes the set. file_stem/extension and file_prefix/file_suffix.

    pub fn file_suffix(&self) -> Option<&OsStr> {
        self.file_name().map(split_file_at_dot).and_then(|(_before, after)| Some(after))
    }
use std::path::Path;

let path = Path::new("foo.tar.gz");
assert_eq!(path.file_prefix(), Some("foo"));
assert_eq!(path.file_suffix(), Some("tar.gz"));

assert_eq!(path.file_stem(), Some("foo.tar"));
assert_eq!(path.extension(), Some("gz"));
mbhall88 commented 2 years ago

@schwern that is indeed a nice idea. I guess it would be best to get the go ahead from someone in the libs team?

I don't know how these tracking issues work. The PR for this feature was merged a while ago, but I have no idea if I'm supposed to do something with this issue? @dtolnay as you were the reviewer/merger of that PR are you able to answer this?

cyqsimon commented 2 years ago

Any progress on this? It's a very useful API when working with files like tarballs.

mbhall88 commented 2 years ago

@cyqsimon it is currently on nightly I believe. Not sure how long it takes to get onto stable though.

As for @schwern's suggestion, I guess that would have to be raised as a separate PR/issue and approved by someone on the libs team.

hustcer commented 2 years ago

How about a corresponding file_suffix? That completes the set. file_stem/extension and file_prefix/file_suffix.

Good point !

yescallop commented 2 years ago

One of my concerns is that a file_prefix or file_suffix method that simply slices the path wouldn't be very much useful, because it is common that users insert dots in the meaningful part of file names, such as File Version 3.5.4 (Copy).tar.gz (from Jacob Lifshay on Zulip Chat). The current file_stem and extension methods exist mainly because of File Associations on Windows and common practice on Linux, while it is relatively rare that one would need the exact part of file name before or after the first dot.

I'd suggest adding this to the unresolved questions, if possible.

lmapii commented 2 years ago

While it might be impossible to have a 100% correct implementation in my opinion it would already be helpful to simply have all rust crates behave in a known manner. E.g., there are now hundreds of rust crates out there dealing with files, e.g., matching filenames, all behaving a little different. It is impossible to know how a rust crate that you're including will behave.

Even having a method that might change over time and for which it is documented how it behaves in certain corner cases would be very, very useful. Compare it to the pathlib in python: It is not perfect but it is a beautiful piece of software to use.

EvanCarroll commented 2 years ago

while it is relatively rare that one would need the exact part of file name before or after the first dot.

It's not actually relatively rare. We went over this in voice chat. I gave you multiple reasons why someone would want to do this. In practice,

It's very popular with javascript (where there is no magic byte and everything is js or ts) .d.ts, .conf.json, .conf.js, .rc.json, rc.js, .esm.js, .cjs.js. But also with translations .ru.md etc. Are all common practices that we have to contend with right now many of these in the same project.


The current suggestion to get the functionality is,

.file_name().and_then(OsStr::to_str).map_or(false, |name| name.ends_with(".tar.gz"))

Which I think is a bit convoluted. I am down for doing this other ways, but I don't think something like this should is "rare" as far as path manipulation goes, or that it should require converting to-and-from str. I think the simple-most question here is

We could potentially find better ways to do all of this though. Like simply returning an iterator for file_parts where given foo.bar.tar.gz could be (list monster in graphic below),

The bottom line here is that when it comes to the file name, people will logically think of splitting on . and want names for all possible sets/strings of that list, (from the List Monster in Learn You a Haskell)

zj2Lb

Something about that API of returning an Iterator just seems simpler than arguing about whether an "extension" in foo.tar.gz is .tar.gz, tar.gz, .gz or gz.

yescallop commented 2 years ago

There might've been some misunderstandings here. Let's take this file name File Version 3.5.4 (Copy).tar.gz (from Jacob Lifshay) as an example. In this case, .file_prefix() returns File Version 3 while .file_suffix() returns .5.4 (Copy).tar.gz. If you need to determine if this file name ends with .tar.gz or not, you still need to do stuffs like

.and_then(OsStr::to_str).map_or(false, |name| name.ends_with(".tar.gz"))

IMO this is not optimal. Maybe we could add methods ends_with and strip_suffix on OsStr if this is sufficient for most use cases. But things are getting too complicated in here.

EvanCarroll commented 2 years ago

I can see the utility in better processing for File Version 3.5.4 (Copy).tar.gz. We could define a filename as

filename            = ( file_stem, extension? );
file_stem           = ( dotfile | non_dots );
dotfile             = ( "." + non_dots );
extension           = extension_segment+;
extension_segment   = ("." , valid_extension );
non_dots            = [^.]+?;
valid_extension     = [^\s.]+;

I believe the above would make such that

File Version 3.5.4 (Copy).tar.gz

Gets parsed as

file_stem = "File Version 3.5.4 (Copy)"
extension = ".tar.gz"

But it's probably still not sufficient because if the filename was instead File Version 3.5.4.tar.gz there would literally be nothing you could do if you desired file_name to be File Version 3.5.4. And I think that's fine. That's why I advocate returning an iterator for extension in the above comment, which is what Python does too. That makes processing easier.


IMO this is not optimal. Maybe we could add methods ends_with and strip_suffix on OsStr if this is sufficient for most use cases. But things are getting too complicated in here.

I agree with the OsStr suggestion. It doesn't make sense anyway not to have strip_suffix, strip_prefix, starts_with and ends_with on OsStr. But I still think having good pathing functionality in the library should be a goal, rather than to throw up our hands and say it's "too complex" or to pretend like it's rare to want to know an extension on a file. (Discussion topic on this suggestion opened https://internals.rust-lang.org/t/feature-request-starts-with-and-ends-with-on-osstr/16652)

mbhall88 commented 2 years ago

We could potentially find better ways to do all of this though. Like simply returning an iterator for file_parts where given foo.bar.tar.gz

I think this is probably the "correct" way to approach this. With such an iterator method we can easily implement any/all flavours of splitting up file names.

In borrowing from python's pathlib (which is very ergonomic), it may be as simple as implementing something akin to suffixes. Because with the method this issue tracks (file_prefix), and a new method suffixes, you have all parts.

Using the running example of the corner case file name

use std::path::Path;

let path = Path::new("path/to/File Version 3.5.4 (Copy).tar.gz");
let mut exts = path.suffixes()

assert_eq!(path.file_prefix(), Some("File Version 3"));
assert_eq!(exts.next(), Some(".5"));
assert_eq!(exts.next(), Some(".4 (Copy)"));
assert_eq!(exts.next(), Some(".tar"));
assert_eq!(exts.next(), Some(".gz"));
assert_eq!(exts.next(), None);

Although, I can also understand the argument that the following is more "complete"

use std::path::Path;

let path = Path::new("path/to/File Version 3.5.4 (Copy).tar.gz");
let mut parts = path.file_parts()

assert_eq!(parts.next(), Some("File Version 3"));
assert_eq!(parts.next(), Some(".5"));
assert_eq!(parts.next(), Some(".4 (Copy)"));
assert_eq!(parts.next(), Some(".tar"));
assert_eq!(parts.next(), Some(".gz"));
assert_eq!(parts.next(), None);
yescallop commented 2 years ago

FYI, here's my attempt to implement Path::extensions which is akin to suffixes in python. Your comments are very welcome.

let mut exts = Path::new("foo.tar.gz").extensions().unwrap();
assert_eq!(exts.next(), Some(OsStr::new("gz")));
assert_eq!(exts.next(), Some(OsStr::new("tar")));
assert_eq!(exts.next(), None);
assert_eq!(exts.visited(), Some(OsStr::new("tar.gz"));
assert_eq!(exts.remaining(), "foo");

With this implementation, we have .path_prefix() == .extensions().map(|exts| exts.skip_all().remaining()) and .path_suffix() == .extensions().and_then(|exts| exts.skip_all().visited()).

yescallop commented 2 years ago

Correct me if I'm wrong, but I believe that having a method that extracts the part of file name before or after the first dot is not a good idea. By splitting a file name at the first dot, one asserts that they won't have dots in the non-extension part of file name. However, there are actually a bunch of densely-dotted file names in the wild that go against this design, for example:

pyroute2.nslink-0.6.9.tar.gz, api.admin.users.service.spec.ts, Demo.Sales.APISvc.openapi.yaml, sap.ui.webc.common.d.ts, proxy.remote.docker.eximee.conf.json, webpack.dist.components.px.conf.js, org.freebsd.rc.json, and so on.

IMO, none of these file names benefits much from a split at the first dot unless it is guaranteed that we deny dots in the non-extension part of a file name. I'd prefer to go with the iterator solution suggested above and I'm willing to open a PR for this.

EvanCarroll commented 2 years ago

I believe that having a method that extracts the part of file name before or after the first dot is not a good idea. By splitting a file name at the first dot, one asserts that they won't have dots in the non-extension part of file name.

Agreed, that's why I named the original suggestion file_parts and not extension.

We could potentially find better ways to do all of this though. Like simply returning an iterator for file_parts where given foo.bar.tar.gz could be (list monster in graphic below),

yescallop commented 2 years ago

@EvanCarroll That might be a solution, but I think there are some disadvantages to it:

The file_parts method returns a forward iterator over the parts of file name. I think this is against the fact that it is often the case that a program considers the non-extension part of file name opaque to path processing. As long as the extensions match, the program may extract the remaining part of file name and pass it to another function for further parsing, whether there are dots in it or not.

For this reason, a forward iterator seems not very practical. Also, I'm afraid it is not allowed to do a .collect::<&OsStr>() on an iterator that yields OsStr slices, because that requires &OsStr: FromIterator<&OsStr>. Doing a .collect::<OsString>() is possible, but it does unnecessary allocation.

This is why I implemented Path::extensions and two methods on the resulting Extensions struct:

If it is really needed to do what .file_prefix() does right now, we may have:

And if it is sometimes useful to return the extensions with preceding dots, we could have:

mbhall88 commented 2 years ago

Correct me if I'm wrong, but I believe that having a method that extracts the part of file name before or after the first dot is not a good idea. By splitting a file name at the first dot, one asserts that they won't have dots in the non-extension part of file name. However, there are actually a bunch of densely-dotted file names in the wild that go against this design, for example:

pyroute2.nslink-0.6.9.tar.gz, api.admin.users.service.spec.ts, Demo.Sales.APISvc.openapi.yaml, sap.ui.webc.common.d.ts, proxy.remote.docker.eximee.conf.json, webpack.dist.components.px.conf.js, org.freebsd.rc.json, and so on.

IMO, none of these file names benefits much from a split at the first dot unless it is guaranteed that we deny dots in the non-extension part of a file name. I'd prefer to go with the iterator solution suggested above and I'm willing to open a PR for this.

You are right, none of those listed file names benefit from a split at the first dot. But there are many more file names "in the wild" that do. Solution: anyone working with those file names uses some other method to extract the parts they want...
Saying a method is not needed because some corner cases make it useless doesn't negate it being very useful in a whole bunch of other circumstances.

robmv commented 8 months ago

Instead of creating new methods with new semantics that will not treat files with dots on parts of the filename not related to the extension, why not extend the current ones like:

let path = Path::new("foo 1.2.3.tar.gz");
assert_eq!(path.file_stem_multi(2), Some("foo 1.2.3"));
assert_eq!(path.extension_multi(2), Some("tar.gz"));

This will only have a different contract than the current ones, where:

Otherwise, the portion of the file name before the final `.`

is changed to;

Otherwise, the portion of the file name before the nth `.` from the end.

Where file_stem() and extension() are just a simple case of these new methods with the argument 1. Another helper methods like with_extension(..) could be extended to indicate the number of segments of the extension wanted to be replaced.

Note: named with a better suffix than _multi.

mbhall88 commented 8 months ago

@thomcc apologies for pinging you directly but I couldn't find a tag for the library team anywhere. @dtolnay was originally managing this feature but I don't know if he is on the library team anymore? I don't really know how long these features normally take to progress through to stable but this issue has been open for two and a half years which seems like longer than I would expect?
Can you suggest any next steps to get this stabilised?

tgross35 commented 6 months ago

@thomcc apologies for pinging you directly but I couldn't find a tag for the library team anywhere. @dtolnay was originally managing this feature but I don't know if he is on the library team anymore? I don't really know how long these features normally take to progress through to stable but this issue has been open for two and a half years which seems like longer than I would expect? Can you suggest any next steps to get this stabilised?

Opening a stabilization PR is trivial, you just need to change the unstable gate to stable as described here https://rustc-dev-guide.rust-lang.org/stability.html#stabilizing-a-library-feature.

That being said, I have to agree with the concerns that the algorithm here can be a bit bikeshed-y and this could be better served by pattern functions on OsStr, which has been proposed https://github.com/rust-lang/libs-team/issues/311.

Opening a stabilization PR to start libs-api discussion probably isn't a bad idea in any case.

mbhall88 commented 4 weeks ago

@tgross35 according to the stabilization docs you linked the first step is

Ask a @T-libs-api member to start an FCP on the tracking issue and wait for the FCP to complete (with disposition-merge).

Are you a member of said team? If I try and use that identifier in this comment it doesn't come up with anything...

tgross35 commented 4 weeks ago

I am not, but team members can always be found in the relevant files on the teams repo https://github.com/rust-lang/team/blob/master/teams/libs-api.toml. Pinging team members isn't really the best way to get something on the radar though, so opening the stabilization PR first and requesting libs-api review is usually easier for trivial things like this (FCP just happens there).

If you have an idea for an iterator-based API it may be worth proposing that too as an alternative. New API proposals go through the ACP process, which is an issue template at https://github.com/rust-lang/libs-team/issues.

(also feel free to drop in on Zulip if you have any more specific questions)

mbhall88 commented 4 weeks ago

Thanks very much for you help @tgross35. I'll open the PR and we can discuss the FCP there and whether this implementation or an iterator API is preferable.