Open mbhall88 opened 3 years ago
How about a corresponding file_suffix? That completes the set. file_stem/extension and file_prefix/file_suffix.
pub fn file_suffix(&self) -> Option<&OsStr> {
self.file_name().map(split_file_at_dot).and_then(|(_before, after)| Some(after))
}
use std::path::Path;
let path = Path::new("foo.tar.gz");
assert_eq!(path.file_prefix(), Some("foo"));
assert_eq!(path.file_suffix(), Some("tar.gz"));
assert_eq!(path.file_stem(), Some("foo.tar"));
assert_eq!(path.extension(), Some("gz"));
@schwern that is indeed a nice idea. I guess it would be best to get the go ahead from someone in the libs team?
I don't know how these tracking issues work. The PR for this feature was merged a while ago, but I have no idea if I'm supposed to do something with this issue? @dtolnay as you were the reviewer/merger of that PR are you able to answer this?
Any progress on this? It's a very useful API when working with files like tarballs.
@cyqsimon it is currently on nightly I believe. Not sure how long it takes to get onto stable though.
As for @schwern's suggestion, I guess that would have to be raised as a separate PR/issue and approved by someone on the libs team.
How about a corresponding file_suffix? That completes the set. file_stem/extension and file_prefix/file_suffix.
Good point !
One of my concerns is that a file_prefix
or file_suffix
method that simply slices the path wouldn't be very much useful, because it is common that users insert dots in the meaningful part of file names, such as File Version 3.5.4 (Copy).tar.gz
(from Jacob Lifshay on Zulip Chat). The current file_stem
and extension
methods exist mainly because of File Associations on Windows and common practice on Linux, while it is relatively rare that one would need the exact part of file name before or after the first dot.
I'd suggest adding this to the unresolved questions, if possible.
While it might be impossible to have a 100% correct implementation in my opinion it would already be helpful to simply have all rust crates behave in a known manner. E.g., there are now hundreds of rust crates out there dealing with files, e.g., matching filenames, all behaving a little different. It is impossible to know how a rust crate that you're including will behave.
Even having a method that might change over time and for which it is documented how it behaves in certain corner cases would be very, very useful. Compare it to the pathlib
in python
: It is not perfect but it is a beautiful piece of software to use.
while it is relatively rare that one would need the exact part of file name before or after the first dot.
It's not actually relatively rare. We went over this in voice chat. I gave you multiple reasons why someone would want to do this. In practice,
.tar.gz
.spec.ts
.openapi.yaml
.conf.d
(for an example in a directory)It's very popular with javascript (where there is no magic byte and everything is js or ts) .d.ts
, .conf.json
, .conf.js
, .rc.json
, rc.js
, .esm.js
, .cjs.js
. But also with translations .ru.md
etc. Are all common practices that we have to contend with right now many of these in the same project.
The current suggestion to get the functionality is,
.file_name().and_then(OsStr::to_str).map_or(false, |name| name.ends_with(".tar.gz"))
Which I think is a bit convoluted. I am down for doing this other ways, but I don't think something like this should is "rare" as far as path manipulation goes, or that it should require converting to-and-from str
. I think the simple-most question here is
file_suffix
to return the .
that file_prefix
leaves?_ My own preference, I think file = file_prefix + file_suffix
if we go down that route.We could potentially find better ways to do all of this though. Like simply returning an iterator for file_parts
where given foo.bar.tar.gz
could be (list monster in graphic below),
.next().collect::<OsStr>()
, what I think of as extension (.bar.tar.gz
), and I'm patching to be called .file_suffix()
(List Monster tail).last()
, what we currently call .extension()
(List Monster last).first()
what we currently call .file_prefix()
(List Monster head)file_stem
would be all but the last element (not sure if there is a nice way to do that in Rust).. (List Monster init)The bottom line here is that when it comes to the file name, people will logically think of splitting on .
and want names for all possible sets/strings of that list, (from the List Monster in Learn You a Haskell)
Something about that API of returning an Iterator just seems simpler than arguing about whether an "extension" in foo.tar.gz
is .tar.gz
, tar.gz
, .gz
or gz
.
There might've been some misunderstandings here. Let's take this file name File Version 3.5.4 (Copy).tar.gz
(from Jacob Lifshay) as an example.
In this case, .file_prefix()
returns File Version 3
while .file_suffix()
returns .5.4 (Copy).tar.gz
. If you need to determine if this file name ends with .tar.gz
or not, you still need to do stuffs like
.and_then(OsStr::to_str).map_or(false, |name| name.ends_with(".tar.gz"))
IMO this is not optimal. Maybe we could add methods ends_with
and strip_suffix
on OsStr
if this is sufficient for most use cases. But things are getting too complicated in here.
I can see the utility in better processing for File Version 3.5.4 (Copy).tar.gz
. We could define a filename as
filename = ( file_stem, extension? );
file_stem = ( dotfile | non_dots );
dotfile = ( "." + non_dots );
extension = extension_segment+;
extension_segment = ("." , valid_extension );
non_dots = [^.]+?;
valid_extension = [^\s.]+;
I believe the above would make such that
File Version 3.5.4 (Copy).tar.gz
Gets parsed as
file_stem = "File Version 3.5.4 (Copy)"
extension = ".tar.gz"
But it's probably still not sufficient because if the filename was instead File Version 3.5.4.tar.gz
there would literally be nothing you could do if you desired file_name
to be File Version 3.5.4
. And I think that's fine. That's why I advocate returning an iterator for extension
in the above comment, which is what Python does too. That makes processing easier.
IMO this is not optimal. Maybe we could add methods ends_with and strip_suffix on OsStr if this is sufficient for most use cases. But things are getting too complicated in here.
I agree with the OsStr
suggestion. It doesn't make sense anyway not to have strip_suffix
, strip_prefix
, starts_with
and ends_with
on OsStr
. But I still think having good pathing functionality in the library should be a goal, rather than to throw up our hands and say it's "too complex" or to pretend like it's rare to want to know an extension on a file. (Discussion topic on this suggestion opened https://internals.rust-lang.org/t/feature-request-starts-with-and-ends-with-on-osstr/16652)
We could potentially find better ways to do all of this though. Like simply returning an iterator for file_parts where given foo.bar.tar.gz
I think this is probably the "correct" way to approach this. With such an iterator method we can easily implement any/all flavours of splitting up file names.
In borrowing from python's pathlib (which is very ergonomic), it may be as simple as implementing something akin to suffixes
. Because with the method this issue tracks (file_prefix
), and a new method suffixes
, you have all parts.
Using the running example of the corner case file name
use std::path::Path;
let path = Path::new("path/to/File Version 3.5.4 (Copy).tar.gz");
let mut exts = path.suffixes()
assert_eq!(path.file_prefix(), Some("File Version 3"));
assert_eq!(exts.next(), Some(".5"));
assert_eq!(exts.next(), Some(".4 (Copy)"));
assert_eq!(exts.next(), Some(".tar"));
assert_eq!(exts.next(), Some(".gz"));
assert_eq!(exts.next(), None);
Although, I can also understand the argument that the following is more "complete"
use std::path::Path;
let path = Path::new("path/to/File Version 3.5.4 (Copy).tar.gz");
let mut parts = path.file_parts()
assert_eq!(parts.next(), Some("File Version 3"));
assert_eq!(parts.next(), Some(".5"));
assert_eq!(parts.next(), Some(".4 (Copy)"));
assert_eq!(parts.next(), Some(".tar"));
assert_eq!(parts.next(), Some(".gz"));
assert_eq!(parts.next(), None);
FYI, here's my attempt to implement Path::extensions
which is akin to suffixes
in python. Your comments are very welcome.
let mut exts = Path::new("foo.tar.gz").extensions().unwrap();
assert_eq!(exts.next(), Some(OsStr::new("gz")));
assert_eq!(exts.next(), Some(OsStr::new("tar")));
assert_eq!(exts.next(), None);
assert_eq!(exts.visited(), Some(OsStr::new("tar.gz"));
assert_eq!(exts.remaining(), "foo");
With this implementation, we have .path_prefix()
== .extensions().map(|exts| exts.skip_all().remaining())
and .path_suffix()
== .extensions().and_then(|exts| exts.skip_all().visited())
.
Correct me if I'm wrong, but I believe that having a method that extracts the part of file name before or after the first dot is not a good idea. By splitting a file name at the first dot, one asserts that they won't have dots in the non-extension part of file name. However, there are actually a bunch of densely-dotted file names in the wild that go against this design, for example:
pyroute2.nslink-0.6.9.tar.gz
, api.admin.users.service.spec.ts
, Demo.Sales.APISvc.openapi.yaml
, sap.ui.webc.common.d.ts
, proxy.remote.docker.eximee.conf.json
, webpack.dist.components.px.conf.js
, org.freebsd.rc.json
, and so on.
IMO, none of these file names benefits much from a split at the first dot unless it is guaranteed that we deny dots in the non-extension part of a file name. I'd prefer to go with the iterator solution suggested above and I'm willing to open a PR for this.
I believe that having a method that extracts the part of file name before or after the first dot is not a good idea. By splitting a file name at the first dot, one asserts that they won't have dots in the non-extension part of file name.
Agreed, that's why I named the original suggestion file_parts
and not extension
.
We could potentially find better ways to do all of this though. Like simply returning an iterator for
file_parts
where given foo.bar.tar.gz could be (list monster in graphic below),
@EvanCarroll That might be a solution, but I think there are some disadvantages to it:
The file_parts
method returns a forward iterator over the parts of file name. I think this is against the fact that it is often the case that a program considers the non-extension part of file name opaque to path processing. As long as the extensions match, the program may extract the remaining part of file name and pass it to another function for further parsing, whether there are dots in it or not.
For this reason, a forward iterator seems not very practical. Also, I'm afraid it is not allowed to do a .collect::<&OsStr>()
on an iterator that yields OsStr
slices, because that requires &OsStr: FromIterator<&OsStr>
. Doing a .collect::<OsString>()
is possible, but it does unnecessary allocation.
This is why I implemented Path::extensions
and two methods on the resulting Extensions
struct:
.visited()
to return the visited extensions as a consecutive OsStr
..remaining()
to return the part of file name remaining for iteration.If it is really needed to do what .file_prefix()
does right now, we may have:
.skip_all()
to skip all the remaining extensions. Then you call .remaining()
to get file_prefix
or .visited()
to get file_suffix
.And if it is sometimes useful to return the extensions with preceding dots, we could have:
.with_preceding_dots()
to make the iterator do that.Correct me if I'm wrong, but I believe that having a method that extracts the part of file name before or after the first dot is not a good idea. By splitting a file name at the first dot, one asserts that they won't have dots in the non-extension part of file name. However, there are actually a bunch of densely-dotted file names in the wild that go against this design, for example:
pyroute2.nslink-0.6.9.tar.gz
,api.admin.users.service.spec.ts
,Demo.Sales.APISvc.openapi.yaml
,sap.ui.webc.common.d.ts
,proxy.remote.docker.eximee.conf.json
,webpack.dist.components.px.conf.js
,org.freebsd.rc.json
, and so on.IMO, none of these file names benefits much from a split at the first dot unless it is guaranteed that we deny dots in the non-extension part of a file name. I'd prefer to go with the iterator solution suggested above and I'm willing to open a PR for this.
You are right, none of those listed file names benefit from a split at the first dot. But there are many more file names "in the wild" that do. Solution: anyone working with those file names uses some other method to extract the parts they want...
Saying a method is not needed because some corner cases make it useless doesn't negate it being very useful in a whole bunch of other circumstances.
Instead of creating new methods with new semantics that will not treat files with dots on parts of the filename not related to the extension, why not extend the current ones like:
let path = Path::new("foo 1.2.3.tar.gz");
assert_eq!(path.file_stem_multi(2), Some("foo 1.2.3"));
assert_eq!(path.extension_multi(2), Some("tar.gz"));
This will only have a different contract than the current ones, where:
Otherwise, the portion of the file name before the final `.`
is changed to;
Otherwise, the portion of the file name before the nth `.` from the end.
Where file_stem()
and extension()
are just a simple case of these new methods with the argument 1
. Another helper methods like with_extension(..)
could be extended to indicate the number of segments of the extension wanted to be replaced.
Note: named with a better suffix than _multi
.
@thomcc apologies for pinging you directly but I couldn't find a tag for the library team anywhere. @dtolnay
was originally managing this feature but I don't know if he is on the library team anymore? I don't really know how long these features normally take to progress through to stable but this issue has been open for two and a half years which seems like longer than I would expect?
Can you suggest any next steps to get this stabilised?
@thomcc apologies for pinging you directly but I couldn't find a tag for the library team anywhere.
@dtolnay
was originally managing this feature but I don't know if he is on the library team anymore? I don't really know how long these features normally take to progress through to stable but this issue has been open for two and a half years which seems like longer than I would expect? Can you suggest any next steps to get this stabilised?
Opening a stabilization PR is trivial, you just need to change the unstable
gate to stable
as described here https://rustc-dev-guide.rust-lang.org/stability.html#stabilizing-a-library-feature.
That being said, I have to agree with the concerns that the algorithm here can be a bit bikeshed-y and this could be better served by pattern functions on OsStr
, which has been proposed https://github.com/rust-lang/libs-team/issues/311.
Opening a stabilization PR to start libs-api discussion probably isn't a bad idea in any case.
@tgross35 according to the stabilization docs you linked the first step is
Ask a @T-libs-api member to start an FCP on the tracking issue and wait for the FCP to complete (with
disposition-merge
).
Are you a member of said team? If I try and use that identifier in this comment it doesn't come up with anything...
I am not, but team members can always be found in the relevant files on the teams repo https://github.com/rust-lang/team/blob/master/teams/libs-api.toml. Pinging team members isn't really the best way to get something on the radar though, so opening the stabilization PR first and requesting libs-api review is usually easier for trivial things like this (FCP just happens there).
If you have an idea for an iterator-based API it may be worth proposing that too as an alternative. New API proposals go through the ACP process, which is an issue template at https://github.com/rust-lang/libs-team/issues.
(also feel free to drop in on Zulip if you have any more specific questions)
Thanks very much for you help @tgross35. I'll open the PR and we can discuss the FCP there and whether this implementation or an iterator API is preferable.
Feature gate:
#![feature(path_file_prefix)]
This is a tracking issue for an implementation of the method
std::path::Path::file_prefix
. It is effectively a "left" variant of the existingfile_stem
method.Public API
Steps / History
Unresolved Questions