rust-lang / rfcs

RFCs for changes to Rust
https://rust-lang.github.io/rfcs/
Apache License 2.0
5.93k stars 1.57k forks source link

Equivalent to Python's os.path.normpath #2208

Open linclelinkpart5 opened 6 years ago

linclelinkpart5 commented 6 years ago

In Python, os.path.normpath performs a simple lexical normalization of a path, removing redundant dir separators and cur-dir ('.') parts, and resolving up-dir references ('..'). In addition, on Windows, any forward slashes are converted to backslashes.

It seems that Rust's std::fs::canonicalize is functionally similar to Python's os.path.realpath (with the addition of a check to see if the target exists), and is a superset of normpath. However, I have a case where I only want to normalize the path, and not resolve symlinks.

I'm quite new to Rust, but I'd be happy to help contribute a new method to do this to the library if I can.

ExpHP commented 6 years ago

Heavens, yes! I can't believe rust doesn't have something like this!

I think it ought to be a method on Path (since this is path functionality and not filesystem functionality), and should return just a plain old PathBuf--no io::Result--not just because it isn't needed, but also to better communicate its nature as a pure function.

eddyb commented 6 years ago

cc @rust-lang/libs

linclelinkpart5 commented 6 years ago

I think I've successfully written a good first draft of this method. My aim was to replicate the version in Python (linked above) and Go. The documentation for the Go version in turn links to the Plan 9 document detailing the high-level algorithm of this sort of normalization, handy!

Right now, it's a free function, not in the std::path crate yet. As such, the method takes in an explicit &Path as opposed to a &self, but that can easily be fixed.

use std::path::Path;
use std::path::PathBuf;
use std::path::Component;

fn normalize(p: &Path) -> PathBuf {
    let mut stack: Vec<Component> = vec![];

    // We assume .components() removes redundant consecutive path separators.
    // Note that .components() also does some normalization of '.' on its own anyways.
    // This '.' normalization happens to be compatible with the approach below.
    for component in p.components() {
        match component {
            // Drop CurDir components, do not even push onto the stack.
            Component::CurDir => {},

            // For ParentDir components, we need to use the contents of the stack.
            Component::ParentDir => {
                // Look at the top element of stack, if any.
                let top = stack.last().cloned();

                match top {
                    // A component is on the stack, need more pattern matching.
                    Some(c) => {
                        match c {
                            // Push the ParentDir on the stack.
                            Component::Prefix(_) => { stack.push(component); },

                            // The parent of a RootDir is itself, so drop the ParentDir (no-op).
                            Component::RootDir => {},

                            // A CurDir should never be found on the stack, since they are dropped when seen.
                            Component::CurDir => { unreachable!(); },

                            // If a ParentDir is found, it must be due to it piling up at the start of a path.
                            // Push the new ParentDir onto the stack.
                            Component::ParentDir => { stack.push(component); },

                            // If a Normal is found, pop it off.
                            Component::Normal(_) => { let _ = stack.pop(); }
                        }
                    },

                    // Stack is empty, so path is empty, just push.
                    None => { stack.push(component); }
                }
            },

            // All others, simply push onto the stack.
            _ => { stack.push(component); },
        }
    }

    // If an empty PathBuf would be return, instead return CurDir ('.').
    if stack.is_empty() {
        return PathBuf::from(Component::CurDir.as_ref());
    }

    let mut norm_path = PathBuf::new();

    for item in &stack {
        norm_path.push(item.as_ref());
    }

    norm_path
}

Some sample (Unix) file path tests:

fn main() {
    let mut paths = vec![];

    paths.push(Path::new("../../home/thatsgobbles/././music/../code/.."));
    paths.push(Path::new("/home//thatsgobbles/music/"));
    paths.push(Path::new("/../../home/thatsgobbles/././code/../music/.."));
    paths.push(Path::new(".."));
    paths.push(Path::new("/.."));
    paths.push(Path::new("../"));
    paths.push(Path::new("/"));
    paths.push(Path::new(""));
    // More tests for Windows (especially with drive letters and UNC paths) needed.

    for p in &paths {
        let np = normalize(&p);
        println!("{:?} ==> {:?}", &p, &np);
    }
}
ExpHP commented 6 years ago

Brute force comparison:

https://gist.github.com/ExpHP/3f7d8c03be1a45ebe5abd3ad5a517d73

fn gen_strings(alphabet: &[&str], s: String, depth: u8, emit: fn(&str)) {
    emit(&s);
    if let Some(deeper) = depth.checked_sub(1) {
        for &ch in alphabet {
            gen_strings(alphabet, s.clone() + ch, deeper, emit);
        }
    }
}

fn main() {
    gen_strings(&["a", ".", "/"], Default::default(), 12, |s| {
        let p = Path::new(&s);
        println!("{:>15} -> {:>15}", p.display(), normalize(&p).display());
    });
}
import os

def gen_strings(alphabet, s, depth):
    yield s
    if depth > 0:
        for ch in alphabet:
            yield from gen_strings(alphabet, s + ch, depth - 1)

for s in gen_strings("a./", '', 12):
    print("{:>15} -> {:>15}".format(s, os.path.normpath(s)))

Generating all strings of "a", ".", and "/" up to length 12 on Arch, there is one behavioral difference between your implementation and normpath, which is that, if a path begins with // but does not begin with ///, then normpath keeps the leading //. Interestingly, it appears that this is a thing in POSIX. (Makes me wonder if Python still does that on windows?)

It'd be nice to survey more languages and know what decisions and tradeoffs are on the table.

linclelinkpart5 commented 6 years ago

@ExpHP Thank you for the extended tests! I don't think I would have found that corner case without your help.

The double forward slash case is quite strange, I admit. I'm surprised that it ends up resolving to just /. I'm even newer at Go than Rust, but I can see if I can whip up a comparable test using the previously linked .Clean() method. I believe there's also such a routine in Node.js.

I'm going to look into installing Rust on my Windows machine tonight, and trying out some tests on there.

Centril commented 6 years ago

Closing in favor of https://github.com/rust-lang/rust/issues/47402.

akavel commented 5 years ago

@Centril is this correct to close this issue, given that the https://github.com/rust-lang/rust/issues/47402 is (just?) a "tracking issue", which actually points back to this very issue? I'm really sorry if this is something normal, but as a Rust newbie, I really don't understand what's the status of this in such a situation :( Is it open? closed? worked on? not worked on? what could I do if I wanted to, say, submit a solution proposal?

ExpHP commented 5 years ago

I recall that this was associated with a PR that was never finished after a long period of back-and-forth. Looking at the linkbacks, this is it:

https://github.com/rust-lang/rust/pull/47363

Since we haven't heard from the author for a long time, I'd reckon it's fair game for anybody who wants to rebase that PR and address the outstanding feedback.


Issues on the RFC repository generally aren't regarded as actionable items (people for some reason have been using them as feature requests or like a discussion forum), so their open/close state is largely immaterial beyond the potential to confuse people. I'd always look at the linkbacks.

linclelinkpart5 commented 5 years ago

Original author here, apologies for the radio silence. It's been a long time since I've had time to work on this, not to mention my inexperience with contributing to open source in general.

Aside from that, the one issue I encountered with this was that Windows paths weren't all getting normalized as expected, as per a recommended Windows path test suite (seen in the discussion on the PR rust-lang/rust#47363).

My hope was that using Path.components() would be sufficient to do the normalization, but it seems that .components() isn't working as expected for my case.

Mark-Simulacrum commented 5 years ago

I'm going to reopen this instead of https://github.com/rust-lang/rust/issues/47402 as the associated PR never landed so unfortunately we don't actually have anything to track at this point (beyond the desire for the feature itself, which is betters suited to the rfcs repo).

nic-hartley commented 5 years ago

Is this still looking for someone to implement it? I'm happy to give it a go, as long as someone familiar with Windows paths can give me some edge cases to test against.

Also, given that this won't access the filesystem (that's canonicalize's job), how do we want to handle paths like foo/../../bar which go "out of bounds"? Error? Ignore extra .. (as happens when you reach / in *nix)? Something else?

retep998 commented 5 years ago

On Windows foo/../../bar is equivalent to ..\bar so that is how I would handle it there.

gdzx commented 4 years ago

I wrote a port of some functions I needed from the Go's path/filepath module and some discussions even lead to a pre-RFC:

ssokolow commented 3 years ago

Has anyone poked the author of the path-clean crate to see if he'd be interested in getting it into std?

It looks to be an MIT/Apache-2.0 pure-Rust implementation of the same Plan 9 algorithm Go implemented for doing this.

(It does need a little work but it'd need that either way so, if he's willing to work on it...)

gdzx commented 3 years ago

@ssokolow Can you take a look at npath? It is tested on Unix and Windows, more rusty than the Go implementation (I also started from it), and with more methods around normalization.

ssokolow commented 3 years ago

It uses unsafe. Use of unsafe in a place where I don't consider it necessary is an instant disqualification.

gdzx commented 3 years ago

Then I can point you to Path::new, which uses unsafe in the same way each time you call it, and surely is an instant disqualification of the whole Rust standard library.

ssokolow commented 3 years ago

@gdzx I said "where I don't consider it necessary". As far as I'm concerned, this algorithm can be implemented perfectly well, relying only on the unsafe that already exists in the much more thoroughly audited standard library functions, and I strongly believe that unsafe should only be used when it's either necessary for FFI (not for the kind of normalization I want) or empirically proven that the performance benefits are significant enough to justify it. (And I still want a good test suite, being run on CI under Miri.)

If I run out of projects that don't need this functionality before someone contributes code that meets my standards, I'm perfectly willing to implement it myself, with #![forbid(unsafe_code)] on top for good measure.

BurntSushi commented 3 years ago

Whether unsafe is necessary or not in this case is a red herring. Your use of unsafe relies on implementation details and is not a stable part of Path's API: https://doc.rust-lang.org/src/std/path.rs.html#1679

So I personally would consider it to be incorrect use of unsafe.

ssokolow commented 3 years ago

That too. I'll freely admit that, as soon as I saw unsafe in something, my "unsafe? That doesn't need to be here!" reaction stopped me from looking into what it was being used for.

From my perspective, it didn't matter what it was being used for if the algorithm "wasn't supposed to be" using it in the first place.

BurntSushi commented 3 years ago

FWIW, it can be a pretty severe pain point to deal with paths in some cases because its internal wtf-8 representation isn't exposed. Without exposing it, there will always be a strong temptation to do exactly this sort of thing to get access to it.

What I've done in the past to simplify is to use the Unix specific APIs to get the underlying &[u8] directly and safely. Then on Windows, give up 100% correctness and lossily convert the path to UTF-8. But that trade off may not always be appropriate. The alternatives are either a lot more complex or a lot more costly. (But also, exposing the internal wtf-8 comes with its own costs.)

gdzx commented 3 years ago

@ssokolow There is a beautiful thing on GitHub, it's called "Create an issue". You can call it "Remove uses of unsafe when unnecessary". That would be a great start instead of talking nonsense about "disqualification", on a project that isn't even released yet. Also, it is quite an inconsequential issue compared to actually implementing a lexical path processing algorithm, checking its correctness, and its properties.

BurntSushi commented 3 years ago

@ssokolow @gdzx All right, let's dial the temperature down a bit please. Thanks.

gdzx commented 3 years ago

Then let me reestablish the truth: my implementation of normalized does not use unsafe. And the only uses of unsafe is to get the inner Vec<u8> when updating a PathBuf in place, and getting the path separator as a &str. That shouldn't be a deal breaker...

BurntSushi commented 3 years ago

That shouldn't be a deal breaker...

Maybe @ssokolow was a little harsh, but yes, it would also be a deal breaker for me too. It's not just unnecessary. It's incorrect. See my previous comment.

ssokolow commented 3 years ago

@gdzx I was about to apologize for being snippy when @BurntSushi beat me to acknowledging it (I tend to not notice when I'm emotionally compromised until someone points it out). That said, I am still curious why you would think I would spontaneously choose to reply in a communication channel different than the one that we began the interaction in without being explicitly prompted to.

As for unsafe, unnecessary, incorrect, or not, it's a matter of the auditing burden a dependency imposes on me. The crate uses unsafe. I don't want to have to re-audit that every time I bump the version. (I keep a close eye on my cargo-geiger output.)

gdzx commented 3 years ago

@BurntSushi I'll take "a little harsh" as an apology :D.

@ssokolow I don't understand why you would want to "reply in a communication channel different than the one that we began the interaction in".

Also I will repeat that you can just take normalized out my crate and enjoy no unsafe code at all. As for the use of unsafe, I implemented lexical_push just like I would do if I was writing code in std::path::PathBuf, using the equivalent of PathBuf::as_mut_vec which is private.

Anyway, if you have any idea for improvement, feel free to open an issue (just like you did on the other crate you linked to which also uses unsafe).

ssokolow commented 3 years ago

I don't understand why you would want to "reply in a communication channel different than the one that we began the interaction in".

You said...

There is a beautiful thing on GitHub, it's called "Create an issue". You can call it "Remove uses of unsafe when unnecessary". That would be a great start instead of talking nonsense about "disqualification"

...which, to me, felt like "You should have replied in the bug tracker instead of here. Because you didn't read my mind and understand that, you have done wrong."

Also I will repeat that you can just take normalized out my crate and enjoy no unsafe code at all.

If I'd have to copy your code into my project and take responsibility for maintaining a copy of it to avoid having to re-audit the crate every so often, I think I'd prefer to just rewrite the algorithm in a style more intuitive to me: my own.

Honestly, thanks to what the .components() iterator already does, it doesn't seem like it'd be that difficult. Annoying to support both POSIX and Windows to my standards, given that the WTF-8 representation isn't part of the public API, but not difficult.

As for the use of unsafe, I implemented lexical_push just like I would do if I was writing code in std::path::PathBuf, using the equivalent of PathBuf::as_mut_vec which is private.

That's the problem. You wrote it as if it were part of std, when it's not. That disregard for the concepts of encapsulation and public vs. private APIs is why I wouldn't want to rely on your code. I place a very high value on API stability and I wouldn't want to discover, 10 years down the road, that one of my old projects can't be built without extra work because a dependency unnecesarily tied itself to internal implementation details of an old std version that are exempt from the Rust API stability promise.

If your code were part of std then there'd be a whole testing infrastructure to catch breakages and it'd be the responsibility of whoever else was working on std to update it to fix what their patch broke.

BurntSushi commented 3 years ago

As for the use of unsafe, I implemented lexical_push just like I would do if I was writing code in std::path::PathBuf, using the equivalent of PathBuf::as_mut_vec which is private.

Right, you can't do that. That's what is incorrect because it relies on an implementation detail that is not part of the stable API. The next release of Rust could silently break your code. And since you're using unsafe, that breakage could result in memory unsafety.

gdzx commented 3 years ago

@ssokolow

...which, to me, felt like "You should have replied in the bug tracker instead of here. Because you didn't read my mind and understand that, you have done wrong."

You really have a disrespectful tone, I don't understand why. I didn't ask you to read my mind, I asked you to do the same as what you did on the other crate to remove the use of unsafe, which is in my crate in a method that is not in any way related to the very title of this issue: "Equivalent to Python's os.path.normpath".

If I'd have to copy your code into my project and take responsibility for maintaining a copy of it to avoid having to re-audit the crate every so often, I think I'd prefer to just rewrite the algorithm in a style more intuitive to me: my own.

Of course, that's why it is nice to contribute to existing project that already support the use case, have CI, and work on Unix and Windows, so they can improve and eventually release.

@BurntSushi

Right, you can't do that. That's what is incorrect because it relies on an implementation detail that is not part of the stable API. The next release of Rust could silently break your code. And since you're using unsafe, that breakage could result in memory unsafety.

Well, I can do it, but not without the limitations you outlined. However, I have the intention to make it part of the standard library, and it will serve as a good-enough PoC (which it still is now, since it doesn't have any release yet). But if you want to propose an improvement to make this crate suit your use-case, remove all use of unsafe, and work toward a release, then please make a pull request or open an issue!

ssokolow commented 3 years ago

You really have a disrespectful tone,

How is it disrespectful to be honest about how I perceived what you'd said? You said something here, I tactlessly replied here, and you responded with what I, rightly or wrongly, perceived as an attack for having responded here rather than in the bug tracker.

Well, I can do it, but not without the limitations you outlined. However, I have the intention to make it part of the standard library, and it will serve as a good-enough PoC (which it still is now, since it doesn't have any release yet). But if you want to propose an improvement to make this crate suit your use-case, remove all use of unsafe, and work toward a release, then please make a pull request!

I think that might be the point of miscommunication. It wasn't made clear enough that you're intentionally writing it in a form that is not recommended for use outside of std, as a proof of concept for how you feel it would best be implemented as an internal component.

gdzx commented 3 years ago

@ssokolow I can agree on your last point. I will update the README to make that clear, but it's not on crates.io and the v0.1.0 is not released, that's not exactly production-ready. And that doesn't mean it is unsalvageable in the meantime to make it more appropriate for your use-case. Unless of course you want to improve on the Go port, but that is exactly what I did to make this crate.

ssokolow commented 3 years ago

@gdzx I have to go shower and get to sleep, but I'll see if I can fit in a look at it tomorrow.

gdzx commented 3 years ago

@ssokolow Thank you ;).

Ekleog commented 3 years ago

TL;DR for people like me who came back with 20 new notifications of diverse tones: here is a draft of what an implementation of this lifted to libstd might look like, but is currently not usable as a crate due to relying on PathBuf implementation details and thus being unstable. It would have to be adjusted at least a bit to be usable as a crate on regular Rust.

I don't have any kind of authority here, but maybe let's move the discussion about the pros and cons of this draft implementation to the issues of the linked repo, and try not to drown this issue in messages unrelated to the API design and desirability of this idea? :)

gdzx commented 3 years ago

In that case, I just pushed a very early RFC draft for this API. There are some links to implementations in other languages and some explanations. But as I said, it's a first draft.

ssokolow commented 3 years ago

I don't have time to look at the code yet, but I've left some inline comments on the commit for the RFC draft.

Aside from that, I like what is proposed. It has a bunch of functions, like is_inside where I found myself thinking "Yeah. Why did Python's standard library force me to keep reinventing that in my programs if I need it so often?"

Heck, I have some ugly hackery for want of rooted_join in one of my recent hobby projects. (To work around the fact that the Url crate only exposes portable conversion from OS paths to URL path components via from_file_path and from_directory_path as far as I can tell, but "This returns Err if the given path is not absolute".)