python / cpython

The Python programming language
https://www.python.org/
Other
60.24k stars 29.15k forks source link

Support moving across filesystems in pathlib.Path, as shutil.move() does #73991

Open be1526b9-3573-4f3d-abe3-d392fe59c2a9 opened 7 years ago

be1526b9-3573-4f3d-abe3-d392fe59c2a9 commented 7 years ago
BPO 29805
Nosy @brettcannon, @pfmoore, @ericvsmith, @tjguk, @zware, @eryksun, @zooba

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['3.8', 'type-feature', 'library', '3.9', '3.10'] title = 'Support moving across filesystems in pathlib.Path, as shutil.move() does' updated_at = user = 'https://bugs.python.org/LaurentMazuel' ``` bugs.python.org fields: ```python activity = actor = 'eryksun' assignee = 'none' closed = False closed_date = None closer = None components = ['Library (Lib)'] creation = creator = 'Laurent.Mazuel' dependencies = [] files = [] hgrepos = [] issue_num = 29805 keywords = [] message_count = 6.0 messages = ['289549', '289552', '289559', '289630', '289687', '289688'] nosy_count = 8.0 nosy_names = ['brett.cannon', 'paul.moore', 'eric.smith', 'tim.golden', 'Laurent.Mazuel', 'zach.ware', 'eryksun', 'steve.dower'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue29805' versions = ['Python 3.8', 'Python 3.9', 'Python 3.10'] ```

Linked PRs

be1526b9-3573-4f3d-abe3-d392fe59c2a9 commented 7 years ago

Trying to use Pathlib and Path.replace on Windows if drive are different leads to an issue:

  File "D:\\myscript.py", line 184, in update
    client_generated_path.replace(destination_folder)
  File "c:\\program files (x86)\\python35-32\\Lib\\pathlib.py", line 1273, in replace
    self.\_accessor.replace(self, target)
  File "c:\\program files (x86)\\python35-32\\Lib\\pathlib.py", line 377, in wrapped
    return strfunc(str(pathobjA), str(pathobjB), \*args)
OSError: [WinError 17] The system cannot move the file to a different disk drive: 'C:\\\\MyFolder' -\> 'D:\\\\MyFolderNewName'

This is a known situation of os.rename, and workaround I found is to use shutil or to copy/delete manually in two steps (e.g. http://stackoverflow.com/questions/21116510/python-oserror-winerror-17-the-system-cannot-move-the-file-to-a-different-d)

When using Pathlib, it's not that easy to workaround using shutil (even if thanks to Brett Cannon now shutil accepts Path in Py3.6, not everybody has Py3.6). At least this should be documented with a recommendation for that situation. I love Pathlib and it's too bad my code becomes complicated when it was so simple :(

be1526b9-3573-4f3d-abe3-d392fe59c2a9 commented 7 years ago

Just to confirm, I was able to workaround it with Py3.6:

    # client_generated_path.replace(destination_folder)
    shutil.move(client_generated_path, destination_folder)

It is reasonable to think about adding a similar feature to pathlib if it doesn't support it and just special-case the drive-to-drive scenario for Windows

eryksun commented 7 years ago

Moving a file across volumes isn't atomic. Getting an exception in this case can be useful. There could be a "strict" keyword-only parameter that defaults to False. If it's true, then replace() won't try to move the file.

ericvsmith commented 7 years ago

I agree this needs to be different from replace(), due to not being atomic. That makes it an enhancement, so I'm removing 3.5.

I'm +1 on Path supporting something like shutil.move().

brettcannon commented 7 years ago

I also support the idea of getting something like shutil.move() into pathlib.

brettcannon commented 7 years ago

I should also mention that rename() (https://docs.python.org/3/library/pathlib.html#pathlib.Path.rename) and replace() (https://docs.python.org/3/library/pathlib.html#pathlib.Path.replace) already do exist, so it might be best to add a keyword-only flag to one of those for this use-case.

barneygale commented 1 year ago

I'm also +1 on adding pathlib.Path.move(). However I'd prefer we move the shutil implementation into pathlib, as proposed here. The implementation in shutil would become a stub that calls into pathlib, roughly:

def move(src, dst, copy_function=copy2):
    return pathlib.Path(src).move(dst, copy_function=copy2)

I'm interested in this approach because I'm hoping to add a pathlib.AbstractPath class in future. Users would be able to subclass AbstractPath, and by implementing a small handful of low-level abstract methods like stat(), open() and iterdir(), benefit from high-level methods like move() in their objects. This would work even if the source and destination are different kinds of path (e.g. a local filesystem path and a path in a tar archive).

To achieve this we'd need to make pathlib accept bytes paths, as shutil does. Personally I'm fine with this - it's not much work and perfectly reasonable.

We'd also need to move other shutil functions that move() relies on to pathlib, notably the copy*() functions, excluding copyfileobj() and friends.

On rename() / replace(): I'd like these to call through to move(atomic=True). IIRC these methods are identical on POSIX and only differ slightly on Windows. That's a wrinkle we should smooth out in pathlib IMO.

zooba commented 1 year ago

I'd prefer we move the shutil implementation into pathlib

Sounds reasonable to me, though would we want to include copy_function in the API? It seems like an opportunity to say that Path is opinionated about the "most move-like way to copy", and if you need to override that then handle it yourself or stick with shutil.

ericvsmith commented 1 year ago

I think @barneygale 's plan is sound.

pfmoore commented 1 year ago

I agree with @zooba, exposing the copy_function in the pathlib API seems incompatible with the general style of pathlib. Otherwise the plan looks good.

barneygale commented 1 year ago

I agree about copy_function. Perhaps we could do something like this?

    def copy(self, dst, *, follow_symlinks=True, copy_mode=True, copy_stat=False):
        ...

    def move(self, dst, *, atomic=False, **kwargs):
        if atomic:
            return os.rename(self, dst)
        self.copy(dst, **kwargs)
        self.unlink()
zooba commented 1 year ago

I don't like the atomic argument, and especially its default. Implementations should get to implement it as efficiently as possible without needing to respect that kind of argument, and callers shouldn't expect that level of control over semantics.

For example, on Windows we explicitly disallow os.rename from moving files to other drives, because it wouldn't match our documented behaviour. But a new WindowsPath.move function should be able to allow that, and take advantage of the OS optimisations (yes, they exist!) for moving a file rather than copying it across drives.

If we were to define semantics for atomic, the only way to allow this would be "atomic=True will try to move, and atomic=False won't bother trying to move", at which point why would anybody ever pass False? Especially since there's no back-compat here, and so code on older versions will have to stick with try: ...rename ... except: ... copy.

We should make these functions an improvement in semantics, not just discoverability, and ideally more appropriate for whatever platform they're being used on than the POSIX semantics inherent to posixmodule.

So I think the only option we want on move is follow_symlinks=True,[^1] and the method "does whatever is necessary to move the file or directory as efficiently as possible into the new location while matching the semantics of an in-place rename" (roughly, copy_mode=True, copy_stat=True)

Similarly, I think we should simplify copy down to only have follow_symlinks=True,[^2] and the method "does whatever is necessary to copy the file or directory as efficiently as possible into the new location while matching the semantics of a newly created file" (roughly, copy_mode=some bits, copy_stat=some bits - and this one is why the vague definition is important 😉 )

[^1]: Where follow_symlinks=False is carefully defined to mean "if self is a link, it moves the link unless that would cause the new link to fail to refer to the same target as the original link, in which case the target is copied into the new location and the original link is removed".

[^2]: I think the same definition as for move works here, but I haven't thought through this one as thoroughly.

barneygale commented 1 year ago

My example was a bit rubbish - here's a corrected version:

    def copy(self, dst, *, follow_symlinks=True, copy_mode=True, copy_stat=True):
        ...

    def move(self, dst, *, atomic=False, **kwargs):
        try:
            return os.rename(self, dst)
        except OSError:
            if atomic:
                raise  # Move can't be achieved atomically
        self.copy(dst, **kwargs)
        self.unlink()

So move(atomic=True) would call os.rename() only and allow exceptions to propagate, whereas atomic=False will try a rename, and failing that use a copy/delete. I think there's common use cases for both.

I think I agree with you on removing copy_mode and copy_stat.

barneygale commented 1 year ago

I suppose one could argue that folks should just use os.rename() if they want an atomic rename! :)

zooba commented 1 year ago

I suppose one could argue that folks should just use os.rename() if they want an atomic rename! :)

Yeah, I'll argue that one ;)

We'll have to explain that move may result in both old and new names existing simultaneously while the move takes place, depending on the operating system and file systems in use. That's a perfectly fine place to say that if you'd rather that not happen, and are willing to accept exceptions if it's unavoidable, then you should use rename.

barneygale commented 1 year ago

Should we deprecate pathlib.Path.rename() and replace() when we add move()?

zooba commented 1 year ago

Why? They'll just be aliases (with a default arg) to move(), won't they? It's easier to just leave them there.

barneygale commented 1 year ago

That would be a change of behaviour though, right? Path.rename() currently raises an exception if you attempt to move across filesystems, whereas if we changed it to call shutil.move() it would not.

zooba commented 1 year ago

I assumed if you were going to deprecate it, then you'd be adding the atomic option to be able to get the same behaviour back. If that option isn't there, definitely don't deprecate the old functions.

barneygale commented 1 year ago

Oh right. I thought we agreed that users could just call os.rename() or os.replace() if they specifically need that behaviour, and that we wouldn't add an atomic argument to Path.move(). I think I still don't understand what you're proposing! Sorry :/

zooba commented 1 year ago

I think just add the new method, and don't deprecate any of the old ones. Moving people from Path.rename() to os.rename() feels like unnecessary churn to inflict.

We probably want to improve the docs of those to point people towards move() though, as we believe (don't we?) that it'll work more reliably in more circumstances. But there's no reason to force them.

barneygale commented 1 year ago

Thanks for the explanation!

Would you also rule out deprecating Path.rename() and Path.replace() in future? If so I'm -1 on adding move(). It's regrettable that we have two methods for moving files, and I wouldn't want to permanently add a third.

zooba commented 1 year ago

Technically, we have two functions for renaming files, and none for moving ;-)

I'd want to see it shown that they're a maintenance burden or an attractive nuisance[^1] before deprecating. Right now, I'm not convinced they are either (a bit of an attractive nuisance, but their long history outweighs that).

[^1]: Something that users assume will solve their issue, but it actually makes things worse for them.

barneygale commented 4 months ago

Re-visiting this issue!

Earlier I suggested that we should be able to move the shutil.move() implementation into pathlib. I now believe that won't work, because:

So I think that means we need to call shutil.move() from pathlib.Path.move().

We now have a pathlib._abc.PathBase base class which ought to have a move() method too. That method can be implemented atop the existing self.walk(), self.open(), etc methods, though it means also implementing versions of copy(), copytree() and rmtree(). These could be private to begin with, or we could add all four public methods at once.

We could make pathlib.Path.move() accept a pathlib._abc.PathBase | os.PathLike destination. If the destination is os.PathLike we call shutil.move(), otherwise we call super().move(). If we made copy() and copytree() public too, users would have a pretty versatile set of tools to copy/move files not just across local filesystems, but between entirely different types of filesystem. So:

source_path = pathlib.Path('images')
target_path = s3.S3Path('web/images', bucket=..., client=...)
source_path.copytree(target_path)

There's already quite a lot of pathlib changes in 3.13, so I won't look at landing any of this until main becomes 3.14.

barneygale commented 4 months ago

In shutil.move() and the mv shell command, if the destination is an existing directory, the source is copied into that directory.

Is this something we want to replicate in pathlib? It can be a source of bugs, e.g. if you shutil.move('a', 'b') and assume the new file is at 'b'.

Alternatively we could split it into move() and move_into() perhaps?

zooba commented 4 months ago

We shouldn't replicate that behaviour, but if the destination ends with a slash then I think we're okay to keep the original name and make that one work (pathlib paths won't ever end with a slash, I believe).

barneygale commented 2 weeks ago

PRs available:

Once those land I can add copytree() and finally move()

serhiy-storchaka commented 2 weeks ago

Is this a firm established? Because to me, adding the parts of shutil to pathlib looks very wrong. Why you cannot just use shutil directly? All this hassle with the path-like protocol was for making pathlib more usable with os, shutil and other modules without increasing coupling between modules. Adding the shutil dependency to pathlib will provide another argument to avoid using pathlib.

zooba commented 2 weeks ago

Without having looked at the details, I agree with Serhiy. We should avoid import shutil in modules that are "lower level" than it, which includes pathlib.

Passing a Path to shutil functions and having them work correctly (by calling back into pathlib, rather than assuming anything else about the underlying file system) is the important part. Maybe that already works fine? I haven't tried it with e.g. an S3 or Blob Storage implementation of Path.

But perhaps we should be adding copytree/rmtree methods to pathlib anyway that are self-contained? So the base implementation will work for anything that implements its copy, glob and unlink methods (I agree with copy, move and rename being added).

barneygale commented 2 weeks ago

Is this a firm established? Because to me, adding the parts of shutil to pathlib looks very wrong. Why you cannot just use shutil directly?

This is an argument against pathlib in general, and you're more than a decade too late unfortunately.

All this hassle with the path-like protocol was for making pathlib more usable with os, shutil and other modules without increasing coupling between modules.

The os.PathLike stuff meant that os didn't need to import pathlib, but you seem to be suggesting the opposite? This patch doesn't make shutil import pathlib.

Adding the shutil dependency to pathlib will provide another argument to avoid using pathlib.

Outline that argument, please? Increased import time or something?

barneygale commented 2 weeks ago

Passing a Path to shutil functions and having them work correctly (by calling back into pathlib, rather than assuming anything else about the underlying file system) is the important part. Maybe that already works fine? I haven't tried it with e.g. an S3 or Blob Storage implementation of Path.

Pathlib strips important information (e.g. leading ./, trailing /) so we can't implement shutil.rmtree() using Path.rmtree(), because we'd get different behaviour. More here: https://discuss.python.org/t/pathlib-preserve-trailing-slash/33389

We could maybe add a RawUnnormalizingLocalPath class that doesn't perform normalization, which we could employ from shutil.rmtree().

barneygale commented 2 weeks ago

There's some more discussion here: https://discuss.python.org/t/incrementally-move-high-level-path-operations-from-shutil-to-pathlib/19208

I might revive that thread as it covers exactly this ticket. Perhaps adding these methods is only compelling as part of making PathBase public.

zooba commented 2 weeks ago

so we can't implement shutil.rmtree() using Path.rmtree(), because we'd get different behaviour.

Should we implement them separately, then? I suspect shutil is only going to work for "normal" filesystem paths, and so the best thing to do is for pathlib to have its own implementations in terms of its own functions, and for shutil to use __fspath__ and reject anything that doesn't return a file system path?

barneygale commented 2 weeks ago

Should we implement them separately, then?

That's what I'm intending to do in the PathBase implementations - they won't use shutil except for copyfileobj(). All the implementations will be generic and built on lower-level methods like PathBase.open(), symlink_to(), etc.

The story is more complicated in Path:

zooba commented 2 weeks ago
  • copy() should call shutil.copy2() if its argument is os.PathLike, because shutil.copy2() has extensive low-level code to properly copy metadata

It's not extensive, there's just a choice between using open and _winapi.CopyFile on Windows and some error handling. Having separate implementations on WindowsPath and PosixPath ought to make this look much simpler (though of course, we need to fall back on a read/write even in the Windows case, for implementations that may not replicate all of _winapi).

  • rmtree() should probably call shutil.rmtree() to make use of its fd-based safe traversal where available.

It can do this for WindowsPath and PosixPath, but the base implementation shouldn't use shutil because it can't be sure it's dealing with a file system. Using walk, unlink and rmdir here ought to be sufficient, and retrying a failed rmdir by re-walking the directory is probably as robust as it needs to be.

  • move() can also be inherited from PathBase (it uses copytree(), rmtree(), etc)

And we should also (not necessarily in this issue) add a WindowsPath override that uses MoveFileEx, which can already do cross-volume moves if requested.

barneygale commented 2 weeks ago
  • copy() should call shutil.copy2() if its argument is os.PathLike, because shutil.copy2() has extensive low-level code to properly copy metadata

It's not extensive, there's just a choice between using open and _winapi.CopyFile on Windows and some error handling.

It can call through to copystat(), which does more. It feels pretty icky to copy-paste all that into pathlib, right? Maybe I'm overreacting.

  • rmtree() should probably call shutil.rmtree() to make use of its fd-based safe traversal where available.

It can do this for WindowsPath and PosixPath, but the base implementation shouldn't use shutil because it can't be sure it's dealing with a file system. Using walk, unlink and rmdir here ought to be sufficient, and retrying a failed rmdir by re-walking the directory is probably as robust as it needs to be.

Yes that's what I mean, sorry for not being clear. Path.rmtree() calls shutil.rmtree(), but PathBase.rmtree() is a generic implementation atop walk(), unlink() and rmdir(). This is implemented in #119060

zooba commented 2 weeks ago

It can call through to copystat(), which does more.

Yeah, copystat is a bit nasty. Maybe it's something that deserves a (private?) os implementation? It seems more in line with that module to me.

Most of these functions seem to be careful error handling. I'm not sure we can do anything but copy paste all that.

barneygale commented 2 weeks ago

Most of these functions seem to be careful error handling. I'm not sure we can do anything but copy paste all that.

Couldn't we call shutil.copy2() from Path.copy() if the target is os.PathLike? That's how I've implemented it in #119058, and it avoids any copy-pasting!

zooba commented 2 weeks ago

os.PathLike is the criteria used by things outside of pathlib, not by pathlib itself, so I hope you're just using it casually. pathlib ought to know whether its own subclasses are for file systems or not, and if it's a user subclass, then the implementer should override.

But the main objection is to the architectural coupling between the modules - shutil should not be a dependency of pathlib, just as it should not be a dependency of os.path. The direction of the relationship flows the other way.

The best way to deal with it would be to move the implementations from shutil into pathlib and then update shutil to call them there (or move them into os or os.path).

barneygale commented 2 weeks ago

But the main objection is to the architectural coupling between the modules - shutil should not be a dependency of pathlib, just as it should not be a dependency of os.path. The direction of the relationship flows the other way.

I agree that shutil is higher-level than os, but I don't think that shutil is uniformly higher level than pathlib. I think they're mostly on about the same conceptual level, providing "high level" functions built on lower-level building blocks. e.g. pathlib has Path.glob() and Path.walk() that do recursive directory walking using low-level scandir(), and shutil has copytree() and rmtree() that do the same.

In some cases shutil is higher-level, particularly its archiving operations.

In other cases it's slightly lower-level IMO, particularly the implementations of copyfile(), copystat() and friends, which wouldn't seem out-of-place in os, and deal with low-level facets of the local filesystem (BSD symlink permissions, xattrs etc) that pathlib doesn't currently touch.

Well, that's my current view anyway. You have decades more experience on this than I do, so I'll see what it looks like to implement this without pathlib importing shutil :)

pfmoore commented 2 weeks ago

Very off-topic, but I seem to only be getting email notifications of @barneygale's comments, and nothing for any of @zooba's. Which makes following the conversation weirdly difficult... Does anyone know how this could even happen? I have checked, and Steve's messages aren't ending up in my spam folder, so I don't think it's a problem with my email client.

barneygale commented 2 weeks ago

The best way to deal with it would be to move the implementations from shutil into pathlib and then update shutil to call them there

Just to be clear, do you mean that shutil would call new module functions in pathlib, or that shutil would instantiate some kind of Path object and call appropriate methods?

Because if it's the latter, I'm pretty sure we're going to need a new path type (I'll call it "PedanticPath" here) that sits above Path in the hierarchy:

image

Unlike Path, this type would support bytes paths and skip all of pathlibs usual normalization, which means instantiating a PedanticPath object from a string is "lossless" and safe to do from shutil functions. It's actually pretty easy to implement, because PathBase also doesn't normalize by default.

(I think you probably meant module functions, but I wanted to check)

zooba commented 2 weeks ago

I seem to only be getting email notifications of @barneygale's comments, and nothing for any of @zooba's.

I wouldn't blame you for filtering my messages 😆 But no, I'm afraid I don't know of any reason other than that.

Just to be clear, do you mean that shutil would call new module functions in pathlib, or that shutil would instantiate some kind of Path object and call appropriate methods?

I was deliberately non-specific, because I don't know off the top of my head which would be best and I trust you to figure it out.

I'd half hoped that instantiating Path from shutil would be enough, but if you say it's not, then I'm a bit sad but I believe you. Bytes paths should be round-tripping through surrogateescape though, so there shouldn't be any loss. I do recall the discussions about trailing backslashes though.

Now you've shown what would have to happen to the hierarchy, I agree that module-level helper functions are probably best, and putting them in os probably makes even more sense. That module has already gone so far beyond exposing merely what POSIX specifies that I don't think we're breaking anything architecturally there, and the functions ought to only rely on os and os.path functions anyway.

barneygale commented 2 weeks ago

Trailing slashes are indeed the sticking point :-(. If I could go back in time I'd ask Antoine to retain them somehow when he designed pathlib, then go kill baby hitler etc.

I like the idea of moving the lowest-level stuff into os or os.path. I'll try that :-)

pfmoore commented 2 weeks ago

But no, I'm afraid I don't know of any reason other than that.

Would you do me a favour, and add a comment to https://github.com/pfmoore/editables/issues/24, just as somewhere I know I should see notifications from? It'll give me an idea if it's just this repo or more general.

pfmoore commented 2 weeks ago

Weird. I got the notification for that. So the problem is just on this issue (or maybe this project). I'll do some investigating.

barneygale commented 2 weeks ago

Continued on the forum: https://discuss.python.org/t/incrementally-move-high-level-path-operations-from-shutil-to-pathlib/19208/32