python / cpython

The Python programming language
https://www.python.org

shutil.copy2 does not copy sparse files efficiently #122122

Open dechamps opened 1 month ago

dechamps commented 1 month ago

Bug report

Bug description:

$ dd if=/dev/zero of=sparse_original bs=1 count=1 oseek=1M
1+0 records in
1+0 records out
1 byte copied, 2.7367e-05 s, 36.5 kB/s
$ cp sparse_original sparse_cp
$ python3 -c 'import shutil; shutil.copy2("sparse_original", "sparse_pycopy2");'
$ du sparse_original sparse_cp sparse_pycopy2 
4       sparse_original
4       sparse_cp
1028    sparse_pycopy2

Ideally shutil.copy2() should be capable of copying sparse files without unnecessarily expanding them (this is the default behavior of cp).
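The `du` comparison above can also be reproduced from Python. The following is a minimal sketch; `apparent_and_allocated` is a hypothetical helper name, and it assumes that `st_blocks` counts 512-byte units, which is the case on Linux but is not guaranteed on every platform.

```python
import os


def apparent_and_allocated(path):
    """Return (apparent size, allocated size) of path, both in bytes.

    Hypothetical helper for illustration. Assumes st_blocks is expressed in
    512-byte units, as on Linux; other platforms may differ.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512
```

A sparse file typically reports an allocated size well below its apparent size, whereas an expanded copy reports an allocated size at or above it.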

This could potentially be implemented using lseek(2) SEEK_DATA and SEEK_HOLE.
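As a rough illustration of the lseek(2) idea, the sketch below walks a file descriptor and yields the byte ranges that contain data. `data_extents` is a hypothetical name, not an existing shutil function, and the code assumes a platform where Python exposes `os.SEEK_DATA` and `os.SEEK_HOLE`.

```python
import errno
import os


def data_extents(fd):
    """Yield (start, end) byte ranges of fd that hold data, skipping holes.

    Hypothetical helper, not part of shutil. Requires os.SEEK_DATA and
    os.SEEK_HOLE, which are only defined on some platforms.
    """
    size = os.fstat(fd).st_size
    offset = 0
    while offset < size:
        try:
            # Jump to the next byte at or after offset that is not in a hole.
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:
                # No data after offset: the rest of the file is a hole.
                return
            raise
        # Find where this run of data ends (the next hole, or end of file).
        end = os.lseek(fd, start, os.SEEK_HOLE)
        if end <= start:
            # Defensive: the file changed underneath us; stop iterating.
            return
        yield start, end
        offset = end
```

On a filesystem that does not track holes, the kernel may simply report the whole file as one data extent, so a caller still gets a correct (if not sparse) copy plan.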

CPython versions tested on:

3.12

Operating systems tested on:

Linux

picnixz commented 1 month ago

Currently, shutil.copy simply reads from the source file and writes to the destination (more or less efficiently, using sendfile(2) where possible). In addition, SEEK_HOLE is not available on all kernel versions, and for a given kernel version not every filesystem supports it.

So, unless you find a way to do it without those, I don't think we can make it happen for now =/ (well, we could do it by reading the entire file, looking for runs of zeros, and making the copy sparse in those regions, but I wouldn't do that by default).
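For reference, the "read the entire file and look for sparse regions" approach could look roughly like the sketch below. `copy_skipping_zero_blocks` and its `block_size` parameter are hypothetical names chosen for illustration. Note that this turns any all-zero block into a hole, including zeros that were actually allocated in the source, and whether the skipped regions end up as real holes depends on the destination filesystem.

```python
import os


def copy_skipping_zero_blocks(src, dst, block_size=4096):
    """Copy src to dst, seeking over all-zero blocks instead of writing them.

    Hypothetical sketch, not shutil's behaviour. Reads the whole source file,
    so it does not need SEEK_DATA/SEEK_HOLE support.
    """
    zero_block = bytes(block_size)
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        while True:
            block = fsrc.read(block_size)
            if not block:
                break
            if block == zero_block[:len(block)]:
                # Skip ahead; the gap reads back as zeros in the destination.
                fdst.seek(len(block), os.SEEK_CUR)
            else:
                fdst.write(block)
        # Set the final length, in case the file ends with a skipped block.
        fdst.truncate()
    return dst
```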

I will categorize this one as a feature rather than a bug (because it does what it should do, namely copying the file).

dechamps commented 1 month ago

> SEEK_HOLE is not present on all kernel versions (and not all FS support it for the same kernel version)

Why not just try it and fall back to normal copy if it's not supported?
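A minimal sketch of the "try it and fall back" idea follows. `copyfile_sparse` and `_can_seek_holes` are hypothetical names, the probe-then-copy structure is only one possible design, and the code assumes a Unix-like platform where `os.pread` and `os.pwrite` are available. It copies file contents only, not metadata.

```python
import errno
import os
import shutil


def _can_seek_holes(fd):
    """Probe whether SEEK_HOLE appears usable on fd (hypothetical helper)."""
    if not (hasattr(os, "SEEK_DATA") and hasattr(os, "SEEK_HOLE")):
        return False
    try:
        os.lseek(fd, 0, os.SEEK_HOLE)
    except OSError:
        # Unsupported here (or the file is empty): use the normal copy.
        return False
    return True


def copyfile_sparse(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst, preserving holes when possible, else fall back."""
    in_fd = os.open(src, os.O_RDONLY)
    try:
        if not _can_seek_holes(in_fd):
            shutil.copyfile(src, dst)  # ordinary, non-sparse copy
            return dst
        size = os.fstat(in_fd).st_size
        out_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o666)
        try:
            offset = 0
            while offset < size:
                try:
                    start = os.lseek(in_fd, offset, os.SEEK_DATA)
                except OSError as exc:
                    if exc.errno != errno.ENXIO:
                        raise
                    break  # only a trailing hole remains
                end = os.lseek(in_fd, start, os.SEEK_HOLE)
                if end <= start:
                    break  # defensive: source changed during the copy
                pos = start
                while pos < end:
                    buf = os.pread(in_fd, min(chunk_size, end - pos), pos)
                    if not buf:
                        break
                    written = 0
                    while written < len(buf):  # pwrite may write partially
                        written += os.pwrite(out_fd, buf[written:], pos + written)
                    pos += len(buf)
                offset = end
            # Give the destination its full length, including trailing holes.
            os.ftruncate(out_fd, size)
        finally:
            os.close(out_fd)
    finally:
        os.close(in_fd)
    return dst
```

Probing before opening the destination keeps the fallback simple, because the ordinary copy never has to clean up a partially written sparse file.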

> I wouldn't do it by default though

cp does it by default, so it doesn't seem unreasonable to me for a high-level Python copy library to do it as well. cp(1):

> By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well.

picnixz commented 1 month ago

I'll ask experts on those questions:

cc @barneygale (I know you work on path-related topics but I don't know if there are more people working on this topic).

My own opinion is that I don't mind the feature, but I don't know how much work it requires (I'm also not sure whether we actually want to duplicate the behaviour of cp).