nim-lang / RFCs


faster `copyFile` via zero-copy API's (`copyfile`, `splice` etc) #330

Open timotheecour opened 3 years ago

timotheecour commented 3 years ago

EDIT: here's additional data: http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/ioblksize.h?id=c0a79542fb5c2c22cf0a250db94af6f8581ca342#n23

As of May 2014, 128KiB is determined to be the minimum blksize to best minimize system call overhead. This can be tested with this script: [...]
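For reference, a plain read/write copy loop using that block size could look like the sketch below. This is not the coreutils test script mentioned above; `copy_buffered` and `BLOCK_SIZE` are hypothetical names chosen for illustration, and the 128 KiB default is simply the value quoted from `ioblksize.h`.

```python
import os

BLOCK_SIZE = 128 * 1024  # 128 KiB, the value quoted from coreutils' ioblksize.h


def copy_buffered(src_path, dst_path, block_size=BLOCK_SIZE):
    """Hypothetical helper: copy a file with plain read/write calls.

    Returns the number of bytes copied.
    """
    total = 0
    in_fd = os.open(src_path, os.O_RDONLY)
    try:
        out_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            while True:
                chunk = os.read(in_fd, block_size)  # one read syscall per block
                if not chunk:
                    break  # end of file
                view = memoryview(chunk)
                # os.write may write fewer bytes than requested; loop until done
                while len(view) > 0:
                    written = os.write(out_fd, view)
                    view = view[written:]
                total += len(chunk)
        finally:
            os.close(out_fd)
    finally:
        os.close(in_fd)
    return total
```

Every block still travels through userspace twice here (one read, one write), which is the overhead the zero-copy calls discussed in this thread avoid.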

(tested on several systems) and https://github.com/coreutils/coreutils/blob/master/src/ioblksize.h#L23-L57

links

future work

Araq commented 3 years ago

Not that I mind these ideas, but the stdlib doesn't claim to provide the fastest (or "best") way of copying files around. And an implementation should strive for small code size and maintainability.

juancarlospaco commented 3 years ago

So far the diff is tiny, and most lines are tests anyway. :)

timotheecour commented 3 years ago

but the stdlib doesn't claim to provide the fastest (or "best") way of copying files around

I think the stance should be:

And an implementation should strive for small code size and maintainability.

all else being equal, yes, but not at the expense of performance where it matters. Small (generated asm) code size is what the compiler flag --lean is for (ping on https://github.com/nim-lang/Nim/pull/14282 which introduces this generally useful flag); in future work it can be used in more places to provide leaner implementations where this matters most.

rominf commented 3 years ago

I propose changing the 2nd proposal: use sendfile instead of splice, as it's more specific and easier to use. Also, sendfile is implemented using splice, see: https://stackoverflow.com/a/4483342/2108548 https://code.woboq.org/linux/linux/fs/read_write.c.html#do_sendfile

A readable example of sendfile usage can be found in Julia's stdlib: https://github.com/JuliaLang/julia/blob/6468dcb04ea2947f43a11f556da9a5588de512a0/base/filesystem.jl#L116

What do you think?

timotheecour commented 3 years ago

I suggest using sendfile instead of splice

https://stackoverflow.com/a/7464280/1426932

Unfortunately, you cannot use sendfile() here because the destination is not a socket. (The name sendfile() comes from send() + "file").

rominf commented 3 years ago

In Linux kernels before 2.6.33, out_fd must refer to a socket. Since Linux 2.6.33 it can be any file. If it is a regular file, then sendfile() changes the file offset appropriately. See: https://man7.org/linux/man-pages/man2/sendfile.2.html

PS: StackOverflow is not the absolute source of Truth :-)

rominf commented 3 years ago

And yes, I would use sendfile, as even Debian oldstable ships a newer kernel version.

timotheecour commented 3 years ago

Yes, you are correct, I should've double-checked; just added a comment in https://stackoverflow.com/questions/7463689/most-efficient-way-to-copy-a-file-in-linux/7464280#comment117018927_7464280

timotheecour commented 3 years ago

@rominf you'll also need to handle partial transfers and resume from where the call left off (a single call maxes out at about 2GB IIRC), along with the usual EINTR handling etc.
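A rough sketch of such a resume loop, written in Python only because `os.sendfile` exposes the same syscall. `copy_with_sendfile` and `MAX_CHUNK` are hypothetical names, the 1 GiB cap is an arbitrary illustrative value below the per-call limit mentioned above, and the sketch assumes Linux 2.6.33 or later with a regular file as the destination.

```python
import os

# Assumption: an illustrative per-call cap, kept below the kernel's ~2GB limit.
MAX_CHUNK = 2**30  # 1 GiB


def copy_with_sendfile(src_path, dst_path):
    """Hypothetical helper: copy a regular file with os.sendfile.

    Returns the number of bytes transferred.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        remaining = os.fstat(src.fileno()).st_size
        offset = 0
        while remaining > 0:
            # sendfile may transfer fewer bytes than requested, so each call
            # resumes from `offset`. Python 3.5+ retries on EINTR itself
            # (PEP 475); a C or Nim port would need an explicit retry.
            sent = os.sendfile(dst.fileno(), src.fileno(), offset,
                               min(remaining, MAX_CHUNK))
            if sent == 0:
                break  # source was truncated while copying
            offset += sent
            remaining -= sent
    return offset
```

Passing an explicit `offset` leaves the source descriptor's own file position untouched, which keeps the resume logic in one variable.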

rominf commented 3 years ago

Is it OK to rewrite this code: https://github.com/python/cpython/blob/1b57426e3a7842b4e6f9fc13ffb657c78e5443d4/Lib/shutil.py#L114?

timotheecour commented 3 years ago

check also boost::filesystem, which recently added sendfile support to speed up its copy_file API: refs https://github.com/boostorg/filesystem/commit/9182b4caa34f14a246f3bcd6cae5ad9cb270682d (but look at latest sources instead of that commit)

note also this code:


#if defined(BOOST_FILESYSTEM_USE_SENDFILE)
    // sendfile started accepting file descriptors as the target in Linux 2.6.33
    if (major > 2u || (major == 2u && (minor > 6u || (minor == 6u && patch >= 33u))))
      cfd = &copy_file_data_sendfile;
#endif

#if defined(BOOST_FILESYSTEM_USE_COPY_FILE_RANGE)
    // Although copy_file_range appeared in Linux 4.5, it did not support cross-filesystem copying until 5.3
    if (major > 5u || (major == 5u && minor >= 3u))
      cfd = &copy_file_data_copy_file_range;
#endif

which favors the more recent copy_file_range on recent enough Linux kernel versions.
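The same version-based selection can be sketched outside of Boost as follows. `kernel_version` and `pick_copy_strategy` are hypothetical helpers, the returned strings are only labels, and the thresholds are the ones from the Boost comments above; the sketch assumes a Linux-style release string such as `5.15.0-91-generic`.

```python
import platform
import re


def kernel_version(release=None):
    """Parse (major, minor, patch) from a release string like '5.15.0-91-generic'."""
    if release is None:
        release = platform.release()
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        return (0, 0, 0)  # unknown format: treat as too old for anything
    return tuple(int(part) if part else 0 for part in match.groups())


def pick_copy_strategy(release=None):
    """Hypothetical selector: name the preferred copy primitive for a kernel."""
    version = kernel_version(release)
    if version >= (5, 3, 0):
        return "copy_file_range"  # cross-filesystem copies work since 5.3
    if version >= (2, 6, 33):
        return "sendfile"  # file descriptors accepted as target since 2.6.33
    return "read_write"  # plain read/write loop


print(pick_copy_strategy("4.19.0-25-amd64"))  # prints "sendfile"
```

As in the Boost code, the later check wins, so a kernel that supports both calls ends up with copy_file_range.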

refs

sendfile() only works if the source file descriptor refers to something that can be mmap()ed (i.e. mostly normal files)

FedericoCeratto commented 3 years ago

sendfile is fast and has been around for a long time; the stdlib should also use it for sockets and [async]httpserver. Related: #304 https://github.com/nim-lang/Nim/issues/9716#issuecomment-439053314 https://github.com/nim-lang/Nim/issues/4334#issuecomment-225906512

Also, copy_file_range is even more efficient on COW filesystems by not duplicating data on disk.
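A minimal sketch of preferring copy_file_range with a fallback, using Python's `os.copy_file_range` (available on Linux with Python 3.8+). `copy_prefer_copy_file_range` is a hypothetical name. Whether the kernel actually shares extents instead of duplicating data depends on the filesystem, so the sketch only shows the call pattern.

```python
import os
import shutil


def copy_prefer_copy_file_range(src_path, dst_path):
    """Hypothetical helper: try os.copy_file_range, finish with read/write.

    Returns the size of the destination file.
    """
    copy_range = getattr(os, "copy_file_range", None)  # missing on older Pythons
    # buffering=0 gives raw file objects, so reads and writes below start at
    # the offsets the kernel advanced during copy_file_range.
    with open(src_path, "rb", buffering=0) as src, \
            open(dst_path, "wb", buffering=0) as dst:
        remaining = os.fstat(src.fileno()).st_size
        try:
            while copy_range is not None and remaining > 0:
                # With no explicit offsets, both file positions are advanced.
                copied = copy_range(src.fileno(), dst.fileno(), remaining)
                if copied == 0:
                    break
                remaining -= copied
        except OSError:
            # Unsupported kernel or filesystem pair: fall through to the
            # plain loop, which will raise again on a genuine I/O error.
            pass
        # Copies whatever is left from the current offsets (nothing if the
        # loop above completed); adequate for regular files in this sketch.
        shutil.copyfileobj(src, dst, 128 * 1024)
    return os.path.getsize(dst_path)
```

Running the plain loop unconditionally at the end keeps the error path and the success path identical, at the cost of one extra read that returns end of file.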