Performance improvements through platform-specific kernel APIs.

Hey! With our brief discussion yesterday, I wanted to better explain what exactly I'm working towards with #160. I would also like to use this issue as a more focused space to discuss the overall design instead of the implementation for just MacOS.

My overall goal is to make copy as fast as possible by taking advantage of features provided by different operating systems.

About Previous PRs

I'm thinking of reworking my previous pull request to be for the design that allows copy to use these features. Then, I can open up a new pull request for each single optimization.

You made some very good points about the problems with my original design, and looking back on it, it focused too much on providing abstraction/flexibility. Yours is a lot simpler, but I feel that it sees the feature as more of an afterthought than a core part of copy. I would like work together on a new one that is: simpler, well-integrated, and able to work within the different limitations of each supported platform's API.

Considerations

Why Fallback

Not all of the platform-specific optimizations are equally as good. They also can't be combined.

In order of best performance first, we have:

Copy on write (let the kernel make a new link to the file)
In-kernel copying (avoid O(n) context switching)
Read and write copies (the current way)

The end result of copying the file is the same, but the way they work is different. Copy-on-write has limitations that the others do not. Then, we also need to consider copying when the source is fs.FS or when using WrapReader since they can't use any of these optimizations.

Because of that, I think taking a fallback approach would be ideal. If copy-on-write can't be used, use in-kernel copying. If in-kernel copying can't be used, do read and write copying.

It's also better to follow the easier to ask for forgiveness approach here. By not checking if something is possible in Go code first and just trying it, it's both less code and faster—the kernel will do the checks itself whether or not we check it in Go first.

Platform API Inconsistencies

Linux and MacOS have different ways of exposing the copy-on-write file API. Depending on that, using them has to be in one of two places:

Before files are opened, as with clonefile. This is because they only take file paths or a dirfd.
After files are opened, when a file descriptor is held. This is because they only take file descriptors, like for sendfile and Linux copy-on-write.

Optimization 1: Darwin Copy-on-Write

Using clonefile.

Manpage

Considerations:

It uses file paths.
It keeps file times.
It removes setuid and setgid permissions.
It requires filesystem support.
It requires src and dest are on the same filesystem.
It fails if the dest file already exists.

Optimization 2: Linux Copy-on-Write

Using the FICLONE ioctl.

Manpage

Considerations:

It uses file descriptors.
It requires filesystem support.
It requires src and dest are on the same filesystem.
(Probably more to find when testing.)

Optimization 3: Linux In-Kernel Copying

Using the sendfile syscall.

Manpage

Considerations:

It's still copying the bytes on the storage.
It will work across filesystems.

Remaining Questions

Is it enabled by default?

Since these are optimizations, they shouldn't change how copy works. Because of that, it would be possible to enable them by default.

However: it's impossible to know how every user is calling Copy, and there's always a chance that the optimization will cause a regression in their use case.

That can be avoided by making it an Option that they enable, or by following Go' convention to create a new copy/v2 package for breaking changes.

How to test the specific optimizations?

We would need a way to test that copying with one of the optimizations has the same expected behavior as doing read-and-write copying.

I did this in #160 with a go test build tag, but maybe there's a better way?

So structured! I love how you gathered the possible optimization solutions.

Thank you so much. Yes, we should consider the fact that there are a bunch of ways of optimization and we are not able to cover them all in our code. That's why I'm trying to give users controllability all the time.

Let me share my thoughts to your questions

Is it enabled by default? // No

Users should have control over how files are copied. Our role is to provide them with the controllability and example functions they can use, such as plugins/copyonwrite. In my opinion, we should avoid imposing our implementation.

Users can define some fallback mechanism like:

// Users use-case

import (
    cp "github.com/otiai10/copy"
    "github.com/otiai10/copy/plugins/copyonwrite"
    "github.com/someone/someplugins/copybuffer"
)

func MyCopyFunc(src, dst string) error {
    if err := copyonwrite.CopyFile(src, dst); err != nil {
        return copybuffer.CopyFile(src, dst)
    }
    return nil
}

func main() {
    opt := cp.Options{ FileCopyFunc: MyCopyFunc }
}

In this way, it is users responsibility to think about how to fallback and also how to handle their environment which might not support copy-on-write.

How to test the specific optimization? // By using `FileCopyFunc` option

as demonstrated in #161

// In our test code

func TestOptions_FileCopyFunc_macOS(t *testing.T) {
    if runtime.GOOS != "darwin" {
        t.Skip("...")
    }
    opt_default := Options{}
    // ....
    opt_cow := Options{ FileCopyFunc: /* ... */ }
    // ...

    // and compare the results of both
}

Explicitness for users

Controllability for users

Readability for us; developers

Following those project goals, what do you think about this?

Start off from the idea you had in #161
Put the copy implementations into copy instead of copy/plugin/*[^1]
Convert FileCopyFunc func(src, dest string) error into opaque type[^2] FileCopyMethod[^3]
Make fcopyByReadAndWrite into a FileCopyMethod called CopyBytes, just like CopyOnWrite
Make the default FileCopyMethod be CopyBytes
Add a new Method called OptimizedCopy that falls back from CopyOnWrite -> CopyInKernel -> CopyBytes (fastest to slowest)[^4]

Example

`Options` / `Method` code

type Options struct {
    // ...

    FileCopyFunc FileCopyMethod

    // ...
}

type FileCopyMethod struct {
    fcopy func(src, dest string, info os.FileInfo, opt Options) (err error)
}

var CopyBytes = FileCopyMethod { fcopy: fcopyByReadAndWrite }
var CopyOnWrite = FileCopyMethod { fcopy: platformCopyOnWrite }

Usage

copy.Copy("from", "to", copy.Options {
    FileCopyMethod: copy.CopyOnWrite
})

`fcopy` code

func fcopy(src, dest string, info os.FileInfo, opt Options) (err error) {
    if err = os.MkdirAll(filepath.Dir(dest), os.ModePerm); err != nil {
        return
    }

    if err := opt.FileCopyMethod(src, dest, info, opt); err != nil {
        return err
    }

    // ----- below is same as your code -----
    chmodfunc, err := opt.PermissionControl(info, dest)
    if err != nil {
        return err
    }
    chmodfunc(&err)
    if os.IsNotExist(err) {
        // See https://github.com/otiai10/copy/issues/72
        // Gracefully handle files/symlinks/dirs deleted while copy
        return nil
    }

    if opt.PreserveOwner {
        if err := preserveOwner(src, dest, info); err != nil {
            return err
        }
    }
    if opt.PreserveTimes {
        if err := preserveTimes(info, dest); err != nil {
            return err
        }
    }

    return
}

`OptimizedCopy`

var OptimizedCopy = FileCopyMethod {
    fcopy: func (src, dest string, info os.FileInfo, opt Options) (err error) {
        if err := platformCopyOnWrite(src, dest, info, opt); err != nil {
            return nil // fastest way worked
        }

        // all faster ways didn't work, try slowest but more compatible one
        return fcopyByReadAndWrite(src, dest, info, opt)
    }
}

[^1]: The way I see it, these are optimizations. Optimizations are part of the package, while plugins are an API for users to add their own functions inside the package. [^2]: An opaque type being type struct { privateImpl: func() } PublicType; var PublicTypeInstance = /* ... */. The user can pick it, but they can't change how it works or make a new one. You can change it to be not opaque later without it being a breaking change. [^3]: Giving the user too much controllability here adds extra design questions: "how does this work if Options.FS is set?" (never call the FileCopyFunc function? return an error?) / "should FileCopyFunc respect Options.Sync and Options.Reader?" (if so, how does the FileCopyFunc get the Options?) [^4]: It's very important that the user can pick the "fastest option that works" without making it himself/herself. We can't assume the normal user has experience with platform-specific optimizations and knows the best one to choose. We also can't assume they know the limitations for each one, but just want something that's fast when it can be & works when they expect it.

Overall, sounds nice to me. Let's do that 👍 We had very good discussion to make it better.

Could you please open another PR if you're ready, and let us talk about the actual code.

Just FYI: To answer your question, even though it might be useless any more 😝 My answers could be like following.

"3. how does this work if Options.FS is set?"

// Users code

func main() {
    xxx := func(s, d string) error {
        // handle FS
        // do what users want
    }
    opt.FileCopyFunc = xxx
}

"4. We can't assume the normal user has experience with platform-specific optimizations and knows the best one to choose. We also can't assume they know the limitations for each one, but just want something that's fast when it can be & works when they expect it"

This is basically different from what I believe. Users SHOULD learn and know. Packages SHOULD NOT do spoon-feeding to them. If we do that, we're gonna have a ton of issues raised by users. If they don't learn and know, it's reasonable that they use the primitive way (CopyBytes in this case) without any doubt. Anyhow, there should be a way to customize, which I called "controllability".

This is basic philosophy I have when I maintain OSS.

To respond to your answers,

func main() {
    xxx := func(s, d string) error {
        // handle FS
        // do what users want
    }
    opt.FileCopyFunc = xxx
}

I understand what you're doing with the capturing closure there, but I think I would do it differently if it were up to me. Capturing closures have a time and a place, but I think using them to keep input parameters in scope is less clear and more unsafe:

// in someone else's package
func MakeCoolCopyFunc(fs fs.FS, options CoolOptions) (func(s, d string) error) {
    return func(s, d string) error {
        // do things with fs
    }
}

// in user's code
func main() {
    fsys := os.DirFs("/path/to/dir")
    copyFunc := MakeCoolCopyFunc(fsys, CoolOptions {
        // ...
    })

    // later:
    copy.Copy("/src", "/dest", copy.Options {
        FS: fsys,
        FileCopyFunc: coolCopyFunc,
    }

    // even later:
    otherFsys := os.DirFs("/other/path")
    copy.Copy("/src", "/dest", copy.Options {
        FS: otherFsys,
        FileCopyFunc: coolCopyFunc, // <-- has wrong fs.FS captured
    })
}

This is basically different from what I believe. Users SHOULD learn and know. Packages SHOULD NOT do spoon-feeding to them. If we do that, we're gonna have a ton of issues raised by users. If they don't learn and know, it's reasonable that they use the primitive way (CopyBytes in this case) without any doubt.

I usually agree, but it depends much on the level of abstraction of the package and if the code is for a feature or an optimization.

For example, a super simple C memcpy function:

void simple_memcpy(char* dest, const char* src, size_t length) {
    for (size_t i = 0; i < length; i++) {
        dest[i] = src[i];
    }
}

It can be faster with amd64 AVX2 vector instructions[^1]:

void vector_memcpy(char* dest, const char* src, size_t length) {
    assert(/* CPU model supports avx2 extensions */);
    assert(size % 32 == 0);
    while(size) {
        _mm256_store_si256 ((__m256i*)dest, _mm256_load_si256((__m256i const*)src));
        src += 32;
        dest += 32;
        size -= 32;
    }
}

But, it has 2 asserts:

The CPU is new enough to support avx2.
The copy length is 32, 64, 96, etc.

It won't work everywhere, but simple_memcopy will. When it does work, both _memcpy functions do the same thing but in a different way.

simple_memcpy and vector_memcpy are low-level memcpy implementations. The memcpy that the user has is a higher level abstraction that calls a lower-level implementation[^2]. Example:

char* memcpy(char* dest, const char* src, size_t length) {
    // use faster memcpy where possible
    if (/* CPU model supports avx2 extensions */) {
        size_t lengthDiv32 = length / 32;
        vector_memcpy(dest, src, lengthDiv32);

        src += lengthDiv32;
        dest += lengthDiv32;
        length -= lengthDiv32;
    }

    // fallback to slower memcpy 
    simple_memcpy(dest, src, length);
}

With that example, the way I see the copy package is as memcpy. The OptimizedCopy is the higher level abstraction that picks the best way to copy a file, and CopyOnWrite/CopyBytes is one of the different ways it can use.

[^1]: Code adapted from this article.

[^2]: Even glibc's memcpy calls 1 of 2 different functions.

otiai10 / copy