python / cpython

The Python programming language
https://www.python.org

Increase shutil.COPY_BUFSIZE #80284

Closed methane closed 5 years ago

methane commented 5 years ago
BPO 36103
Nosy @giampaolo, @methane, @desbma
PRs
  • python/cpython#12115
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at =
created_at =
labels = ['3.8', 'library', 'performance']
title = 'Increase shutil.COPY_BUFSIZE'
updated_at =
user = 'https://github.com/methane'
```

bugs.python.org fields:

```python
activity =
actor = 'methane'
assignee = 'none'
closed = True
closed_date =
closer = 'methane'
components = ['Library (Lib)']
creation =
creator = 'methane'
dependencies = []
files = []
hgrepos = []
issue_num = 36103
keywords = ['patch']
message_count = 10.0
messages = ['336505', '336523', '336533', '336546', '336623', '336643', '336648', '336685', '336987', '336988']
nosy_count = 3.0
nosy_names = ['giampaolo.rodola', 'methane', 'desbma']
pr_nums = ['12115']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'resource usage'
url = 'https://bugs.python.org/issue36103'
versions = ['Python 3.8']
```

    methane commented 5 years ago

shutil.COPY_BUFSIZE is 16 KiB on non-Windows platforms, but that seems a bit small for good performance.

According to this article [1], a 128 KiB buffer gives the best performance on common systems.

Another resource: the EBS documentation [2] uses 128 KiB I/O sizes for throughput.

Can we increase shutil.COPY_BUFSIZE to 128 KiB by default?

Note that 128 KiB is still small compared with the Windows default (1 MiB).
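
For illustration, shutil.copyfileobj() already accepts an explicit chunk size, so the effect of a larger buffer can be measured today without changing COPY_BUFSIZE; a minimal sketch (file names are placeholders):

```python
import shutil

# Copy using an explicit 128 KiB chunk size instead of the module default.
# "src.bin" and "dst.bin" are placeholder paths for this example.
with open("src.bin", "rb") as src, open("dst.bin", "wb") as dst:
    shutil.copyfileobj(src, dst, 128 * 1024)
```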

    3b6a3ebd-2680-4fd8-9657-0a347350b1fc commented 5 years ago

Your first link explains why a 128 kB buffer size is faster in the context of cp: it's due to fadvise and kernel readahead.

None of the shutil functions call fadvise, so the benchmark and conclusions are irrelevant to the Python buffer size IMO.

In general, the bigger the buffer, the better, to reduce syscall frequency (also explained in the article), but going from 16 kB to 128 kB is clearly in the micro-optimization range and unlikely to make any significant difference.

Also, with 3.8, sendfile will be used in many typical file-copy cases (but not all), which makes the buffer size even less important.
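
As a rough illustration of why the Python-level buffer matters less on that path, a sendfile()-style copy loop looks roughly like this (Linux-only sketch; this is not shutil's actual code, and the paths and block size are placeholders):

```python
import os

def sendfile_copy(src_path, dst_path, blocksize=1024 * 1024):
    # The kernel copies data directly between the two file descriptors,
    # so no user-space buffer is involved at all.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        offset = 0
        while True:
            sent = os.sendfile(dst.fileno(), src.fileno(), offset, blocksize)
            if sent == 0:
                break
            offset += sent
```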

    methane commented 5 years ago

> Your first link explains why a 128 kB buffer size is faster in the context of cp: it's due to fadvise and kernel readahead.
>
> None of the shutil functions call fadvise, so the benchmark and conclusions are irrelevant to the Python buffer size IMO.

Even without fadvise, readahead works automatically. fadvise doubles the readahead size on Linux, but I don't know whether it really doubles the readahead when the block device advertises its own readahead size.
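
(For reference, the hint cp uses can be requested from Python as well, although shutil does not do this; a minimal sketch with a placeholder file name:)

```python
import os

with open("big.bin", "rb") as f:  # placeholder file
    # Ask the kernel for aggressive sequential readahead on this file.
    os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
    while f.read(128 * 1024):
        pass
```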

> In general, the bigger the buffer, the better, to reduce syscall frequency (also explained in the article), but going from 16 kB to 128 kB is clearly in the micro-optimization range and unlikely to make any significant difference.
>
> Also, with 3.8, sendfile will be used in many typical file-copy cases (but not all), which makes the buffer size even less important.

sendfile is not used for copyfileobj(), though, and that path still uses this buffer size, so a better default value may be worthwhile.

My Linux box has a SATA SSD (Samsung 860 EVO, 500 GB). Its sequential write performance is unstable.

Here is a quick test:

    $ dd if=/dev/urandom of=f1 bs=1M count=1k
    
    $ ./python -m timeit -n1 -r5 -v -s 'import shutil;' -- 'f1=open("f1","rb"); f2=open("/dev/null", "wb"); shutil.copyfileobj(f1, f2, 8*1024); f1.close(); f2.close()'
    raw times: 301 msec, 302 msec, 301 msec, 301 msec, 300 msec

    1 loop, best of 5: 300 msec per loop

    $ ./python -m timeit -n1 -r5 -v -s 'import shutil;' -- 'f1=open("f1","rb"); f2=open("/dev/null", "wb"); shutil.copyfileobj(f1, f2, 16*1024); f1.close(); f2.close()'
    raw times: 194 msec, 194 msec, 193 msec, 193 msec, 193 msec

    1 loop, best of 5: 193 msec per loop

    $ ./python -m timeit -n1 -r5 -v -s 'import shutil;' -- 'f1=open("f1","rb"); f2=open("/dev/null", "wb"); shutil.copyfileobj(f1, f2, 32*1024); f1.close(); f2.close()'
    raw times: 140 msec, 140 msec, 140 msec, 140 msec, 140 msec

    1 loop, best of 5: 140 msec per loop

    $ ./python -m timeit -n1 -r5 -v -s 'import shutil;' -- 'f1=open("f1","rb"); f2=open("/dev/null", "wb"); shutil.copyfileobj(f1, f2, 64*1024); f1.close(); f2.close()'
    raw times: 112 msec, 112 msec, 112 msec, 112 msec, 112 msec

    1 loop, best of 5: 112 msec per loop

    $ ./python -m timeit -n1 -r5 -v -s 'import shutil;' -- 'f1=open("f1","rb"); f2=open("/dev/null", "wb"); shutil.copyfileobj(f1, f2, 128*1024); f1.close(); f2.close()'
    raw times: 101 msec, 101 msec, 101 msec, 101 msec, 101 msec

Based on these results, I think 64 KiB is the best balance.
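
For a rough sense of why the buffer size matters here: copyfileobj() issues one read() and one write() per chunk, so for the 1 GiB test file created with dd above the syscall counts work out roughly as follows (back-of-the-envelope sketch):

```python
size = 1024 ** 3  # the 1 GiB test file created with dd above
for bufsize in (16 * 1024, 64 * 1024, 128 * 1024):
    calls = 2 * size // bufsize  # one read() + one write() per chunk
    print(f"{bufsize // 1024:>3} KiB buffer -> ~{calls} syscalls")
```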

    3b6a3ebd-2680-4fd8-9657-0a347350b1fc commented 5 years ago

If you do a benchmark by reading from a file, and then writing to /dev/null several times, without clearing caches, you are measuring *only* the syscall overhead:

• input data is read from the Linux page cache, not the file on your SSD itself

• no data is written (obviously, because the output is /dev/null)

Your current command line also measures open/close timings; without that, I think the speed should increase linearly when doubling the buffer size, but of course this is misleading, because it's a synthetic benchmark.

Also, if you clear the caches between tests and write the output file to the SSD itself, sendfile will be used, which should be even faster.

    So again I'm not sure this means much compared to real world usage.

    methane commented 5 years ago

desbma <dutch109@gmail.com> added the comment:

> If you do a benchmark by reading from a file, and then writing to /dev/null several times, without clearing caches, you are measuring *only* the syscall overhead:
>
> • input data is read from the Linux page cache, not the file on your SSD itself

Yes. I am measuring syscall overhead in order to find a reasonable buffer size. shutil may well be used while the page cache is warm.

> • no data is written (obviously, because the output is /dev/null)

As I said before, my SSD doesn't have stable write performance (that is typical for consumer SSDs), so this is intentional. And there are also use cases that copy from/to io.BytesIO or other file-like objects.

> Your current command line also measures open/close timings; without that, I think the speed should increase linearly when doubling the buffer size, but of course this is misleading, because it's a synthetic benchmark.

I'm not measuring the speed of my cheap SSD. The goal of this benchmark is to find a reasonable buffer size. Real-world usage varies widely, so reducing syscall overhead with a reasonable buffer size is worthwhile.

> Also, if you clear the caches between tests and write the output file to the SSD itself, sendfile will be used, which should be even faster.

No. sendfile is not used by shutil.copyfileobj(), even if dst is a real file on disk.

> So again I'm not sure this means much compared to real world usage.

"Real world usage" varies. Sometimes it is not affected; sometimes it is.

On the other hand, what are the downsides of changing 16 KiB to 64 KiB? Windows already uses 1 MiB, and the CPython runtime itself uses a few MiB of memory anyway.

    giampaolo commented 5 years ago

@Inada: having played with this in the past, I seem to remember that on Linux a bigger bufsize doesn't make a reasonable difference (but I may be wrong); that's why I suggest trying some benchmarks. In bpo-33671 I pasted some one-liners you can use (and you should target copyfileobj() instead of copyfile() in order to skip the os.sendfile() path). Also, on Linux, "echo 3 | sudo tee /proc/sys/vm/drop_caches" is supposed to drop the caches.

    methane commented 5 years ago

> Also, on Linux, "echo 3 | sudo tee /proc/sys/vm/drop_caches" is supposed to drop the caches.

As I said already, shutil is not used only with a cold cache.

If the cache is cold, disk speed will be the upper bound in most cases. But when the cache is hot, or a very fast NVMe disk is used, syscall overhead can be non-negligible.

    methane commented 5 years ago

Read this file too: http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/ioblksize.h

coreutils chose 128 KiB as the *minimal* buffer size to reduce syscall overhead. In the case of shutil, we also have Python interpreter overhead on top of the syscall overhead. Who has deeper insight into this than the coreutils authors?

I think 128 KiB is best, but I'm OK with 64 KiB as a conservative decision.

    methane commented 5 years ago

    New changeset 4f1903061877776973c1bbfadd3d3f146920856e by Inada Naoki in branch 'master': bpo-36103: change default buffer size of shutil.copyfileobj() (GH-12115) https://github.com/python/cpython/commit/4f1903061877776973c1bbfadd3d3f146920856e

    methane commented 5 years ago

I chose 64 KiB because the performance difference I can see between 64 KiB and 128 KiB is only up to 5%.
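
For reference, the change described in this thread amounts to a platform-dependent default along these lines (a sketch; see Lib/shutil.py for the authoritative definition, and note the _WINDOWS platform check here is written as an assumption):

```python
import os

# Platform-dependent default as described in this thread:
# 1 MiB on Windows, 64 KiB elsewhere.
_WINDOWS = os.name == "nt"
COPY_BUFSIZE = 1024 * 1024 if _WINDOWS else 64 * 1024
```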