python / cpython


"mmap.flush()" is always synchronous, hurting performance #63016

Open jcea opened 11 years ago

jcea commented 11 years ago
BPO 18816
Nosy @jcea, @MojoVampire

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['extension-modules', 'easy', 'type-feature']
title = '"mmap.flush()" is always synchronous, hurting performance'
updated_at =
user = 'https://github.com/jcea'
```

bugs.python.org fields:
```python
activity =
actor = 'josh.r'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Extension Modules']
creation =
creator = 'jcea'
dependencies = []
files = []
hgrepos = []
issue_num = 18816
keywords = ['easy']
message_count = 3.0
messages = ['195941', '195948', '195971']
nosy_count = 3.0
nosy_names = ['jcea', 'neologix', 'josh.r']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue18816'
versions = ['Python 3.4']
```

jcea commented 11 years ago

Currently, "mmap.flush()" does a synchronous write to the backing file. The call waits until the data has actually been flushed to disk, because internally it does an "msync(MS_SYNC)".
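To illustrate the behaviour described here, a minimal sketch that times a single "flush()" call on a file-backed mapping. The helper name `timed_flush`, the temporary file and the 1 MiB size are made up for illustration; the measured time depends entirely on the OS, the filesystem and the storage underneath.

```python
import mmap
import os
import tempfile
import time


def timed_flush(size=1024 * 1024):
    """Dirty a file-backed mapping, flush it, return (seconds, file contents)."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)            # the file must be as large as the mapping
        with mmap.mmap(fd, size) as mm:
            mm[:] = b"x" * size           # dirty every page of the mapping
            start = time.perf_counter()
            mm.flush()                    # per this report: blocks in msync(MS_SYNC) on Unix
            elapsed = time.perf_counter() - start
        with open(path, "rb") as f:
            data = f.read()
        return elapsed, data
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    elapsed, data = timed_flush()
    print(f"flush() took {elapsed * 1000:.2f} ms for {len(data)} bytes")
```

Running this with larger sizes, or on slow storage, should make the cost of the synchronous write visible.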

But the value of "mmap.flush()" is to synchronize file and memory. You don't need a synchronous write in the general case.

I propose to add an optional keyword parameter with default value "SYNC" (compatibility) but that can be "ASYNC", "INVALIDATE" (can be "SYNC|INVALIDATE" and "ASYNC|INVALIDATE" too).

I am talking about UNIX MMAP. No idea about Windows.

Check "man msync" for useful cases.
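Until something like this exists in the "mmap" module, the idea can be approximated from Python with ctypes. This is only a sketch under stated assumptions: the helper `msync()` below is hypothetical (it is not part of the standard library), the flag values are the Linux ones from `<sys/mman.h>` and differ on other systems, and the mapping is assumed to be writable and to start at a page boundary.

```python
import ctypes
import ctypes.util
import mmap
import os
import tempfile

# Assumption: Linux values from <sys/mman.h>. They are NOT portable;
# other Unix systems use different numbers (notably for MS_SYNC).
MS_ASYNC = 1
MS_INVALIDATE = 2
MS_SYNC = 4

# find_library() may return None, in which case CDLL(None) falls back to
# the symbols already loaded into the process (which include libc on Unix).
_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
_libc.msync.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
_libc.msync.restype = ctypes.c_int


def msync(mm, flags=MS_SYNC):
    """Hypothetical helper: call msync(2) on a whole, writable mmap object."""
    buf = ctypes.c_char.from_buffer(mm)    # borrow the mapping's buffer
    try:
        addr = ctypes.addressof(buf)       # start address of the mapping
        if _libc.msync(addr, len(mm), flags) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        del buf                            # release the buffer so mm can be closed


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, mmap.PAGESIZE)
        with mmap.mmap(fd, mmap.PAGESIZE) as mm:
            mm[:5] = b"hello"
            # Request asynchronous writeback; the kernel may treat this as a no-op.
            msync(mm, MS_ASYNC)
    finally:
        os.close(fd)
        os.unlink(path)
```

A real implementation would live in the C extension and expose the flags as module constants; the default of `MS_SYNC` here mirrors the compatibility default proposed above.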

79528080-9d85-4d18-8a2a-8b1f07640dd7 commented 11 years ago

> I propose to add an optional keyword parameter with default value "SYNC" (compatibility) but that can be "ASYNC", "INVALIDATE" (can be "SYNC|INVALIDATE" and "ASYNC|INVALIDATE" too).

AFAICT it's mostly useless on a modern OS. MS_INVALIDATE is a no-op on systems with a unified VM/buffer cache, i.e. it's not needed for mmap() to reflect write() and vice versa.

So nothing's normally needed to "synchronize file and memory".
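A small sketch of that claim, assuming a Unix system with a unified page cache (such as current Linux): bytes stored through a shared mapping are read back through an independent file descriptor without any "flush()" in between. The helper name `write_via_mmap_read_via_file` is made up for illustration. On a system with a split VM/buffer cache the result is not guaranteed, and the check says nothing about durability on disk.

```python
import mmap
import os
import tempfile


def write_via_mmap_read_via_file(payload=b"hello"):
    """Store bytes through a shared mapping, then read() them without flush()."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, mmap.PAGESIZE)
        with mmap.mmap(fd, mmap.PAGESIZE) as mm:
            mm[:len(payload)] = payload        # deliberately no mm.flush() here
            with open(path, "rb") as f:        # independent file descriptor
                return f.read(len(payload))
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(write_via_mmap_read_via_file())
```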

As for MS_ASYNC, it doesn't actually do anything on recent OSes; e.g. it has been a no-op on Linux for a couple of years, since modified pages are written back as part of the normal writeback process anyway.

The only thing a user might really need from an mmap object is to make sure data is actually committed to disk, and MS_SYNC covers this.

See e.g. this post by Andrew Morton: http://thread.gmane.org/gmane.linux.kernel/1312660

jcea commented 11 years ago

Depending on a specific OS implementation is not good. Linux is not the only OS out there, and I still have very old machines in production:

"""
# uname -a
Linux colquide.XXXX.es 2.4.37 #4 Fri Dec 12 01:10:45 CET 2008 i686 unknown
"""

I have been hit by the VM/file cache split in the past. Portability is important.

Anyway, the Python "mmap" manual says that "mmap.flush()" is needed to be sure that you are not going to "lose" changes you made in the mmap. On "modern" OSes it is not actually needed, as you say, but the performance hit is significant enough for me to have investigated it and written this enhancement proposal :).