iustin / pyxattr

A python module for accessing filesystem Extended Attributes
https://pyxattr.k1024.org/
GNU Lesser General Public License v2.1

Release GIL when doing I/O operations #16

Closed · azraelxyz closed this issue 6 years ago

azraelxyz commented 6 years ago

Release the GIL in _list_obj, _get_obj, _set_obj, and _remove_obj. These functions perform I/O operations that can take a long time, so the GIL should be released to allow other threads to run.
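To illustrate the point (this sketch is not from the original report): a call that releases the GIL lets other threads run while it blocks. Below, `time.sleep` stands in for a blocking xattr syscall, since sleep also releases the GIL:

```python
# Sketch: four threads each block for 0.2 s in a call that releases the
# GIL (time.sleep here; a patched xattr syscall in pyxattr's case).
# Because the GIL is released, the waits overlap instead of serializing.
import threading
import time

def blocking_io():
    time.sleep(0.2)  # stand-in for getxattr()/setxattr()/etc.

start = time.time()
threads = [threading.Thread(target=blocking_io) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start

# Four overlapping 0.2 s waits finish in roughly 0.2 s, not 0.8 s.
print('elapsed: %.2f s' % elapsed)
```

If the extension held the GIL for the whole call instead, the four waits would run one after another and total roughly 0.8 s.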

iustin commented 6 years ago

Hi,

I tested the same thing a few years back, and for local filesystems this actually made things slower - as there's no bulk syscall, taking and releasing the GIL around each call slowed down large filesystem scans/changes non-trivially.

Have you actually benchmarked what effect this has?

azraelxyz commented 6 years ago

Hi iustin,

I ran a micro-benchmark in my environment. It does not show the non-trivial performance impact you mentioned above - the numbers before and after this commit are close. Would you please check it in your environment?

Here are my configuration and test results (seconds for setxattr, listxattr, getxattr, and removexattr over 100k files).

MacBook Pro (13-inch, Mid 2012) · Processor: 2.5 GHz Intel Core i5 · Memory: 16 GB 1600 MHz DDR3 · Disk: Intel Solid State Drive 545s Series 256 GB · OS: OS X El Capitan 10.11.3

| Operation | Before commit 3179f0f (s) | After commit 3179f0f (s) |
| --- | --- | --- |
| setxattr | 3.78645086288 | 3.68556499481 |
| listxattr | 1.78046584129 | 1.76518297195 |
| getxattr | 1.13265991211 | 1.35534405708 |
| removexattr | 4.69510889053 | 4.56970596313 |
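For scale, those totals work out to roughly 11-47 microseconds per operation. A quick back-of-the-envelope calculation (just arithmetic on the numbers above, using the before-commit column):

```python
# Per-call latency implied by the before-commit totals over 100k files.
totals = {
    'setxattr': 3.78645086288,
    'listxattr': 1.78046584129,
    'getxattr': 1.13265991211,
    'removexattr': 4.69510889053,
}
num_files = 100000
for op, seconds in sorted(totals.items()):
    print('%-12s %5.1f us/call' % (op, seconds / num_files * 1e6))
# getxattr ~11.3, listxattr ~17.8, removexattr ~47.0, setxattr ~37.9 us/call
```

At that scale, the fixed cost of taking and releasing the GIL per call is small relative to the syscall itself, which is consistent with the near-identical before/after totals.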

My benchmark script:

```python
# mkdir /tmp/pyxattr; python ./xattr_bench.py /tmp/pyxattr/ 100000
from __future__ import print_function

import os
import sys
import time

import xattr

path = sys.argv[1]
num = int(sys.argv[2])
key = 'user.testxattr'

# Create the test files first so only the xattr calls are timed.
for i in range(num):
    filepath = os.path.join(path, str(i))
    with open(filepath, 'w') as fp:
        fp.write('x')

start = time.time()
for i in range(num):
    filepath = os.path.join(path, str(i))
    xattr.setxattr(filepath, key, 'test')
print('setxattr:', time.time() - start)

start = time.time()
for i in range(num):
    filepath = os.path.join(path, str(i))
    xattr.listxattr(filepath)
print('listxattr:', time.time() - start)

start = time.time()
for i in range(num):
    filepath = os.path.join(path, str(i))
    xattr.getxattr(filepath, key)
print('getxattr:', time.time() - start)

start = time.time()
for i in range(num):
    filepath = os.path.join(path, str(i))
    xattr.removexattr(filepath, key)
print('removexattr:', time.time() - start)
```

BTW, the current pyxattr implementation works perfectly on local file systems such as ext4. However, on file systems accessed over the network, it might be worth considering whether the GIL should be released during xattr operations.
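On a high-latency network filesystem, the win would come from issuing xattr calls from several threads at once. A hedged sketch of that pattern - the `getter` parameter is a hypothetical stand-in introduced here so the demo runs without a real filesystem; in real use one would pass `xattr.getxattr`, and the threads only help if the extension releases the GIL:

```python
# Sketch: fetch one xattr from many files in parallel. `getter` is
# injectable (hypothetical helper, not part of pyxattr) so the pattern
# can be demonstrated without a real filesystem; with pyxattr one would
# pass xattr.getxattr, assuming the extension releases the GIL.
import time
from concurrent.futures import ThreadPoolExecutor

def parallel_getxattr(paths, key, getter, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: getter(p, key), paths))

# Demo with a fake getter that simulates ~10 ms of network latency.
def fake_getter(path, key):
    time.sleep(0.01)
    return '%s:%s' % (path, key)

start = time.time()
values = parallel_getxattr(['f%d' % i for i in range(32)], 'user.test',
                           fake_getter)
print('fetched %d attrs in %.2f s' % (len(values), time.time() - start))
```

With 8 workers and ~10 ms per call, 32 fetches overlap into a few round trips rather than 32 sequential ones; without GIL release in the extension, the same code would degrade back to sequential speed.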

iustin commented 6 years ago

FYI, this is what I get with a micro-benchmark in place (all numbers are microseconds), using Python 2.7 and taking the best of multiple invocations - the built-in variability is high:

Python 2.7:

Overall, the run-to-run variability (CPU scheduler, etc.) is on the same order as, or higher than, the before/after difference, so I think this can go in as is, irrespective of local vs. remote filesystems. It may even be that on a local filesystem with slow I/O, the gains would justify the trivial CPU increase.

iustin commented 6 years ago

Released in version 0.6.1.