python / cpython


improved allocation of PyUnicode objects #46235

Closed pitrou closed 13 years ago

pitrou commented 16 years ago
BPO 1943
Nosy @malemburg, @gvanrossum, @rhettinger, @terryjreedy, @amauryfa, @mdickinson, @pitrou, @vstinner, @ericvsmith, @devdanzin, @benjaminp, @orivej, @ezio-melotti
Files
  • stringbench3k.py: a quick port to py3k of stringbench.py
  • unialloc4.patch
  • freelists2.patch
  • unialloc5.patch
  • unialloc6.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.



    pitrou commented 16 years ago

    This is an attempt at improving allocation of str (PyUnicode) objects. For that it does two things:

    1. make str objects PyVarObjects, so that the object is allocated in one pass rather than two
    2. maintain an array of freelists, each for a given str length, so that many str allocations can bypass actual memory allocation calls

    There is a ~10% speedup in stringbench, and a slight improvement in pybench (as invoked with "./python Tools/pybench/pybench.py -t String"). Also, memory consumption is a bit lower when running those benchmarks.
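
    For readers skimming the thread, here is a minimal, self-contained C sketch of the difference between the two layouts being discussed (a fixed-size header with a separately allocated buffer vs. a variable-size object with the characters allocated inline in one pass). The struct and function names are illustrative only; they are not taken from the patch or from CPython's unicodeobject.c.

    ```c
    #include <stdlib.h>
    #include <wchar.h>

    /* Two-pass layout: fixed-size header plus a separately allocated buffer. */
    typedef struct {
        size_t length;
        wchar_t *str;        /* second allocation */
    } TwoPassStr;

    /* One-pass, PyVarObject-style layout: characters live inside the object. */
    typedef struct {
        size_t length;
        wchar_t str[1];      /* variable-size tail, allocated with the header */
    } InlineStr;

    static TwoPassStr *twopass_new(const wchar_t *src, size_t len)
    {
        TwoPassStr *s = malloc(sizeof(*s));
        if (!s) return NULL;
        s->str = malloc((len + 1) * sizeof(wchar_t));   /* second malloc */
        if (!s->str) { free(s); return NULL; }
        wmemcpy(s->str, src, len);
        s->str[len] = L'\0';
        s->length = len;
        return s;
    }

    static InlineStr *inline_new(const wchar_t *src, size_t len)
    {
        /* single allocation covering header + characters */
        InlineStr *s = malloc(sizeof(*s) + len * sizeof(wchar_t));
        if (!s) return NULL;
        wmemcpy(s->str, src, len);
        s->str[len] = L'\0';
        s->length = len;
        return s;
    }
    ```

    In the one-pass layout the characters move with the object, so resizing can change the object's address - the trade-off debated in the following comments.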

    malemburg commented 16 years ago

    The Unicode object was designed not to be a PyVarObject (in contrast to the string object), so I'm not sure why you would want to break that design in return for a mere 10% speedup.

    Note that turning the objects into PyVarObjects removes the ability to extend the type at C level. It also introduces lots of problems if you want to resize a Unicode object, due to the fact that the memory address of the whole PyUnicodeObject can change. Resizing is done quite often in codecs.

    Regarding your second point: Unicode objects already use a free list and even keep the allocated Py_UNICODE buffer allocated (if it's not too large). Furthermore, the allocation is handled by pymalloc, so you only rarely need to go to the system malloc() to allocate more space.

    Tuning the KEEPALIVE_SIZE_LIMIT will likely have similar effects with respect to speedup.
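
    As background, here is a standalone sketch of the general keep-alive free-list pattern described above (parking deallocated objects, and their small buffers, for later reuse). The names, types and limits are invented for illustration and do not reproduce the actual unicodeobject.c code.

    ```c
    #include <stdlib.h>

    #define KEEPALIVE_LIMIT 9        /* keep small buffers alive (illustrative) */
    #define MAX_FREELIST    1024     /* cap on cached objects (illustrative)    */

    typedef struct StrObj {
        struct StrObj *next_free;    /* links objects parked on the free list */
        size_t capacity;             /* allocated buffer size, in characters  */
        size_t length;
        unsigned short *buf;         /* separately allocated character buffer */
    } StrObj;

    static StrObj *free_list = NULL;
    static int numfree = 0;

    static void str_dealloc(StrObj *s)
    {
        if (numfree < MAX_FREELIST) {
            if (s->capacity > KEEPALIVE_LIMIT) {
                free(s->buf);        /* too large: drop the buffer, keep the header */
                s->buf = NULL;
                s->capacity = 0;
            }
            s->next_free = free_list;  /* park the object instead of freeing it */
            free_list = s;
            numfree++;
            return;
        }
        free(s->buf);
        free(s);
    }

    static StrObj *str_alloc(size_t length)
    {
        StrObj *s;
        if (free_list) {             /* fast path: reuse a parked object */
            s = free_list;
            free_list = s->next_free;
            numfree--;
        } else {
            s = malloc(sizeof(*s));
            if (!s) return NULL;
            s->buf = NULL;
            s->capacity = 0;
        }
        if (s->capacity < length + 1) {
            unsigned short *nb = realloc(s->buf, (length + 1) * sizeof(*nb));
            if (!nb) { free(s->buf); free(s); return NULL; }
            s->buf = nb;
            s->capacity = length + 1;
        }
        s->length = length;
        return s;
    }
    ```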

    pitrou commented 16 years ago

    I just tried bumping KEEPALIVE_SIZE_LIMIT to 200. It makes up for a bit of the speedup, but the PyVarObject version is still faster. Moreover, and perhaps more importantly, it eats less memory (800KB less in pybench, 3MB less when running the whole test suite).

    Then there are of course microbenchmarks. For example:

    # With KEEPALIVE_SIZE_LIMIT = 200
    ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
    1000 loops, best of 3: 556 usec per loop

    # With this patch
    ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
    1000 loops, best of 3: 391 usec per loop

    I don't understand the argument for codecs having to resize the unicode object. Since codecs also have to resize the bytes object when encoding, the argument should apply to bytes objects as well, yet bytes (and str in 2.6) is a PyVarObject.

    I admit I don't know the exact reasons for PyUnicode's design. I just know that the patch doesn't break the API, and runs the test suite fine. Are there any specific things to look for?

    malemburg commented 16 years ago

    Your microbenchmark is biased towards your patched version. The KEEPALIVE_SIZE_LIMIT will only cut in when you deallocate and then reallocate Unicode objects. The free list used for Unicode objects is also limited to 1024 objects - which isn't all that much. You could tune MAX_UNICODE_FREELIST_SIZE as well.

    Regarding memory usage: this is difficult to measure in Python, since pymalloc will keep memory chunks allocated even if they are not in use by Python. However, this is a feature of pymalloc and not specific to the Unicode implementation. It can be tuned in pymalloc. To get more realistic memory measurements, you'd have to switch off pymalloc altogether and then create a separate process that consumes lots of memory to force the OS to have it allocate only memory that's really needed to the process you're running for memory measurements. Of course, keeping objects alive in a free list will always use more memory than freeing them altogether and returning the memory to the OS. It's a speed/space tradeoff. The RAM/CPU costs ratio has shifted a lot towards RAM nowadays, so using more RAM is usually more efficient than using more CPU time.

    Regarding resize: you're right - the string object is a PyVarObject as well and couldn't be changed at the time due to backwards compatibility reasons. You should also note that when I added Unicode to Python 1.6, it was a new and not commonly used type. Codecs were not used much either, so there was no incentive to make resizing strings work better. Later on, other optimizations were added to the Unicode implementation that caused the PyUnicode_Resize() API to also require being able to change the object address. Still, in the common case, it doesn't change the object address.

    The reason for using an external buffer for the Unicode object was to be able to do further optimizations, such as share buffers between Unicode objects. We never ended up using this, though, but there's still a lot of room for speedups and more memory efficiency because of this design.

    Like I already mentioned, PyObjects are also easier to extend at the C level.

    How much speedup do you get when you compare the pybench test with KEEPALIVE_SIZE_LIMIT = 200 against your patched version?

    pitrou commented 16 years ago

    With KEEPALIVE_SIZE_LIMIT = 200, the pybench runtime is basically the same as with my patched version. stringbench remains a bit faster though (~8%).

    You say that RAM size is cheaper than CPU power today, which is true but misses one part of the picture: the importance of CPU caches, and thus of working set size. There are also probably people wanting to use Python in memory-constrained environments (embedded).

    I understand the argument about possible optimizations with an external buffer, but are such optimizations likely to be implemented? (see bpo-1590352 and bpo-1629305). If they really are, then I'm happy with the unicode type remaining a plain PyObject!

    malemburg commented 16 years ago

    I don't really see the connection with bpo-1629305.

    An optimization that would be worth checking is hooking up the Py_UNICODE pointer to interned Unicode objects if the contents match (e.g. do a quick check based on the hash value and then a memcmp). That would save memory and the call to the pymalloc allocator.
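
    A rough standalone illustration of the lookup step being suggested here (a cheap hash/length pre-check followed by memcmp). All names are hypothetical scaffolding, not CPython code; the caller would still have to keep the interned owner alive while sharing its buffer.

    ```c
    #include <stddef.h>
    #include <string.h>

    typedef struct Interned {
        struct Interned *next;
        size_t hash;
        size_t length;              /* length in code units */
        const unsigned short *buf;  /* canonical character buffer */
    } Interned;

    #define NBUCKETS 256
    static Interned *buckets[NBUCKETS];

    /* If an interned string with the same contents exists, return its buffer so
     * the caller can point its own str pointer at it instead of keeping a copy. */
    static const unsigned short *
    find_shared_buffer(const unsigned short *buf, size_t length, size_t hash)
    {
        Interned *it;
        for (it = buckets[hash % NBUCKETS]; it != NULL; it = it->next) {
            if (it->hash != hash || it->length != length)
                continue;                       /* cheap checks first */
            if (memcmp(it->buf, buf, length * sizeof(*buf)) == 0)
                return it->buf;                 /* contents match: share the buffer */
        }
        return NULL;
    }
    ```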

    Another strategy could involve a priority queue style cache with the aim of identifying often used Unicode strings and then reusing them.

    This could also be enhanced using an offline approach: you first run an application with an instrumented Python interpreter to find the most often used strings and then pre-fill the cache or interned dictionary on the production Python interpreter at startup time.

    Coming from a completely different angle, you could also use the Py_UNICODE pointer to share slices of a larger data buffer. A Unicode sub-type could handle this case, keeping a PyObject* reference to the larger buffer, so that it doesn't get garbage collected before the Unicode slice.
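
    A compact standalone model of such a buffer-sharing slice sub-type: a view points into its base object's buffer and keeps the base alive. The structs and reference-counting mechanics are purely illustrative, not the actual implementation.

    ```c
    #include <stdlib.h>

    typedef struct Str {
        int refcnt;
        size_t length;
        unsigned short *buf;       /* owned character buffer */
    } Str;

    typedef struct StrView {
        int refcnt;
        size_t length;
        const unsigned short *buf; /* points into base->buf, no copy is made */
        Str *base;                 /* keeps the original string alive */
    } StrView;

    static StrView *strview_new(Str *base, size_t start, size_t length)
    {
        StrView *v;
        if (start + length > base->length)
            return NULL;
        v = malloc(sizeof(*v));
        if (!v) return NULL;
        v->refcnt = 1;
        v->length = length;
        v->buf = base->buf + start;
        base->refcnt++;            /* the view owns a reference to its base */
        v->base = base;
        return v;
    }

    static void strview_decref(StrView *v)
    {
        if (--v->refcnt > 0)
            return;
        if (--v->base->refcnt == 0) {  /* release the reference kept by the view */
            free(v->base->buf);
            free(v->base);
        }
        free(v);
    }
    ```

    The catch, noted later in the thread, is that a small view can keep a very large base buffer alive.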

    Regarding memory constrained environments: these should simply switch off all free lists and pymalloc. OTOH, even mobile phones come with gigabytes of RAM nowadays, so it's not really worth the trouble, IMHO.

    pitrou commented 16 years ago

    All of those proposals are much heavier to implement; they also make the unicode type more complicated and difficult to maintain probably (while turning it into a PyVarObject actually shortens the implementation a bit). In any case, it's not something I want to tackle.

    The reason that I mentioned bpo-1629305 was that it was such an optimization which complicated the unicode type while bringing very significant speedups.

    malemburg commented 16 years ago

    Agreed, those optimizations do make the implementation more complicated. It's also not clear whether they would really be worth it.

    bpo-1629305 only provided speedups for the case where you write s += 'abc'. The usual idiom for this is to use a list and then concatenate in one go. If you want a really fast approach, you'd use cStringIO or perhaps the bufferarray. I don't think that optimizing for just one particular use case warrants making the code more complicated or changing the C interface radically.

    In your case, I think that closing the door on being able to easily extend the type implementation at the C level is the main argument against it. The speedups are only marginal and can also be achieved (to some extent) by tuning the existing implementation's parameters.

    pitrou commented 16 years ago

    I know it's not the place to discuss bpo-1629305, but the join() solution is not always faster. Why? Because 1) there's the list construction and method call overhead 2) ceval.c has some bytecode hackery to try and make plain concatenations somewhat less slow.

    As for cStringIO, I actually observed in some experiments that it was slower than the other alternatives, at least for short strings. All in all, choosing between the three idioms is far from obvious and needs case-by-case analysis.

    pitrou commented 16 years ago

    FWIW, I tried using the freelist scheme introduced in my patch without making PyUnicode a PyVarObject and, although it's better than the standard version, it's still not as good as the PyVarObject version. Would you be interested in that patch?

    malemburg commented 16 years ago

    Yes, definitely.

    Some comments on style in your first patch:

    pitrou commented 16 years ago

    After some more tests I must qualify what I said. The freelist patch is an improvement in some situations. In others it does not really have any impact. On the other hand, the PyVarObject version handles memory-bound cases dramatically better, see below.

    With a small string: ./python -m timeit -s "s=open('INTBENCH', 'r').read()" "s.split()"
    -> Unpatched py3k: 26.4 usec per loop
    -> Freelist patch: 21.5 usec per loop
    -> PyVarObject patch: 20.2 usec per loop

    With a medium-sized string: ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
    -> Unpatched py3k: 458 usec per loop
    -> Freelist patch: 408 usec per loop
    -> PyVarObject patch: 316 usec per loop

    With a long string: ./python -m timeit -s "s=open('Misc/HISTORY', 'r').read()" "s.split()"
    -> Unpatched py3k: 31.3 msec per loop
    -> Freelist patch: 32.7 msec per loop
    -> PyVarObject patch: 17.8 msec per loop

    (the numbers are better than in my previous posts because the split-accelerating patch has been integrated)

    Also, given those results, it is also clear that neither pybench nor stringbench really stress memory efficiency of strings, only processing algorithms.

    That said, the freelist patch is attached.

    pitrou commented 16 years ago

    Here is an updated patch against the current py3k branch, and with spaces instead of tabs for indentation.

    pitrou commented 16 years ago

    Here is an updated patch, to comply with the introduction of the PyUnicode_ClearFreeList() function.

    eacf80b2-9081-4882-a7f5-27e9644c3853 commented 16 years ago

    Marc-Andre: With the updated patches, is this a set of patches we can accept?

    pitrou commented 16 years ago

    Thanks for your interest Sean :) By the way, on python-3000 both GvR and Martin von Löwis were ok on the proposed design change, although they did not review the patch itself. http://mail.python.org/pipermail/python-3000/2008-February/012076.html

    malemburg commented 16 years ago

    Antoine, as I've already mentioned in my other comments, I'm -1 on changing the Unicode object to a variable size object.

    I also don't think that the micro-benchmarks you are applying really test the implementation in real-life situations. The only case where your patch appears significantly faster is the "long string" case. If you look at the distribution of the Unicode strings generated by this case, you'll find that most strings have less than 10-20 characters. Optimizing pymalloc for these cases and tuning the parameters in the Unicode implementation will likely give you the same performance increase without having to sacrifice the advantages of using a pointer in the object rather than inlining the data.

    I'm +1 on the free list changes, though, in the long run, I think that putting more efforts into improving pymalloc and removing the free lists altogether would be better.

    BTW: Unicode slices would be a possible and fairly attractive target for a C level subclass of Unicode objects. The pointer in the Unicode object implementation could then point to the original Unicode object's buffer and the subclass would add an extra pointer to the original object itself (in order to keep it alive). The Unicode type (written by Fredrik Lundh) I used as basis for the Unicode implementation worked with this idea.

    amauryfa commented 16 years ago

    Marc-Andre: don't all your objections also apply to the 8-bit string type, which is already a variable-size structure? Is extending unicode more common than extending str?

    With python 3.0, all strings are unicode. Shouldn't this type be optimized for the same use cases of the previous str type?

    malemburg commented 16 years ago

    Yes, all those objections apply to the string type as well. The fact that strings are variable length objects makes it impossible to apply any of the possible optimizations I mentioned. If strings were a fixed length object, it would have been possible to write a true basestring object from which both 8-bit strings and Unicode subclass, making a lot of things easier.

    BTW: Please also see ticket bpo-2321 to see how the change affects your benchmarks.

    pitrou commented 16 years ago

    Hi,

    Marc-André, I'm all for "real-life" benchmarks if someone proposes some. Until then we have to live with micro-benchmarks, which seems to be the method used for other CPython optimizations, by the way.

    You are talking about slicing optimizations but you forget that the "lazy slices" idea has been shot down by Guido and others when proposed by Larry Hastings (with a patch) some time ago. Also the "lazy slices" patch was for 8-bit strings, which *are* variable-size objects, which seems to counter your argument that variable-size Unicode objects would prevent making such optimizations.

    As I said the freelist changes actually have mixed consequences, and in general don't improve much (and that improvement is just backed by micro-benchmarks after all... why would it be more convincing than the far greater improvement brought by making Unicode objects variable-size objects?).

    Why wouldn't you express your arguments in the python-3000 thread I launched one month ago, so that at least there is a clear picture of the different arguments for and against this approach?

    malemburg commented 16 years ago

    Regarding benchmarks: It's difficult to come up with decent benchmarks for things like this. A possible strategy is to use an instrumented interpreter that records which Unicode objects are created and when they are deleted. If you then run this instrumented interpreter against a few larger applications such as Zope for a while, you'll get a data set that can be used as input for a more real-life like benchmark. I've done this in the past for Python byte codes to see which are used how often - based on such data you can create optimizations that have a real-life effect.

    Regarding the lazy slice patches: those were not using subclassing, they were patches to the existing type implementations. With subclassing you don't affect all uses of an object, but instead let the user control which uses should take advantage of the slice operations. Since the user also controls the objects which are kept alive this way, the main argument against Unicode slice objects goes away. This is different than patching the type itself and doesn't change the main object implementation in any way. Furthermore, it's possible to implement and provide such subclasses outside the Python core, which gives developers a lot more freedom.

    Regarding discussions on the py3k list: I'm not on that list, since I already get more email than I can manage. I am on the python-dev list, if you want to take up the discussions there.

    pitrou commented 16 years ago

    Well I'm not subscribed to the python-3k list either - too much traffic indeed. You can read and post into it with gmane for example: http://thread.gmane.org/gmane.comp.python.python-3000.devel/11768 (there is probably an NNTP gateway too)

    As for instrumenting the interpreter, this would tell us when and which strings are allocated, but not the precise effect it has on a modern CPU's memory subsystem (which is quite a complicated thing). Also, this is Py3k - we can't test any real-life applications until they are ported to it... (someone really motivated would have to learn Intel VTune or AMD CodeAnalyst and run such a "py3k real-life application" with it :-))

    As for the explicit slicing approach, "explicit string views" have been discussed in 2006 on the python-3k list, including a proof-of-concept patch by Josiah Carlson - without any definitive pronouncement it seems. See the subthread here: http://mail.python.org/pipermail/python-3000/2006-August/003280.html

    The reason I'm bringing in those previous discussions is that, in theory, I'm all for much faster concatenation and slicing thanks to buffer sharing etc., but realistically the previous attempts have failed for various reasons (either technical or political). And since those optimizations are far from simple to implement and maintain, chances are they will not be attempted again, let alone accepted. I'd be happy to be proven wrong :)

    regards

    Antoine.

    malemburg commented 16 years ago

    I've read the comments from Guido and Martin, but they don't convince me in changing my -1.

    As you say: it's difficult to get support for optimizations such a slicing and concatenation into the core. And that's exactly why I want to keep the door open for developers to add such support via extension modules. With the 8-bit string implementation this was never possible.

    With the Unicode implementation and the subclassing support for builtin types, it is possible to add support without having to go through all the hoops and discussions on python-dev or py3k lists.

    I'm also for making Python faster, but not if it limits future optimizations and produces dead-ends. The fact that such optimizations haven't been implemented yet doesn't really mean anything either - people are not yet using Unicode intensively enough to let the need arise.

    pitrou commented 16 years ago

    Well, I'm not gonna try to defend my patch eternally :) I understand your opinion even if I find it a bit disturbing that we refuse a concrete, actual optimization on the basis of future hypothetical ones.

    Since all the arguments have been laid down, I'll let other developers decide whether further discussion of this proposal is useful or not.

    malemburg commented 16 years ago

    Regarding the benchmark: You can instrument a 2.x version of the interpreter to build the data set and then have the test load this data set in Py3k and have it replay the allocation/deallocation in the same way it was done on the 2.x system.

    I also expect that patch bpo-2321 will have an effect on the performance since that patch changed the way memory allocation of the buffer was done (from using the system malloc() to using pymalloc's allocator for the buffer as well).

    pitrou commented 16 years ago

    You are right, bpo-2321 made the numbers a bit tighter:

    With a small string: ./python -m timeit -s "s=open('INTBENCH', 'r').read()" "s.split()"
    -> Unpatched py3k: 23.1 usec per loop
    -> Freelist patch: 21.3 usec per loop
    -> PyVarObject patch: 20.5 usec per loop

    With a medium-sized string: ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
    -> Unpatched py3k: 406 usec per loop
    -> Freelist patch: 353 usec per loop
    -> PyVarObject patch: 314 usec per loop

    With a long string: ./python -m timeit -s "s=open('Misc/HISTORY', 'r').read()" "s.split()"
    -> Unpatched py3k: 22.7 msec per loop
    -> Freelist patch: 24 msec per loop
    -> PyVarObject patch: 20.6 msec per loop

    stringbench3k:
    -> Unpatched py3k: 266 seconds
    -> Freelist patch: 264 seconds
    -> PyVarObject patch: 249 seconds

    Regarding your benchmarking suggestion, this would certainly be an interesting thing to do, but I fear it is also much more than I'm willing to do...

    I'm going to post the updated patches.

    malemburg commented 16 years ago

    Thanks for running the tests again. The use of pymalloc for the buffer made a significant difference indeed. I expect that more can be had by additionally tweaking KEEPALIVE_SIZE_LIMIT.

    It is interesting to see that the free list patch only appears to provide better timings for the "smaller" strings tests. I couldn't find the INTBENCH file anywhere in the Python source tree, but here's the distribution of the words of the latter two files:

    Dev-Python/LICENSE:
    ------------------------------------------------------------------------
    Length: [Count] 0----1----2----3----4----5----6----7----8----9----10 * 10%
     1: [   28] ===
     2: [  350] =================================================
     3: [  354] ==================================================
     4: [  190] ==========================
     5: [  191] ==========================
     6: [  158] ======================
     7: [  164] =======================
     8: [  132] ==================
     9: [  127] =================
    10: [  102] ==============
    11: [   39] =====
    12: [   34] ====
    13: [   21] ==
    14: [   10] =
    15: [   10] =
    16: [    0]
    17: [    0]
    18: [    1]
    19: [    0]
    20: [    0]
    21: [    1]
    22: [    0]
    23: [    0]
    24: [    0]
    25: [    1]
    26: [    1]
    27: [    1]
    28: [    0]
    29: [    1]
    30: [    0]
    31: [    0]
    32: [    0]
    33: [    0]
    34: [    0]
    35: [    0]
    36: [    2]
    37: [    0]
    38: [    0]
    39: [    1]
    40: [    0]
    41: [    0]
    42: [    0]
    43: [    1]
    44: [    1]
    45: [    0]
    46: [    0]
    47: [    0]
    48: [    0]
    49: [    0]
    50: [    1]
    51: [    0]
    52: [    0]
    53: [    0]
    54: [    0]
    55: [    0]
    56: [    0]
    57: [    0]
    58: [    0]
    59: [    0]
    60: [    0]
    61: [    0]
    62: [    0]
    63: [    1]

    Dev-Python/Misc/HISTORY:
    ------------------------------------------------------------------------
    Length: [Count] 0----1----2----3----4----5----6----7----8----9----10 * 10%
     1: [ 6853] ==================
     2: [13920] =====================================
     3: [18401] ==================================================
     4: [12626] ==================================
     5: [ 9545] =========================
     6: [ 9348] =========================
     7: [ 9625] ==========================
     8: [ 7351] ===================
     9: [ 5353] ==============
    10: [ 3266] ========
    11: [ 1947] =====
    12: [ 1336] ===
    13: [  983] ==
    14: [  638] =
    15: [  408] =
    16: [  288]
    17: [  286]
    18: [  216]
    19: [  176]
    20: [  147]
    21: [  120]
    22: [  116]
    23: [   85]
    24: [   89]
    25: [   70]
    26: [   44]
    27: [   59]
    28: [   39]
    29: [   32]
    30: [   65]
    31: [   23]
    32: [   26]
    33: [   28]
    34: [   19]
    35: [   18]
    36: [    9]
    37: [   18]
    38: [    5]
    39: [   10]
    40: [    9]
    41: [    9]
    42: [    1]
    43: [    1]
    44: [    8]
    45: [    4]
    46: [    5]
    47: [    5]
    48: [    3]
    49: [    3]
    50: [    4]
    51: [    0]
    52: [    0]
    53: [    2]
    54: [    0]
    55: [    0]
    56: [    4]
    57: [    2]
    58: [    4]
    59: [    1]
    60: [    0]
    61: [    1]
    62: [    0]
    63: [    1]
    64: [    1]
    65: [    9]
    66: [    1]
    67: [    1]
    68: [    0]
    69: [    0]
    70: [   16]
    71: [    1]
    72: [    0]
    73: [    1]

    Compare that to a typical Python module source file...

    Dev-Python/Lib/urllib.py:
    ------------------------------------------------------------------------
    Length: [Count] 0----1----2----3----4----5----6----7----8----9----10 * 10%
     1: [  806] ==================================================
     2: [  672] =========================================
     3: [  675] =========================================
     4: [  570] ===================================
     5: [  531] ================================
     6: [  501] ===============================
     7: [  296] ==================
     8: [  393] ========================
     9: [  246] ===============
    10: [  147] =========
    11: [  150] =========
    12: [  102] ======
    13: [   90] =====
    14: [  116] =======
    15: [   61] ===
    16: [   51] ===
    17: [   45] ==
    18: [   38] ==
    19: [   31] =
    20: [   39] ==
    21: [   24] =
    22: [   18] =
    23: [   18] =
    24: [   23] =
    25: [   17] =
    26: [   15]
    27: [   13]
    28: [   14]
    29: [   11]
    30: [    9]
    31: [    7]
    32: [    1]
    33: [    6]
    34: [   10]
    35: [    2]
    36: [    4]
    37: [    3]
    38: [    6]
    39: [    1]
    40: [    0]
    41: [    1]
    42: [    5]
    43: [    0]
    44: [    0]
    45: [    0]
    46: [    0]
    47: [    0]
    48: [    2]
    49: [    0]
    50: [    1]
    51: [    1]
    52: [    2]

    The distributions differ a lot, but they both show that typical strings (both in English text and Python program code) have a length of < 25 characters.

    Setting KEEPALIVE_SIZE_LIMIT to 32 should cover most of those cases while still keeping the memory load within bounds. Raising the free list limit to e.g. 2048 would cause at most 64kB to be used by the free list.

    pitrou commented 16 years ago

    Well, of course most words in most languages are below 20 characters. Hence most strings containing words are also below 20 chars. But strings can also contain whole lines (e.g. decoding of various Internet protocols), which are statistically below 80 chars. I took that into account in the freelists patch (it keeps lots of <15 char strings in memory, and a few <90 char strings), which as you can see still yields rather little benefit :)
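
    To make the size-class idea concrete, here is a small standalone sketch of per-length free lists with a generous cap for short strings and a tight cap for medium ones. The limits and names are illustrative only and do not reflect the patch's actual values.

    ```c
    #include <stdlib.h>

    #define SHORT_LIMIT   15   /* strings up to this length get large caches  */
    #define MEDIUM_LIMIT  90   /* up to this length, keep a few objects around */
    #define SHORT_CAP     80   /* per-length cap for short strings            */
    #define MEDIUM_CAP     4   /* per-length cap for medium strings           */

    typedef struct Node {
        struct Node *next;     /* object header reused as a free-list link */
    } Node;

    static Node *freelists[MEDIUM_LIMIT + 1];
    static int   freecount[MEDIUM_LIMIT + 1];

    /* objsize is assumed to be a fixed function of length (the variable-size
     * layout), so a cached block for a given length always has the right size. */
    static void *alloc_str(size_t length, size_t objsize)
    {
        if (length <= MEDIUM_LIMIT && freelists[length]) {
            Node *n = freelists[length];   /* reuse a cached block: no malloc */
            freelists[length] = n->next;
            freecount[length]--;
            return n;
        }
        return malloc(objsize);
    }

    static void free_str(void *obj, size_t length)
    {
        int cap = (length <= SHORT_LIMIT) ? SHORT_CAP
                : (length <= MEDIUM_LIMIT) ? MEDIUM_CAP : 0;
        if (cap && freecount[length] < cap) {
            Node *n = obj;                 /* park the block for later reuse */
            n->next = freelists[length];
            freelists[length] = n;
            freecount[length]++;
            return;
        }
        free(obj);
    }
    ```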

    malemburg commented 16 years ago

    I wasn't clear enough: my point was that your free list patch would probably benefit from some tuning of the cut-off parameters. 15 characters appears to be too small (see the HISTORY file histogram).

    You'll likely get better results for 32 and 256 as cut-off values and by increasing the max list sizes to higher values, e.g. 1024 for the 32 character slots and 146 for the 256 character slots (see the counts in the histograms).

    90baf024-6604-450d-8341-d796fe6858f3 commented 15 years ago

    Collin, can you test this patch with Unladen Swallow's benchmarks?

    cdc164a4-3067-4410-9a2e-ca8363400c3e commented 15 years ago

    Daniel, which patch? freelists2.patch or unialloc4.patch? If these are targeted py3k (judging by the "Versions" selector above), none of Unladen Swallow's benchmarks work under 3k (we're focusing on 2.x).

    pitrou commented 15 years ago

    > Daniel, which patch? freelists2.patch or unialloc4.patch? If these are targeted py3k (judging by the "Versions" selector above), none of Unladen Swallow's benchmarks work under 3k (we're focusing on 2.x).

    They target py3k indeed. Also, they need updating (at least the uniallocX patch which is the interesting part here).

    pitrou commented 15 years ago

    Updated patch against py3k. On a 64-bit system, each unicode object takes 14 bytes less than without the patch (using sys.getsizeof()). Two to four more bytes could be gained by folding the state member into the two lower bits of defenc, but I'm not sure it's worth the trouble.
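
    For readers unfamiliar with the trick mentioned above, here is a tiny standalone illustration of folding a two-bit state value into the low bits of an aligned pointer. It is purely illustrative and not what the patch does.

    ```c
    #include <stdint.h>
    #include <assert.h>

    /* Heap pointers are at least 4-byte aligned on mainstream platforms, so the
     * two lowest bits of a valid object pointer are zero and can carry state. */

    static inline void *pack(void *ptr, unsigned state)
    {
        assert(((uintptr_t)ptr & 3) == 0 && state < 4);
        return (void *)((uintptr_t)ptr | state);
    }

    static inline void *unpack_ptr(void *packed)
    {
        return (void *)((uintptr_t)packed & ~(uintptr_t)3);
    }

    static inline unsigned unpack_state(void *packed)
    {
        return (unsigned)((uintptr_t)packed & 3);
    }
    ```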

    malemburg commented 15 years ago

    Antoine, I think we have to make a decision here: I'm still -1 on changing PyUnicodeObject to be a PyVarObject, but do like your experiments with the free lists.

    I also still believe that tuning the existing parameters in the Unicode implementation and pymalloc would give a better performance gain than what your patch achieves.

    Setting KEEPALIVE_SIZE_LIMIT to 32 would be a first start in that direction.

    However, if you insist on changing the PyUnicodeObject structure, I'll have to reject the patch.

    pitrou commented 15 years ago

    As I already showed, the freelist experiments bring very little improvement.

    malemburg commented 15 years ago

    Ok, then closing the patch as rejected.

    pitrou commented 15 years ago

    Marc-André, please don't close the issue while you're the only one opposing it, thanks.

    malemburg commented 15 years ago

    Antoine, I have explained the reasons for rejecting the patch. In short, it violates a design principle behind the Unicode implementation.

    If you want to change such a basic aspect of the Unicode implementation, then write a PEP which demonstrates the usefulness on a larger set of more general tests and comes up with significant results (10% speedup in some micro benchmarks is not significant; memory tests need to be run without pymalloc and require extra care to work around OS malloc optimization strategies).

    Like I said: The current design of the Unicode object implementation would benefit more from advances in pymalloc tuning, not from making it next to impossible to extend the Unicode objects to e.g.

    • reuse existing memory blocks for allocation,
    • pointing straight into memory mapped files,
    • providing highly efficient ways to tokenize Unicode data,
    • sharing of data between Unicode objects, etc.

    The reason I chose this design was to make the above easily implementable and it was a conscious decision to use a PyObject rather than a PyVarObject, like the string object, since I knew that the Unicode object was eventually going to replace the string object.

    amauryfa commented 15 years ago

    Looking at the comments, it seems that the performance gain comes from the removal of the double allocation which is needed by the current design.

    Was the following implementation considered:

    • keep the current PyUnicodeObject structure
    • for small strings, allocate one chunk of memory: sizeof(PyUnicodeObject)+2*length. Then set self->str=(Py_UNICODE*)(self+1);
    • for large strings, self->str may be allocated separately.
    • unicode_dealloc() must be careful and not free self->str if it is contiguous to the object (it's probably a good idea to reuse the self->state field for this purpose).
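
    A standalone sketch of this single-allocation scheme follows; the struct is a stand-in, not PyUnicodeObject, and the size limit is an arbitrary example.

    ```c
    #include <stdlib.h>

    typedef struct Str {
        size_t length;
        int str_is_inline;        /* plays the role of the proposed state flag */
        unsigned short *str;      /* 2-byte code units, as with a UCS-2 build  */
    } Str;

    #define SMALL_LIMIT 32

    static Str *str_new(size_t length)
    {
        Str *s;
        if (length <= SMALL_LIMIT) {
            /* one chunk: header followed directly by the character buffer */
            s = malloc(sizeof(Str) + (length + 1) * sizeof(unsigned short));
            if (!s) return NULL;
            s->str = (unsigned short *)(s + 1);
            s->str_is_inline = 1;
        } else {
            s = malloc(sizeof(Str));
            if (!s) return NULL;
            s->str = malloc((length + 1) * sizeof(unsigned short));
            if (!s->str) { free(s); return NULL; }
            s->str_is_inline = 0;
        }
        s->length = length;
        return s;
    }

    static void str_dealloc(Str *s)
    {
        if (!s->str_is_inline)    /* only free the buffer if it was separate */
            free(s->str);
        free(s);
    }
    ```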

    malemburg commented 15 years ago

    Amaury Forgeot d'Arc <amauryfa@gmail.com> wrote:

    Looking at the comments, it seems that the performance gain comes from the removal of the double allocation which is needed by the current design.

    Was the following implementation considered:

    • keep the current PyUnicodeObject structure
    • for small strings, allocate one chunk of memory: sizeof(PyUnicodeObject)+2*length. Then set self->str=(Py_UNICODE*)(self+1);
    • for large strings, self->str may be allocated separately.
    • unicode_dealloc() must be careful and not free self->str if it is contiguous to the object (it's probably a good idea to reuse the self->state field for this purpose).

    AFAIK, this has not yet been investigated.

    Note that in real life applications, you hardly ever have to call malloc on small strings - these are managed by pymalloc as pieces of larger chunks and allocation/deallocation is generally fast. You have the same situation for PyUnicodeObject itself (which, as noted earlier, could be optimized in pymalloc even further, since the size of PyUnicodeObject is fixed).

    The OS malloc() is only called for longer strings and then only for the string buffer itself - the PyUnicodeObject is again completely managed by pymalloc, even in this case.

    pitrou commented 15 years ago

    Marc-André, the problem is that all your arguments are fallacious at best. Let me see:

    > Like I said: The current design of the Unicode object implementation would benefit more from advances in pymalloc tuning, not from making it next to impossible to extend the Unicode objects to e.g. [...]

    Saying that is like saying "we shouldn't try to improve ceval.c because it makes it harder to write a JIT". You are dismissing concrete actual improvements in favour of pie-in-the-sky improvements that nobody seems to have tried (you're welcome to prove me wrong) in 10 years of existence of the unicode type.

    Besides, if someone wants to experiment with such improvements, it is not difficult to switch back to the old representation (my patch is very short if you discard the mechanic replacement of "self->length" with "PyUnicode_GET_SIZE(self)", which doesn't have to be undone to switch representations). So, I fail to see the relevance of that argument.

    > Antoine, I have explained the reasons for rejecting the patch. In short, it violates a design principle behind the Unicode implementation.

    You seem to be the only one thinking this while, AFAIK, you haven't been the only one to work on that datatype.

    > (10% speedup in some micro benchmarks is not significant; memory tests need to be run without pymalloc and require extra care to work around OS malloc optimization strategies).

    Actually, running performance or resource consumption tests without pymalloc is pointless since it makes the test completely artificial and unrelated to real-world conditions (who runs Python without pymalloc in real-world conditions?).

    > • reuse existing memory blocks for allocation,
    > • pointing straight into memory mapped files,
    > • providing highly efficient ways to tokenize Unicode data,
    > • sharing of data between Unicode objects, etc.

    By the way, I haven't seen your patches or experiments for those. Giving guidance is nice, but proofs of concept, at the minimum, are more convincing. None of the suggestions above strike me as very /easy/ (actually, they are at least an order of magnitude harder than the present patch), or even guaranteed to give any tangible benefits.

    To be clear, I don't think this proposal is more important than any other one giving similar results (provided these exist). But your arguments are never factual and, what's more, although I have already given the same replies in other messages, you never bothered to be more factual. I would accept your refusal if your arguments had some semblance of concrete support for them.

    amauryfa commented 15 years ago

    > The OS malloc() is only called...

    I know this. But pymalloc has its own overhead, and cache locality will certainly be better if string data is close to the string length.

    The goal is to improve the current usage of strings, and not rely on hypothetical enhancements to the generic pymalloc.

    74c4563b-ab1c-43d8-9219-30c4eca796bc commented 15 years ago

    There were a number of patches to support sharing of data between unicode objects. (By Larry Hastings?) They were rejected because (a)
    they were complicated, and (b) it was possible to provoke pathological memory retention.

    pitrou commented 15 years ago

    > There were a number of patches to support sharing of data between unicode objects. (By Larry Hastings?) They were rejected because (a) they were complicated, and (b) it was possible to provoke pathological memory retention.

    Yes, it's the "lazy strings" patches by Larry Hastings (it was for str, not unicode, though). Issues are bpo-1590352 and bpo-1569040 (and perhaps others).

    In any case, as I said, it is easy to switch back to the old representation, so I don't think it is an argument to block this patch.

    malemburg commented 15 years ago

    Jim Jewett <jimjjewett@users.sourceforge.net> wrote:

    There were a number of patches to support sharing of data between unicode objects. (By Larry Hastings?) They were rejected because (a)
    they were complicated, and (b) it was possible to provoke pathological memory retention.

    Right, but the patches were targeting the main Unicode type implementation.

    It would certainly be possible to implement these features on a Unicode sub-type.

    Note that the Unicode type implementation on which the Python type is based did in fact use references to other objects in order to implement sharing.

    This part was removed from the base type due to the issues with unwillingly keeping alive large reference objects. However, the implementation can be used as basis for writing a Unicode sub-type which does implement data sharing.

    If you're looking for application space where such data sharing types are useful, have a look at parsing engines or routines that split larger chunks of data in multiple smaller pieces. Shared memory is another use case where such types would enable sharing of Unicode data between processes... but I'm repeating myself.

    malemburg commented 15 years ago

    Antoine Pitrou <pitrou@free.fr> wrote:

    > There were a number of patches to support sharing of data between unicode objects. (By Larry Hastings?) They were rejected because (a) they were complicated, and (b) it was possible to provoke pathological memory retention.

    Yes, it's the "lazy strings" patches by Larry Hastings (it was for str, not unicode, though). Issues are bpo-1590352 and bpo-1569040 (and perhaps others).

    In any case, as I said, it is easy to switch back to the old representation, so I don't think it is an argument to block this patch.

    That's not the case.

    The patch breaks C API + binary compatibility for an essential Python type - that's not something you can easily undo.

    pitrou commented 15 years ago

    > The patch breaks C API + binary compatibility for an essential Python type - that's not something you can easily undo.

    I don't see how it breaks C API compatibility. No officially documented function has changed, and the accessor macros still work. Am I missing something?

    As for binary compatibility, yes, it does break it, which isn't an exceptional situation in the development process. We have changed other "essential types" too -- for example, recently, the PyLong object got 30-bit digits on some systems. Why you think it is hard to undo, I don't understand.

    As for the future ABI PEP, which has not yet been accepted, it does not mention PyUnicodeObject as part of the structures which are guaranteed to remain binary-compatible : http://www.python.org/dev/peps/pep-0384/#structures

    malemburg commented 15 years ago

    Antoine Pitrou <pitrou@free.fr> wrote:

    > The patch breaks C API + binary compatibility for an essential Python type - that's not something you can easily undo.

    I don't see how it breaks C API compatibility. No officially documented function has changed, and the accessor macros still work. Am I missing something?

    Yes: The layout and object type of the PyUnicodeObject object.

    You cannot simply recompile your code and have it working. Instead, you have to provide different sub-typing implementations depending on whether PyUnicodeObject is a PyVarObject or PyObject, since these are inherently different in their structure.

    Please note that all type objects documented in the header files not explicitly declared for private use only, are in fact public APIs. You need access to those type objects in order to be able to subclass them.

    As for binary compatibility, yes, it does break it, which isn't an exceptional situation in the development process. We have changed other "essential types" too -- for example, recently, the PyLong object got 30-bit digits on some systems. Why you think it is hard to undo, I don't understand.

    That's a different kind of change. Even though it's very hard to sub-type longs due to their PyVarObject nature and the fact that longs even dig into the PyObject_VAR_HEAD, you can still recompile your code and it will continue to work. The change was to a typedef - the name of the typedef itself has not changed.

    This is similar to compiling Python as UCS2 or UCS4 version - Py_UNICODE will stay the same typedef, but on a UCS2 system it maps to 16 bits, whereas on a UCS4 system it is set to 32 bits.

    Note that the Unicode implementation takes great care not to hide this binary incompatibility - by remapping all APIs to include the UCS2/UCS4 hint in the exported name. As an aside: the long implementation does not.

    As for the future ABI PEP, which has not yet been accepted, it does not mention PyUnicodeObject as part of the structures which are guaranteed to remain binary-compatible : http://www.python.org/dev/peps/pep-0384/#structures

    That's fine, but please note that the ABI PEP only addresses applications that wish to benefit from the binary compatibility across Python versions.

    It has no implications on applications that don't want to use the ABI or cannot, since they are too low-level, such as extensions wishing to sub-class built-in types.

    pitrou commented 15 years ago

    > You cannot simply recompile your code and have it working.

    Who is "you"? People doing mundane things with PyUnicodeObjects certainly can, assuming they use the macros for any member access.

    > Please note that all type objects documented in the header files not explicitly declared for private use only, are in fact public APIs.

    If the datatype layout is not publicly documented in the API reference, it certainly seems to be a non-public part of the API. That's why we have macros for member access, instead of letting people access members directly.

    The fact that my patch doesn't touch any part of the C sources except for the unicode implementation itself seems to support this view as well: people have been using the macros because they understand the actual layout shouldn't be relied upon.

    > You need access to those type objects in order to be able to subclass them.

    As is needed for every other core object whose layout is nevertheless changed now and then... I think it should be expected that any code relying on low-level implementation specifics can break now and then. Changing low-level implementation specifics is often a prerequisite for improving things and it would be foolish to make a promise that we guarantee 100% compatibility at that level.

    (we could of course strengthen the rules for unicode if it was demonstrated that there are several popular instances of subclassing unicode in a C extension. However, I haven't seen any such examples)

    > Note that the Unicode implementation takes great care not to hide this binary incompatibility - by remapping all APIs to include the UCS2/UCS4 hint in the exported name.

    That's because there are UCS2 and UCS4 builds *of the same interpreter version*, and people are not necessarily aware of there being a difference. Such variability is not what we are talking about here.

    malemburg commented 15 years ago

    Here's an example implementation of a Unicode sub-type that allows referencing other Unicode objects:

    http://downloads.egenix.com/python/unicoderef-0.0.1.tar.gz

    As you can see, it's pretty straight-forward to write and I want to keep it that way.