Open gpshead opened 11 years ago
bytearray has many methods that return a *new* bytearray object rather than applying the transformation to modify the bytearray in place. Given that one use of bytearray's is to avoid making extra copies... There should be in-place variants of the following methods:
ljust lower lstrip rjust rstrip strip swapcase title translate upper
especially so for the methods that don't change the length or move data around such as translate, lower, upper, swapcase and title.
I think you need to make more of a case for 'should'. Bytearrays are, well, byte arrays, not text arrays. I could just as well claim that the ascii text operations should not even be there. They are certainly not needed for typical mixed binary-ascii protocol strings, which is what bytearrars were intended for. In any case, this would fatten the api considerably for not too much benefit. I think that this, like many or most enhancement proposals, should best be discussed on python-ideas list before any tracker action.
I consider translate an exception to the above comments. It is a byte operation, not a text operations. bytearry.translate could plausibly have been defined as 'in-place' when added. Most of the other operations are special cases of translate, and can therefore be done with translate, without the limitation to only ascii chars. If one wants to directly uppercase latin-1 encoded bytes without decoding to text, one would need .translate anyway.
Translate isn't a text operation. (That's the api I wanted). The others I only lists for completeness because someone else would complain if I hadn't. ;) On Mar 1, 2013 12:57 PM, "Terry J. Reedy" \report@bugs.python.org\ wrote:
Terry J. Reedy added the comment:
I think you need to make more of a case for 'should'. Bytearrays are, well, byte arrays, not text arrays. I could just as well claim that the ascii text operations should not even be there. They are certainly not needed for typical mixed binary-ascii protocol strings, which is what bytearrars were intended for. In any case, this would fatten the api considerably for not too much benefit. I think that this, like many or most enhancement proposals, should best be discussed on python-ideas list before any tracker action.
I consider translate an exception to the above comments. It is a byte operation, not a text operations. bytearry.translate could plausibly have been defined as 'in-place' when added. Most of the other operations are special cases of translate, and can therefore be done with translate, without the limitation to only ascii chars. If one wants to directly uppercase latin-1 encoded bytes without decoding to text, one would need .translate anyway.
---------- components: +Library (Lib) -Interpreter Core nosy: +terry.reedy
Python tracker \report@bugs.python.org\ \http://bugs.python.org/issue17301\
How *just and *strip can modify an argument in-place? All other methods except title (lower, swapcase, upper) are partial cases of translate. I.e. you need only in-place translate. However I doubt that difference will be more than several percents. Actually translate uses more tacts per byte than memcpy.
Gregory, if an in-place version of .translate is all you want, only ask for that. Change the title and let us focus on the best case. Forget what others *might* want.
Serhiy, what is 'tacts per byte'?
Sorry, Terry, I was meaning a CPU time.
+1
The important reasons for this are memory use and cache thrashing, not just CPU time.
An mtranslate, or "mutating translate" method for bytearrays. My first C code in a long time, so be gentle. The name is bad, but I don't see a better suggestion below.
I prefer ".inplace_translate" to ".mtranslate", but I'm not sure about it's worth enough to add it to bytearray.
In last year, I proposed adding bytes.frombuffer()
constructor to avoid
unnecessary memory copy.
https://mail.python.org/pipermail/python-dev/2016-October/146668.html
Nick Coghlan againsted for adding methods to builtin type only for low level programming, and suggested to implement it in 3rd party library first. https://mail.python.org/pipermail/python-dev/2016-October/146746.html https://mail.python.org/pipermail/python-dev/2016-October/146763.html
The important reasons for this are memory use and cache thrashing, not just CPU time.
Memory use is not an issue unless you translate hundreds of megabytes at a time. Cache trashing at the end is performance issue.
The original patch is no longer applied cleanly. Here is a patch synchronized with current sources and with fixed one error. I didn't look at it closely and don't know whether it has other bugs.
Here are results of microbenchmarking.
$ ./python -m perf timeit -s "table = bytes(range(256)).swapcase(); data = bytearray(range(256))*1000" -- "data = data.translate(table)"
Median +- std dev: 1.48 ms +- 0.02 ms
$ ./python -m perf timeit -s "table = bytes(range(256)).swapcase(); data = bytearray(range(256))*1000" -- "data[:] = data.translate(table)"
Median +- std dev: 1.60 ms +- 0.09 ms
$ ./python -m perf timeit -s "table = bytes(range(256)).swapcase(); data = bytearray(range(256))*1000" -- "data.mtranslate(table)"
Median +- std dev: 1.79 ms +- 0.07 ms
In-place translate don't have benefits. It is slower that translate with creating a new copy, and even is slower that translate with copying a new copy back.
Not every machine has hundreds of MB of memory to waste. Some devices (IoT, MicroPython, Raspberry Pi, OLPC) have limited resources.
Python interpreter itself takes 20-30 MB of memory. Using it on a machine that has no few tens of free memory doesn't make much sense. MicroPython can be an exception, but it has special modules for low-level programming.
If you need to proceed so large bytearray that creating yet one copy of it is not possible, it can be translated by separate elements or smaller chunks.
$ ./python -m perf timeit -s "table = bytes(range(256)).swapcase(); data = bytearray(range(256))*1000" -- "for i in range(len(data)): data[i] = table[data[i]]"
Median +- std dev: 205 ms +- 10 ms
$ ./python -m perf timeit -s "table = bytes(range(256)).swapcase(); data = bytearray(range(256))*1000" -- "for i in range(0, len(data), 100): data[i:i+100] = data[i:i+100].translate(table)"
Median +- std dev: 12.9 ms +- 0.6 ms
$ ./python -m perf timeit -s "table = bytes(range(256)).swapcase(); data = bytearray(range(256))*1000" -- "for i in range(0, len(data), 1000): data[i:i+1000] = data[i:i+1000].translate(table)"
Median +- std dev: 3.12 ms +- 0.22 ms
$ ./python -m perf timeit -s "table = bytes(range(256)).swapcase(); data = bytearray(range(256))*1000" -- "for i in range(0, len(data), 10000): data[i:i+10000] = data[i:i+10000].translate(table)"
Median +- std dev: 1.79 ms +- 0.14 ms
Translating by chunks also works with data that can't be fit in memory at all. You can mmap large file or read and write it by blocks. In-place bytearray method wouldn't help in this case.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = None closed_at = None created_at =
labels = ['easy', 'type-feature', 'library']
title = 'An in-place version of many bytearray methods is needed'
updated_at =
user = 'https://github.com/gpshead'
```
bugs.python.org fields:
```python
activity =
actor = 'Philip Dye'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'gregory.p.smith'
dependencies = []
files = ['29825', '46137']
hgrepos = []
issue_num = 17301
keywords = ['patch', 'easy']
message_count = 13.0
messages = ['183082', '183292', '183302', '183478', '183502', '183530', '183531', '183534', '186833', '284618', '284625', '284626', '284634']
nosy_count = 11.0
nosy_names = ['terry.reedy', 'gregory.p.smith', 'vstinner', 'christian.heimes', 'ezio.melotti', 'methane', 'THRlWiTi', 'martin.panter', 'serhiy.storchaka', 'n', 'Philip Dye']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue17301'
versions = ['Python 3.4']
```