numpy / numpy

The fundamental package for scientific computing with Python.
https://numpy.org

BUG: bytes(arr) != arr.tobytes() for 0-d arrays #22782

Open alanhdu opened 1 year ago

alanhdu commented 1 year ago

Describe the issue:

Perhaps this is a false assumption on my part, but I assumed that for any NumPy array, arr.tobytes() and bytes(arr) would always return the same thing. This does not seem to be true for 0-dimensional arrays. In this case, bytes(arr) returns an empty bytestring, while arr.tobytes() returns the raw bytes of the element. I personally find the latter behavior more intuitive, but I think the two should probably be consistent.

Reproduce the code example:

import numpy as np

arr = np.array(0)
assert arr.tobytes() == bytes(arr)

Error message:

AssertionError
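
To make the mismatch concrete, here is a minimal sketch of what each call returns. It assumes an integer dtype and a NumPy version where ndarray does not define __bytes__ (as in the 1.23.5 reported here); the variable names are only illustrative.

```python
import numpy as np

zero = np.array(0)   # 0-d integer array holding 0
three = np.array(3)  # 0-d integer array holding 3

# bytes() treats a 0-d integer array like the int n it holds
# and returns n zero bytes, exactly as bytes(n) would.
print(bytes(zero))   # b''
print(bytes(three))  # b'\x00\x00\x00'

# tobytes() returns the raw memory of the single element instead,
# so its length is the itemsize of the dtype.
print(len(three.tobytes()) == three.dtype.itemsize)  # True

# For arrays with one or more dimensions, the two spellings agree.
vec = np.array([3, 4])
print(bytes(vec) == vec.tobytes())  # True
```

So the assertion above fails because bytes(arr) is b'' while arr.tobytes() holds itemsize bytes.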

NumPy/Python version information:

NumPy: 1.23.5 
Python: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:26:04) [GCC 10.4.0]

Context for the issue:

No response

seberg commented 1 year ago

It seems Python's bytes() constructor checks whether the object is an integer (via obj.__index__()) before trying to convert it as a bytes-like object. So there is a bit of Python involvement here, because bytes() is heavily overloaded.

However, bytes() first checks for the __bytes__() dunder-method. So, NumPy could implement __bytes__() to ensure the .tobytes() meaning here.
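
The lookup order described above can be sketched in pure Python, without NumPy. IndexOnly and WithBytes are hypothetical stand-ins for an integer-like object, not NumPy classes, and the 8-byte little-endian encoding is just an assumption chosen to resemble an int64.

```python
class IndexOnly:
    """Hypothetical integer-like object that defines only __index__."""

    def __init__(self, value):
        self.value = value

    def __index__(self):
        return self.value


class WithBytes(IndexOnly):
    """Same object, but it also defines __bytes__."""

    def __bytes__(self):
        # Assumed encoding: 8 bytes, little-endian, signed.
        return self.value.to_bytes(8, "little", signed=True)


# Without __bytes__, bytes() falls back to __index__ and
# returns that many zero bytes.
index_only = bytes(IndexOnly(2))
print(index_only)  # b'\x00\x00'

# With __bytes__, that method is consulted first and __index__ is ignored.
with_bytes = bytes(WithBytes(2))
print(with_bytes)  # b'\x02\x00\x00\x00\x00\x00\x00\x00'
```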

The main weird thing may be what to do about np.int64(0) (the scalar) if we do that.

ColinPeppler commented 1 year ago

Just pointing out that bytearray(arr) gives the same output as bytes(arr). But I don't believe bytearray relies on __bytes__ like bytes does (see). If that's right, adding __bytes__ would create an inconsistency between the two.
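
A small pure-Python sketch of that inconsistency, using a hypothetical Value class that defines both __index__ and __bytes__ (this reflects CPython's current behavior as far as I know, and is not NumPy code):

```python
class Value:
    """Hypothetical integer-like object with both dunder methods."""

    def __init__(self, value):
        self.value = value

    def __index__(self):
        return self.value

    def __bytes__(self):
        return b"custom"


v = Value(3)

# bytes() honors __bytes__.
as_bytes = bytes(v)
print(as_bytes)  # b'custom'

# bytearray() does not look up __bytes__; it uses __index__
# and allocates that many zero bytes.
as_bytearray = bytearray(v)
print(as_bytearray)  # bytearray(b'\x00\x00\x00')
```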

miccoli commented 1 year ago

I would say that this is expected behaviour, not a bug.

Since both NumPy scalars and NumPy 0-d arrays support memoryview (i.e. they expose the buffer protocol), the correct invariants are written as

>>> import numpy as np
>>> arr = np.array(0)
>>> scal = np.int64(1)
>>> assert bytes(memoryview(arr)) == arr.tobytes()
>>> assert bytes(memoryview(scal)) == scal.tobytes()

In contrast, Python integers do not support memoryview:

>>> memoryview(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: memoryview: a bytes-like object is required, not 'int'
>>> memoryview(arr.item())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: memoryview: a bytes-like object is required, not 'int'
>>> memoryview(scal.item())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: memoryview: a bytes-like object is required, not 'int'

Given the current semantics of bytes and bytearray, it would be terribly wrong to try to fix this from the numpy side.

Maybe one could argue with the Python devs that if an object exposes the buffer protocol, this should take precedence over the “number” meaning... but currently it works the opposite way.