naturalsize np.int32 multiplication overflow

Toprak2 commented 1 day ago

What did you do?

I was using the torchio library, which relies on humanize to return the memory size of image arrays. The images I processed had dimensions 512x512x166, with each pixel being a 32-bit (4-byte) integer.

What did you expect to happen?

torchio calls the naturalsize function with the occupied memory size in bytes and sets binary=True. Manually calculating the expected value: 512×512×166×4÷(1024×1024)=166 MiB

So, I expected the function to return approximately 166 MiB.

What actually happened?

Instead, the returned value was -2 MiB, accompanied by this warning:

RuntimeWarning: overflow encountered in scalar multiply
ret: str = format % ((base * bytes_ / unit)) + s

This overflow occurs because the input to the function was of type np.int32 instead of Python's native int. Since np.int32 has a maximum value of 2^31−1, the multiplication of base and bytes_ results in an overflow.

Steps to Reproduce


import humanize
import numpy as np

print(humanize.naturalsize(512*512*166*4, binary=True))  
# Expected: 166.0 MiB
# Works as expected with Python’s built-in int type

print(humanize.naturalsize(np.int32(512*512*166*4), binary=True))
# Returns: -2.0 MiB
# RuntimeWarning: overflow encountered in scalar multiply ret: str = format % ((base * bytes_ / unit)) + s

Proposed Solutions

1) Change the Order of Operations Adjusting the order of operations can avoid overflow. In the current line:


ret: str = format % ((base * bytes_ / unit)) + s

when bytes is np.int32, multiplying base and bytes produces an np.int32 result, which overflows before it’s divided by unit. By dividing either base or bytes_ by unit before the multiplication, each sub-operation remains a float:


ret: str = format % ((base * (bytes_ / unit))) + s
# or
ret: str = format % ((base / unit * bytes_)) + s

2) Convert Input to Float Alternatively, cast value to float without checking its type. Currently, the casting applies only if value is a string:


# Current approach
if isinstance(value, str):
    bytes_ = float(value)
else:
    bytes_ = value

Updating it to cast all inputs to float could resolve the issue:


bytes_ = float(value)

I haven’t created a pull request since I’m unsure if you’d prefer developers to ensure input compatibility or handle this within the function. Environment

OS: Windows 11
Python: 3.12.1
Humanize: 4.11.0
Numpy: 1.26.3

hugovk commented 9 hours ago

Thank you for the well explained report!

As you suggest, we could say that np.int32 inputs are not supported and the caller should make sure they're using an int, float or str, as documented, but the suggested fixes appear quite straightforward. I'll reserve the right to revert if it causes problems in the future :)

Let's go for option 1, because option 2 fails this test:

    def test_naturalsize(test_args: list[int] | list[int | bool], expected: str) -> None:
>       assert humanize.naturalsize(*test_args) == expected
E       AssertionError: assert '1000.0 ZB' == '1.0 YB'
E
E         - 1.0 YB
E         + 1000.0 ZB

Would you like to open a PR?

Toprak2 commented 8 hours ago

Thank you for reviewing the issue quickly.

I’ll proceed with implementing option 1 and will create a PR soon. If there are any additional tests you’d like me to add, please let me know.

hugovk commented 8 hours ago

Yeah, I was considering if we should add a test using NumPy, but np.int32 isn't really a supported input, and installing NumPy can be tricky during the Python pre-release phase when there's no wheel, so maybe not.

python-humanize / humanize