python / cpython

The Python programming language
https://www.python.org

Copying bytes object to shared memory list truncates trailing zeros #106939

Open pinkhamr-fb opened 1 year ago

pinkhamr-fb commented 1 year ago

Bug report

tl;dr: See the Stack Overflow post.

When a bytes object is copied into a ShareableList, its trailing zero bytes are stripped when the value is read back, causing data loss. As far as I can tell this is not documented, and it appears to be unexpected behavior that falls out of the implementation.

Example code:

from multiprocessing import shared_memory as shm

shmList = shm.ShareableList([bytes(50)])
testBytes = bytes.fromhex("00112233445566778899aabbccddeeff0000")

shmList[0] = testBytes
print(testBytes)
print(shmList[0])

shmList.shm.close()
shmList.shm.unlink()

Output:

b'\x00\x11"3DUfw\x88\x99\xaa\xbb\xcc\xdd\xee\xff\x00\x00'
b'\x00\x11"3DUfw\x88\x99\xaa\xbb\xcc\xdd\xee\xff'

Offending portion of CPython code:

_back_transforms_mapping = {
    0: lambda value: value,                   # int, float, bool
    1: lambda value: value.rstrip(b'\x00').decode(_encoding),  # str
    2: lambda value: value.rstrip(b'\x00'),   # bytes
    3: lambda _value: None,                   # None
}


corona10 commented 1 year ago

@pitrou @gpshead Would you like to take a look at this issue?

gpshead commented 1 year ago

While that is "surprising" behavior, the implementation of that ShareableList does not appear to make good guarantees.

  1. [x] We should document this behavior / bug today for existing <=3.12 releases. Regardless of whether we backport a bugfix, people need to know that their code may be running on impacted versions.
  2. [ ] We should fix this for any impacted types for 3.13+. Both str and bytes are impacted. Trailing \x00 characters are valid in both.

Workaround: unconditionally append a single non-0 character or byte to any shared data when putting items in and unconditionally ignore the final character (truncation or memoryview) on the consuming side.
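The workaround above could be sketched roughly as follows. This is a minimal illustration, not code from CPython: the helper names `wrap`, `unwrap`, and `roundtrip`, and the choice of `b"\x01"` as the sentinel, are assumptions made for the example.

```python
from multiprocessing import shared_memory as shm

# Hypothetical sentinel: any single non-zero byte would do.
SENTINEL = b"\x01"


def wrap(payload: bytes) -> bytes:
    """Producer side: always append the sentinel before storing."""
    return payload + SENTINEL


def unwrap(stored: bytes) -> bytes:
    """Consumer side: always drop the final byte after reading."""
    return stored[:-1]


def roundtrip(payload: bytes, capacity: int = 50) -> bytes:
    """Store a payload in a ShareableList slot and read it back."""
    shm_list = shm.ShareableList([bytes(capacity)])
    try:
        shm_list[0] = wrap(payload)
        return unwrap(shm_list[0])
    finally:
        # Release the shared memory block even if something fails.
        shm_list.shm.close()
        shm_list.shm.unlink()


test_bytes = bytes.fromhex("00112233445566778899aabbccddeeff0000")
print(roundtrip(test_bytes) == test_bytes)
```

Because the last stored byte is never zero, stripping trailing zeros cannot reach the payload. The cost is one extra byte per item, which has to fit within the slot's capacity.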

There are other constraints worth documenting as well. Those "int"s are a maximum of 8 bytes, struct-packed, and the documentation does not specify whether they are signed. https://docs.python.org/3/library/multiprocessing.shared_memory.html#multiprocessing.shared_memory.ShareableList needs improvement.
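To make the 8-byte integer constraint concrete, here is a small sketch using the `struct` module directly. It assumes that ints are packed with the signed 8-byte format `"q"`; that format code is an assumption for this example and is not stated in the comment above. `fits` is a hypothetical helper.

```python
import struct

# Assumption: 8-byte signed packing ("q"). If the real format were
# unsigned ("Q"), the representable range would differ.
INT_FORMAT = "q"


def fits(value: int) -> bool:
    """Return True if value can be packed into 8 bytes with INT_FORMAT."""
    try:
        struct.pack(INT_FORMAT, value)
        return True
    except (struct.error, OverflowError):
        return False


print(fits(2**63 - 1))   # largest signed 64-bit value
print(fits(2**63))       # one past the signed range
print(fits(-2**63))      # smallest signed 64-bit value
```

Under that assumption, values outside the range -2**63 to 2**63 - 1 cannot be stored, which is the kind of limit the documentation could state explicitly.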

pinkhamr-fb commented 1 year ago

FWIW, the workaround you proposed is what I ended up doing in my code to get around this.

zdelv commented 10 months ago

I'm willing to work on a fix for this. Is building the workaround mentioned above into ShareableList considered an acceptable solution, or are we looking for something more involved?

To me, it seems like the issue is that we're padding all str and bytes values to an 8-byte alignment, but we're forgetting to save the actual data length. Adding a sentinel value to the end of the str or bytes (like the workaround does) seems like the most reasonable way to fix it without changing the underlying encoding to add the actual data length.
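The padding problem described above can be reproduced without shared memory at all. The sketch below is a simplified model, not the CPython implementation: `pack_padded` and `read_back` are hypothetical helpers, and the rounding rule is an assumption chosen so that every value gets padded to a multiple of 8 bytes.

```python
import struct

ALIGNMENT = 8  # assumed 8-byte alignment, as described above


def pack_padded(data: bytes) -> bytes:
    """Pack data into a zero-padded field; the true length is not stored."""
    padded_len = ALIGNMENT * (len(data) // ALIGNMENT + 1)
    return struct.pack(f"{padded_len}s", data)


def read_back(packed: bytes) -> bytes:
    """Recover the value by stripping zeros, as the back-transform does."""
    return packed.rstrip(b"\x00")


data = bytes.fromhex("00112233445566778899aabbccddeeff0000")
packed = pack_padded(data)

# Padding zeros and payload zeros are indistinguishable once packed.
print(len(data), len(packed))
print(read_back(packed) == data)

# With a non-zero sentinel appended, the payload's own zeros survive.
with_sentinel = read_back(pack_padded(data + b"\x01"))[:-1]
print(with_sentinel == data)
```

Since the stored field carries no length, stripping zeros removes the payload's trailing zeros together with the padding. A sentinel marks where the payload ends without changing the field layout.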