python / cpython

The Python programming language
https://www.python.org

C API: Consider adding public PyLong_AsByteArray() and PyLong_FromByteArray() functions #111140

Closed vstinner closed 6 months ago

vstinner commented 1 year ago

Feature or enhancement

The private _PyLong_AsByteArray() and _PyLong_FromByteArray() functions were removed in Python 3.13: see PR #108429.

@scoder asked what is the intended replacement for _PyLong_FromByteArray().

The replacement for _PyLong_FromByteArray() is PyObject_CallMethod((PyObject*)&PyLong_Type, "from_bytes", "s#s", str, len, "big"), but I'm not sure what the easy way is to set the signed parameter to True (default: signed=False).

The replacement for _PyLong_AsByteArray() is PyObject_CallMethod(my_int, "to_bytes", "ns", length, "big"). Likewise, I'm not sure how to easily set the signed parameter to True (default: signed=False).
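For reference, the two calls above map onto int.from_bytes() and int.to_bytes() at the Python level, where signed is a keyword-only argument. A minimal sketch of the behaviour in question (the sample value and buffer length are chosen only for illustration):

```python
# Round-trip a negative int through bytes using the Python-level methods
# that the PyObject_CallMethod() calls above would invoke.
value = -2

# signed=True must be passed by keyword; it selects two's complement.
data = value.to_bytes(4, "big", signed=True)
restored = int.from_bytes(data, "big", signed=True)

# With the default signed=False, a negative value is rejected.
try:
    value.to_bytes(4, "big")
    unsigned_ok = True
except OverflowError:
    unsigned_ok = False
```

Because signed can only be passed by keyword, a C caller would presumably have to build a kwargs dict and go through PyObject_Call() or similar, which is the awkwardness described above.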

I propose to add public PyLong_AsByteArray() and PyLong_FromByteArray() functions to the C API.

Python 3.12 modified PyLongObject: it's no longer a simple array of digits, but a less straightforward _PyLongValue structure which requires using unstable functions to access small "compact" values.

So having a reliable and simple way to import/export a Python int object as bytes became even more important.


A code search for _PyLong_AsByteArray in PyPI top 5,000 projects found 12 projects using it:


scoder commented 8 months ago

There are good use cases for both – reading the signed (actual) value of a PyLong and reading its abs() value. Given that the internal representation makes the latter very efficient, it seems tempting to pass that efficiency on to the users.

Maybe the unsigned function could have a second return argument pointer that gives the sign as 1 or -1? Users can pass NULL if they know the sign, and look at the returned sign if they don't. Raising an exception when sign and expectation diverge is easy enough to leave to users.

zooba commented 8 months ago

> reading its abs() value

This is yet another case that nobody has mentioned previously, though I agree it's temptingly cheap to expose (at least for large values - anything less than Py_ssize_t is going to involve a comparison and negation - and of course performance characteristics can change over time, which is why we're exposing an API that isn't based on the internal representation in the first place).

But (unsigned)-2147467259 is not the same as abs(-2147467259), which is why having the C-style conversion is useful.
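To make the difference concrete, here is a small sketch that models the C-style conversion by masking to 32 bits (the 32-bit width is an assumption about the C int being cast):

```python
value = -2147467259

# What (unsigned)value yields for a 32-bit int: the same bit pattern,
# reinterpreted as a non-negative number.
c_style = value & 0xFFFFFFFF

# The magnitude is a different number with a different bit pattern.
magnitude = abs(value)
```

Here c_style is 2147500037 (0x80004005) while magnitude is 2147467259, so an API that only exposes abs() cannot stand in for the C-style conversion.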

What's really the problem here is extracting values that require all the available bits to provide full fidelity. So one solution could be an optional out parameter that returns true iff a zero sign bit was the only bit that couldn't be copied. So then the checks become:

If you're going to treat the value as signed, you only need to check res <= sizeof(target), as today. (If you want the sign, look at the MSB of the result.)

If you're going to treat the result as unsigned, you also allow sign_bit_overflow_only == 1[^2] (noting that negative input values will never need to set this new flag[^1]).

[^1]: Note that negative values cannot overflow by just the sign bit. At least one leading 1 bit has to remain, and so (signed)0x...FFFF_7FFF cannot fit into 16 bits with a sign bit overflow. (signed)0x...FFFF_FFFF can always fit into 16 bits - the problem is that (unsigned)0xFFFF can also fit into 16 bits but we need to know that the 17th bit would've been zero.

[^2]: You wouldn't really need to check res > sizeof(target) here, but we should specify that the flag is only set when a signed overflow occurred.
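The proposed flag can be modelled in a few lines of Python. This is only a sketch of the idea under discussion: export_model and its return tuple are hypothetical names, not the CPython implementation or its signature.

```python
def export_model(value: int, size: int):
    """Hypothetical model of the proposal; assumes size >= 1.

    Returns (buffer, required_size, sign_bit_overflow_only).
    """
    nbits = size * 8
    # Bits needed besides the sign bit in two's complement.
    magnitude_bits = value.bit_length() if value >= 0 else (~value).bit_length()
    # Bytes needed to hold the value including its sign bit.
    required_size = (magnitude_bits + 1 + 7) // 8
    # Keep only the low `size` bytes, as a C-style conversion would.
    buffer = (value & ((1 << nbits) - 1)).to_bytes(size, "little")
    # True only when a non-negative value lost nothing but its zero sign bit.
    sign_bit_overflow_only = value >= 0 and magnitude_bits == nbits
    return buffer, required_size, sign_bit_overflow_only
```

With a 2-byte target, 0xFFFF reports a required size of 3 with the flag set, 0x1FFFF reports 3 with the flag clear, and -0x8001 (the 0x...FFFF_7FFF pattern) reports 3 with the flag clear, matching the first footnote.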

encukou commented 8 months ago

> reading its abs() value

Without knowing the use cases my first reaction is that we need to draw the line somewhere, and I'd be OK with this joining exporting non-byte digits on the other side of the line.

> What's really the problem here is extracting values that require all the available bits to provide full fidelity.

Which is fairly important since C-ish APIs tend to smuggle unrelated information in high bits. If everyone used ints for actual counting, we wouldn't be here :)

> [scoder] Maybe the unsigned function could have a second return argument pointer that gives the sign as 1 or -1?
>
> [zooba] one solution could be an optional out parameter that returns true iff a zero sign bit was the only bit that couldn't be copied

Or a nullable char *sign_out argument:

zooba commented 8 months ago

Again, it's not negative values that are the tricky problem - it's large positive values. By excluding negatives, all you're doing is annoying the caller.

The extra information that has to be returned is not "was the input positive or negative", it's "was the only information that was lost the sign bit". And this is only relevant for positive values because if you omit the sign bit from a negative value in two's complement, you change the value and so have always lost more than just the sign bit.

I updated #116053 last night to do the extra checks we need, and there are comments where that extra information needs to be returned when we figure out how best to do it (either by returning it, or by taking a flag that says to assume it).

encukou commented 8 months ago

Ah, I think I finally get it. There are 3 cases:

Is that right?

In my suggestion, I was thinking about a new function -- PyLong_AsUnsignedNativeBytes(..., char *sign_out) -- to cover the last two.

zooba commented 8 months ago

If we have a new function, that's signal enough that we don't need to give it an extra argument. When you're calling that function, if the input was positive but the resulting MSB is set, we don't care (provided nothing higher than the MSB needed to be set).

I don't honestly see the benefit in rejecting negatives. The same rule applies - if the input was negative, provided the MSB is set (no information loss) and everything higher than the MSB would be set (sign extension), we can return success (which is the same as for the signed case). I'd rather just add a function for getting the sign from the PyLongObject so that people who want to reject negatives can do it, but I wouldn't want to conflate it with choosing between AsNativeBytes and AsUnsignedNativeBytes.

The range check (sign check) is to do with business logic, not with the binary representation.
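The acceptance rule described in this comment can be sketched as two range checks. The function names below are hypothetical and only model the rule as stated here, not the behaviour of any shipped API:

```python
def fits_signed(value: int, size: int) -> bool:
    """True if value fits in `size` bytes of two's complement."""
    nbits = size * 8
    return -(1 << (nbits - 1)) <= value < (1 << (nbits - 1))


def fits_unsigned_variant(value: int, size: int) -> bool:
    """Model of the rule for the proposed unsigned variant."""
    nbits = size * 8
    if value >= 0:
        # Positive input may set the MSB, but nothing above it.
        return value < (1 << nbits)
    # Negative input follows the same rule as the signed case.
    return fits_signed(value, size)
```

Under this model a 1-byte buffer accepts everything from -128 to 255, and rejecting negatives is a separate decision left to the caller.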

encukou commented 8 months ago

So, writing 255 and -1 into a 1-byte buffer would have the same result -- all bits set, which PyLong_FromUnsignedNativeBytes would turn into 255. Right?

IMO, accepting negatives in AsUnsignedNativeBytes is a footgun that at least needs a prominent note in the docs. I see it as perpetuating C's mistakes. But, I can see where you're coming from, and I can live with our difference in opinions.

zooba commented 8 months ago

> all bits set, which PyLong_FromUnsignedNativeBytes would turn into 255. Right?

and PyLong_FromNativeBytes would turn into -1. Right.
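The same ambiguity can be seen with the Python-level methods, used here as a stand-in for the C functions being discussed:

```python
# 255 written as unsigned and -1 written as signed give the same byte.
as_unsigned = (255).to_bytes(1, "little")
as_signed = (-1).to_bytes(1, "little", signed=True)

# Reading that byte back depends entirely on the interpretation chosen.
back_unsigned = int.from_bytes(as_unsigned, "little")
back_signed = int.from_bytes(as_signed, "little", signed=True)
```

Both buffers are b"\xff"; the unsigned read gives 255 and the signed read gives -1.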

> IMO, accepting negatives in AsUnsignedNativeBytes is a footgun that at least needs a prominent note in the docs.

This is fine, but my counterpoint is that there's no other way to do it in our C API (and the way to do it in Python is to & 0xFFFFF...., which is a pain to do dynamically). So if we cut it off, we force users into complex workarounds, whereas if we allow it then it becomes possible.

And I think the documentation for this makes the most sense framed as "behaves like AsNativeBytes but assumes the result will be used as unsigned, and so does not require positive input values to leave the most significant bit clear. This may result in large positive inputs being indistinguishable from some negative inputs. To exclude negative inputs, first test the sign with <new API>".
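For comparison, the Python-level masking workaround mentioned in this comment might look like the sketch below. The helper name to_unsigned_bytes is made up for illustration, and the mask has to be built from the target size at run time:

```python
def to_unsigned_bytes(value: int, size: int, byteorder: str = "little") -> bytes:
    """Hypothetical helper: C-style unsigned conversion via masking.

    Note that masking silently discards any bits above `size` bytes.
    """
    mask = (1 << (size * 8)) - 1
    return (value & mask).to_bytes(size, byteorder)
```

For example, to_unsigned_bytes(-2147467259, 4, "big") yields the bytes 80 00 40 05, and to_unsigned_bytes(-1, 1) yields a single 0xFF byte.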

gpshead commented 8 months ago

I like the direction this is going, yes, that is the way I was hoping an Unsigned API variant would behave. I do think it is useful to have a way to return that the value was negative. Petr's char *sign_out idea makes sense to me there, always fill that in with 0 or -1 if it is non-NULL.

scoder commented 6 months ago

The interface seems complete and usable now. Is this done now or is there anything left for this ticket to stay open?

gpshead commented 6 months ago

Looking things over, I like the C API that was settled upon. It seems to address all of the needs from our earlier discussions.