python / cpython

The Python programming language
https://www.python.org

The C-API for Python to C integer conversion is, to be frank, a mess. #102471

Open markshannon opened 1 year ago

markshannon commented 1 year ago

The C-API has built up over 30 years, in a haphazard way. So, it is no surprise that it is a bit of a mess. What makes it worse is that it is based around the C long type, which varies in size between architectures and operating systems in odd ways. C longs are 32 bit on (almost?) all 32 bit machines and 64 bit on most 64 bit machines, except on Windows, where C longs are 32 bits even on 64 bit machines. In other words, it is not a useful fixed size, like int32_t, nor does it match the machine word size, like intptr_t.
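
As a rough illustration of the point about long, the sketch below reports the width of each type on whatever platform it is compiled for. The function names are made up for this example and are not part of any API. On typical 64 bit Linux or macOS builds long and intptr_t are both 64 bits, while on 64 bit Windows long stays at 32 bits.

```c
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative helpers only: width of each type, in bits, on this platform. */
int long_bits(void)   { return (int)(sizeof(long) * CHAR_BIT); }
int int32_bits(void)  { return (int)(sizeof(int32_t) * CHAR_BIT); }
int intptr_bits(void) { return (int)(sizeof(intptr_t) * CHAR_BIT); }

/* Print the three widths side by side; only int32_t is the same everywhere. */
void print_widths(void)
{
    printf("long: %d bits, int32_t: %d bits, intptr_t: %d bits\n",
           long_bits(), int32_bits(), intptr_bits());
}
```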

We need a more consistent API for converting from Python integers to C integers and back again. We should support both 32 bit and word size C integers. 32 bit, because we often want to store 32 bit values to save space on 64 bit machines, or for portability. We also want to support word size integers for performance and ease of coding.

This means we want 4 functions (2 sizes, 2 directions) to convert between C and Python integers.

Currently we have:

Width        | Py -> C  | C -> Py
32 bit       | Missing* | Missing
Machine word | Missing* | PyLong_FromSsize_t

* The C API has a function to convert Python ints to intptr_t, but it is missing efficient overflow handling. It also has a function with efficient overflow handling, PyLong_AsLongAndOverflow, but that returns a long.
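
To make "efficient overflow handling" concrete, here is a minimal sketch of the same reporting convention that PyLong_AsLongAndOverflow uses (an out-parameter set to +1, -1 or 0, with -1 returned on overflow), applied to a 32 bit target. The helper narrow_to_int32 is hypothetical and works on a plain C integer, not on a Python object. The point is only that the caller can detect overflow with a cheap flag check instead of raising and clearing an exception.

```c
#include <stdint.h>

/* Hypothetical helper, not a CPython function: narrow a wide C integer to
   int32_t. On overflow, set *overflow to +1 (too large) or -1 (too small)
   and return -1; otherwise set *overflow to 0 and return the value. */
int32_t narrow_to_int32(long long value, int *overflow)
{
    if (value > INT32_MAX) {
        *overflow = 1;
        return -1;
    }
    if (value < INT32_MIN) {
        *overflow = -1;
        return -1;
    }
    *overflow = 0;
    return (int32_t)value;
}
```

Because -1 is also a valid result, the caller has to check the flag rather than the return value to tell the two cases apart.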

Here's what we want:

Width        | Py -> C         | C -> Py
32 bit       | PyInt_AsInt32   | PyInt_FromInt32
Machine word | PyInt_AsSsize_t | PyInt_FromSsize_t

I'm using the PyInt prefix, now that Python 2 is history. It makes it clearer which functions belong to the new API.

Note that I'm not handling unsigned values. I think the extra bit of precision is not worth the complexity of a larger API. And if we decide that it is, we can always add unsigned variants later.

markshannon commented 1 year ago

We also need a few functions for querying and extracting the value of a Python int.

We want to query its sign:

int PyInt_IsNegative();
int PyInt_IsPositive();
int PyInt_IsZero();
int PyInt_Sign();

We want to import and export the digits of an integer, and to know how many digits there are. GNU's MP library has mpz_import and mpz_export, which have quite a complex API, but might be a good model to use. In addition we should provide a constant describing the "native" number of bits per digit, so that C extensions can extract the data efficiently.

markshannon commented 1 year ago

mpz_import and mpz_export take 6 parameters each, and four of those are small numbers describing the layout. Having many int parameters is hard to read and error-prone. We should combine the layout parameters into a single struct (of 32 bits or less).

E.g.

typedef struct _PyIntExportLayout {
     uint8_t bits_per_digit;
     int8_t word_endian;
     int8_t array_endian;
     uint8_t digit_size;
} PyIntExportLayout;

PyLongObject *PyInt_Import(PyIntExportLayout layout, size_t count, const void *data);
int PyInt_Import(PyLongObject *op, PyIntExportLayout layout, size_t count, void *data);
size_t PyInt_DigitCount(PyLongObject *op, uint8_t bits_per_digit);
const PyIntExportLayout PY_INT_NATIVE_LAYOUT; /* Use this when possible, for speed */
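
As a rough sketch of what a digit-based export and import could look like, the code below splits a 64 bit value into digits of a chosen width and reassembles it. Everything here is an assumption made for illustration: the names DigitLayout, digit_count, export_digits and import_digits are hypothetical, only bits_per_digit is modelled (endianness and digit size are fixed), and a uint64_t stands in for the Python integer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified layout: only bits_per_digit varies (1..32).
   Digits are stored least significant first, one digit per uint32_t. */
typedef struct {
    uint8_t bits_per_digit;
} DigitLayout;

/* Number of digits needed for `value`; zero still occupies one digit. */
size_t digit_count(uint64_t value, DigitLayout layout)
{
    size_t count = 1;
    while ((value >>= layout.bits_per_digit) != 0) {
        count++;
    }
    return count;
}

/* Write the digits of `value` into `digits`. Returns the number of digits
   written, or 0 if `capacity` is too small (nothing is written then). */
size_t export_digits(uint64_t value, DigitLayout layout,
                     uint32_t *digits, size_t capacity)
{
    uint64_t mask = (UINT64_C(1) << layout.bits_per_digit) - 1;
    size_t needed = digit_count(value, layout);
    if (needed > capacity) {
        return 0;
    }
    for (size_t i = 0; i < needed; i++) {
        digits[i] = (uint32_t)(value & mask);
        value >>= layout.bits_per_digit;
    }
    return needed;
}

/* Rebuild a value from `count` digits, most significant digit first.
   Assumes the result fits in 64 bits. */
uint64_t import_digits(const uint32_t *digits, size_t count, DigitLayout layout)
{
    uint64_t value = 0;
    for (size_t i = count; i-- > 0; ) {
        value = (value << layout.bits_per_digit) | digits[i];
    }
    return value;
}
```

With bits_per_digit set to 30, the digit size CPython commonly uses internally, the value 2^30 + 5 exports as the two digits 5 and 1. Having the caller ask for the digit count first and then supply the buffer keeps memory ownership on the caller's side.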

casevh commented 1 year ago

Hi. I'm the primary maintainer of gmpy2. I'd like to provide some comments with my experiences using the C-API.

I use PyLong_AsLongAndOverflow when I want either to get a long value or, failing that, to proceed immediately with the full conversion of the PyLong to mpz. Avoiding the exception is a significant performance improvement. PyLong_AsUnsignedLongAndOverflow is used occasionally when GMP expects an unsigned long.

PyLong_As[Unsigned]LongLongAndOverflow were used with MPIR to get 64-bit values on Windows. (MPIR extended GMP to support 64-bit native integer sizes.) gmpy2 doesn't currently use them but it would be nice if they could be kept.

I like your PyIntExportLayout idea for specifying the layout. I have a question about the usage of PyInt_Import: which side owns the conversion?

Is PyInt_Import intended to access external data (i.e. the mpz data) and create a PyLong? Does PyIntExportLayout then specify the format of the mpz data?

Would there be a corresponding PyInt_Export that exports the value of a PyLong into an external buffer with the format of the external buffer controlled by PyIntExportLayout? If so, who owns (CPython versus gmpy2) the memory allocated to the external buffer? (Note: GMP, MPFR, and MPC can use a different memory manager than CPython....)

This is reversed from the current conversion direction. For mpz to PyLong, gmpy2 asks CPython to create a new PyLong with sufficient space to store the output of mpz_export. And for PyLong to mpz, gmpy2 creates a new mpz with sufficient space to store the output of mpz_import.

I'll add another comment to the thread about the compact format.

Thanks for all the effort in improving CPython.

casevh

vstinner commented 4 months ago

32 bit PyInt_AsInt32 PyInt_FromInt32

I created https://github.com/python/cpython/pull/120390 for that.

serhiy-storchaka commented 4 months ago

I had plans to add PyLong_Import() and PyLong_Export() with GMP/libtommath inspired signatures. This is a very general interface, which allows supporting many different representations.

skirpichev commented 4 months ago

This is a very general interface, which allows supporting many different representations.

This is a relatively complex task, which is better suited to dedicated libraries. I would be rather surprised if an arbitrary precision math library lacked mpz_import/export-like functions.

If the CPython side provides a "view" of an integer as an array of digits, any math library could do the rest of the work.

serhiy-storchaka commented 4 months ago

Then please use different names than PyLong_Import()/PyLong_Export().

vstinner commented 2 months ago

We need a more consistent API for converting from Python integers to C integers and back again. We should support both 32 bit and word size C integers. 32 bit, because we often want to store 32 bit values to save space on 64 bit machines, or for portability. We also want to support word size integers for performance and ease of coding.

I added APIs for that with https://github.com/python/cpython/commit/4c6dca82925bd4be376a3e4a53c8104ad0b0cb5f:

vstinner commented 2 months ago

We want to query its sign: int PyInt_Sign();

PyLong_GetSign() was added to Python 3.14: https://docs.python.org/dev/c-api/long.html#c.PyLong_GetSign

int PyInt_IsNegative(); int PyInt_IsPositive(); int PyInt_IsZero();

There is an open discussion for these functions: https://github.com/capi-workgroup/decisions/issues/29