python / cpython


C API for appending to arrays #49391

Open 61a746e0-56f6-4f74-bbde-a98c5612db23 opened 15 years ago

61a746e0-56f6-4f74-bbde-a98c5612db23 commented 15 years ago
BPO 5141
Nosy @pitrou, @hniksic, @websurfer5

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['expert-C-API', 'type-feature', 'library']
title = 'C API for appending to arrays'
updated_at =
user = 'https://github.com/hniksic'
```

bugs.python.org fields:
```python
activity =
actor = 'Jeffrey.Kintscher'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)', 'C API']
creation =
creator = 'hniksic'
dependencies = []
files = []
hgrepos = []
issue_num = 5141
keywords = []
message_count = 6.0
messages = ['81039', '81168', '81189', '87828', '87838', '87883']
nosy_count = 6.0
nosy_names = ['ggenellina', 'kxroberto', 'pitrou', 'hniksic', 'bfroehle', 'Jeffrey.Kintscher']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue5141'
versions = []
```

61a746e0-56f6-4f74-bbde-a98c5612db23 commented 15 years ago

The array.array type is an excellent type for storing a large amount of "native" elements, such as integers, chars, doubles, etc., without involving the heavy machinery of numpy. It's both blazingly fast and reasonably efficient with memory. The one thing missing from the array module is the ability to directly access array values from C.

This might seem superfluous, as it's perfectly possible to manipulate array contents from Python/C using PyObject_CallMethod and friends. The problem is that it requires the native values to be marshalled to Python objects, only to be immediately converted back to native values by the array code. This can be a problem when, for example, a numeric array needs to be filled with contents, such as in this hypothetical example:

/* error checking and refcounting subtleties omitted for brevity */
PyObject *load_data(Source *src)
{
   PyObject *array_type = get_array_type();
   PyObject *array = PyObject_CallFunction(array_type, "c", 'd');
   PyObject *append = PyObject_GetAttrString(array, "append");
   while (!source_done(src)) {
     double num = source_next(src);
     PyObject *f = PyFloat_FromDouble(num);
     PyObject *ret = PyObject_CallFunctionObjArgs(append, f, NULL);
     if (!ret)
       return NULL;
     Py_DECREF(ret);
     Py_DECREF(f);
   }
   Py_DECREF(array_type);
   return array;
}

The inner loop must convert each C double to a Python Float, only for the array to immediately extract the double back from the Float and store it into the underlying array of C doubles. This may seem like a nitpick, but it turns out that more than half of the time of this function is spent creating and deleting those short-lived floating-point objects.

Float creation is already well-optimized, so opportunities for speedup lie elsewhere. The array object exposes a writable buffer, which can be used to store values directly. For test purposes I created a faster "append" specialized for doubles, defined like this:

int array_append(PyObject *array, PyObject *appendfun, double val)
{
   PyObject *ret;
   double *buf;
   Py_ssize_t bufsize;
   static PyObject *zero;
   if (!zero) {
     zero = PyFloat_FromDouble(0);
     if (!zero)
       return -1;
   }

   // append dummy zero value, created only once
   ret = PyObject_CallFunctionObjArgs(appendfun, zero, NULL);
   if (!ret)
     return -1;
   Py_DECREF(ret);

   // append the element directly at the end of the C buffer
   if (PyObject_AsWriteBuffer(array, (void **) &buf, &bufsize) < 0)
     return -1;
   buf[bufsize / sizeof(double) - 1] = val;
   return 0;
}

This hack actually speeds up array creation by a significant percentage (30-40% in my case, and that's for code that was producing the values by parsing a large text file).
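
The control flow of this hack can be mirrored at the Python level. The following is only a sketch of the idea, with `append_via_buffer` as a hypothetical name: grow the array by one placeholder element, then overwrite the last slot through the writable buffer. In pure Python it saves nothing, because `val` is already a float object; the gain exists only in C, where the value never has to become a Python object.

```python
from array import array

def append_via_buffer(arr, val):
    """Hypothetical helper mirroring the C hack: grow by one slot,
    then store the value through the array's writable buffer."""
    arr.append(0.0)                  # placeholder that reserves the new slot
    with memoryview(arr) as view:    # writable view over the underlying C buffer
        view[len(view) - 1] = val    # overwrite the placeholder in place
    # The view is released here, so the array can be resized again.

data = array('d')
for x in (1.5, 2.5, 3.5):
    append_via_buffer(data, x)
```

The view has to be released before the next `append`, which is why it is scoped with `with`.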

It turns out that an even faster method of creating an array is by using the fromstring() method. fromstring() requires an actual string, not a buffer, so in C++ I created an std::vector<double> with a contiguous array of doubles, passed that array to PyString_FromStringAndSize, and called array.fromstring with the resulting string. Despite all the unnecessary copying, the result was much faster than either of the previous versions.

Would it be possible for the array module to define a C interface for the most frequent operations on array objects, such as appending an item, and getting/setting an item? Failing that, could we at least make fromstring() accept an arbitrary read buffer, not just an actual string?
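
As a point of reference for the second request: in Python 3 the method is named `frombytes()`, and it accepts any bytes-like object. A minimal sketch follows, in which the `bytearray` is only a stand-in for a native buffer of doubles filled by C or C++ code:

```python
import struct
from array import array

values = [0.5, 1.25, -3.0]
# Stand-in for a contiguous native buffer of C doubles produced elsewhere.
raw = bytearray(struct.pack(f"{len(values)}d", *values))

arr = array('d')
arr.frombytes(raw)               # one bulk copy, no per-element float objects
arr.frombytes(memoryview(raw))   # other bytes-like objects are accepted too
```

This covers bulk loading from Python code, but it is still a method call and not a C-level append.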

1fd7a44c-f7f2-43ed-9c9f-bafa512b8598 commented 15 years ago

Arrays already support the buffer interface.

61a746e0-56f6-4f74-bbde-a98c5612db23 commented 15 years ago

Yes, and I use it in the second example, but the buffer interface doesn't really help with adding new elements into the array.
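
The limitation is visible from Python as well. A minimal sketch, assuming CPython's behaviour of refusing to resize an array while a buffer view on it is held:

```python
from array import array

arr = array('d', [1.0, 2.0])
view = memoryview(arr)
view[0] = 10.0            # in-place writes through the buffer work

try:
    arr.append(3.0)       # growing the array while the view is held fails
    resized = True
except BufferError:
    resized = False
finally:
    view.release()

arr.append(3.0)           # works again once the view has been released
```

The buffer gives access to the elements that already exist; adding elements still has to go through the array object itself.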

1409fbaa-f956-4d99-a567-589cf071a381 commented 15 years ago

I had a similar problem creating a C-fast array.array interface for Cython. The array.pxd package here (latest zip file) http://trac.cython.org/cython_trac/ticket/314 includes an arrayarray.h file, which provides ways to create and grow arrays efficiently from C (extend, extend_buffer, resize, resize_smart). It will probably be in one of the next Cython distributions anyway, and will be maintained. Perhaps array2 and arrayM extension subclasses (a very lightweight numpy) with a public API are coming soon too. It respects the different Python versions, so it's a lite "quasi API". And in case there is an (unlikely) change in future Pythons, the Cython people will take care of it as long as no official API comes up. Or perhaps most people with such an interest use Cython anyway.

pitrou commented 15 years ago

This has more chances of seeing some progress if you propose a patch.

1409fbaa-f956-4d99-a567-589cf071a381 commented 15 years ago

A first thing would be to select a suitable prefix name for the Array API. Because the Numpy people have 'stolen' PyArray instead of staying home with PyNDArray or so ;-)

In case somebody takes this on: besides PyList_-like functions and the existing members, I think that for speedy access (as in Cython's array.pxd) direct resizing, the buffer pointer, and something handy like this should be exposed directly:

int
PyArr_ExtendFromBuffer(PyObject *arr, void *stuff, Py_ssize_t items);
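
The proposed function does not exist, so its semantics can only be guessed. The following Python model is a sketch under two assumptions: `items` counts elements (not bytes), and a return value of 0 means success. `extend_from_buffer` is a hypothetical name chosen for this illustration.

```python
from array import array

def extend_from_buffer(arr, stuff, items):
    """Hypothetical Python model of the proposed PyArr_ExtendFromBuffer:
    append `items` elements read from the start of the buffer `stuff`."""
    nbytes = items * arr.itemsize          # assumed: `items` is an element count
    raw = memoryview(stuff).cast('B')      # view the source as raw bytes
    if nbytes > raw.nbytes:
        raise ValueError("buffer holds fewer than `items` elements")
    arr.frombytes(raw[:nbytes])            # bulk copy, no per-element objects
    return 0                               # assumed: 0 signals success

dst = array('d', [1.0])
src = array('d', [2.0, 3.0, 4.0])
status = extend_from_buffer(dst, src, 2)   # copies only the first two doubles
```

A C version would presumably resize the array once and `memcpy` into the new tail, which is the case the buffer interface alone cannot handle.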