openPMD / openPMD-api

:floppy_disk: C++ & Python API for Scientific I/O
https://openpmd-api.readthedocs.io
GNU Lesser General Public License v3.0

Using preallocated buffers in Python #1466

Open pordyna opened 1 year ago

pordyna commented 1 year ago

Problem

Hi! I was often running into situations where I wanted to do something like this:

import numpy as np

# "series" is an open openpmd_api Series; some_mrc and other_mrc are
# record components of that series.
a = np.zeros((2, 2, 1728, 100), dtype=np.float32)
a = np.ascontiguousarray(a)  # np.zeros already returns a contiguous array
a[0, :, :, :] = some_mrc[240:242, :, 100:200]
a[1, :, :, :] = other_mrc[240:242, :, 100:200]
series.flush()

So, reading different chunks into different parts of a preallocated NumPy array. But this fails with: ValueError: vector::_M_default_append

There seems to exist an, as far as I know, undocumented workaround that I just found:

a = np.zeros((2, 2, 1728, 100), dtype=np.float32)
a = np.ascontiguousarray(a)
some_mrc.load_chunk(a[0, :, :, :], offset=[240, 0, 100], extent=[2, 1728, 100])
other_mrc.load_chunk(a[1, :, :, :], offset=[240, 0, 100], extent=[2, 1728, 100])
series.flush()

The MeshRecordComponent.load_chunk method can take an object supporting the Python buffer protocol as an extra argument. But this is not as convenient to use as the slicing syntax with the [] operator.

Possible solution

It would be fantastic if something like a[0, :, :, :] = example_mrc[240:242, :, 100:200] could work, but I'm not sure how one would change the function's behavior based on the object on the left-hand side (though NumPy broadcasting seems to manage it somehow). Alternatively, one could add a new MeshRecordComponent method returning an object with a __getitem__ method, so that we could use syntax like this:

a = np.zeros((2, 2, 1728, 100), dtype=np.float32)
a = np.ascontiguousarray(a)
some_mrc.get_buffer_loader(buffer=a[0, :, :, :])[240:242, :, 100:200]
other_mrc.get_buffer_loader(buffer=a[1, :, :, :])[240:242, :, 100:200]
series.flush()

Or maybe there already exists some class exposed in Python that can convert the slicing syntax into an offset and extent? Something like: some_mrc.load_chunk(a[0, :, :, :], *io.offset_and_extent_getter[240:242, :, 100:200])
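Both proposals boil down to translating a tuple of slices into the offset/extent arguments that load_chunk expects. As a rough sketch of the idea only (offset_and_extent and BufferLoader are illustrative names, not existing openpmd_api API):

def offset_and_extent(slices, shape):
    # Translate a tuple of slices (one per dimension, step 1) into the
    # offset/extent lists that load_chunk expects.
    offset, extent = [], []
    for sl, dim in zip(slices, shape):
        start, stop, step = sl.indices(dim)
        assert step == 1, "only contiguous slices map to offset/extent"
        offset.append(start)
        extent.append(stop - start)
    return offset, extent

class BufferLoader:
    # Wraps a record component and a preallocated target buffer;
    # indexing the wrapper schedules a load_chunk into that buffer.
    def __init__(self, mrc, buffer):
        self.mrc = mrc
        self.buffer = buffer

    def __getitem__(self, slices):
        offset, extent = offset_and_extent(slices, self.mrc.shape)
        self.mrc.load_chunk(self.buffer, offset=offset, extent=extent)

# Usage, mirroring the hypothetical get_buffer_loader idea above;
# note that series.flush() is still needed before the data is valid:
BufferLoader(some_mrc, a[0, :, :, :])[240:242, :, 100:200]
series.flush()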

I don't know much about how the Python bindings work; otherwise, I would implement it myself.

pordyna commented 1 year ago

Btw, @ax3l, do you have any idea how this works with dask? Is it even possible to read only part of the array with dask? to_dask_array seems to define the chunk over the whole volume.

franzpoeschel commented 1 year ago

Problem

Hi! I was often running into situations where I wanted to do something like this:

a = np.zeros((2, 2, 1728, 100), dtype=np.float32)
a = np.ascontiguousarray(a)
a[0, :, :, :] = some_mrc[240:242, :, 100:200]
a[1, :, :, :] = other_mrc[240:242, :, 100:200]
series.flush()

So, reading different chunks into different parts of a preallocated NumPy array. But this fails with: ValueError: vector::_M_default_append

This seems weird. The above code should usually run into an even more subtle error. The underlying problem is that a line such as a[0, :, :, :] = some_mrc[240:242, :, 100:200] is equivalent to a call of np.ndarray.__setitem__(…), which will read from a buffer that has not yet been written to (series.flush() is only called later). This is an extremely nasty trap to fall into. I think that it is at least possible to give a warning/error in that situation (implemented by checking the buffer refcount during flush), but I don't know if it is possible to extend np.ndarray.__setitem__(…) so that it works as expected. Basically, this invokes the assignment operations on two NumPy arrays, which we have no control over.
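To make the evaluation order of this trap explicit, here is an annotated expansion of the failing line (an editorial sketch, assuming the same series and some_mrc as above):

chunk = some_mrc[240:242, :, 100:200]  # returns a NumPy array whose contents
                                       # are only filled in by series.flush()
a[0, :, :, :] = chunk                  # ndarray.__setitem__ copies *now*,
                                       # before the data has actually been read
series.flush()                         # fills "chunk" too late: "a" already
                                       # holds the stale copy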

I really don't know what causes the ValueError that you see, though.

There seems to exist an, as far as I know, undocumented workaround that I just found:

The documentation status of our Python bindings is a bit unfortunate, yes (#1328). But that is the intended API call for this purpose (so far). Since this entire situation is a nasty trap, there really should be documentation on it.

a = np.zeros((2, 2, 1728, 100), dtype=np.float32)
a = np.ascontiguousarray(a)
some_mrc.load_chunk(a[0, :, :, :], offset=[240, 0, 100], extent=[2, 1728, 100])
other_mrc.load_chunk(a[1, :, :, :], offset=[240, 0, 100], extent=[2, 1728, 100])
series.flush()

The MeshRecordComponent.load_chunk method can take an object supporting the Python buffer protocol as an extra argument. But this is not as convenient to use as the slicing syntax with the [] operator.

Possible solution

It would be fantastic if something like a[0, :, :, :] = example_mrc[240:242, :, 100:200] could work, but I'm not sure how one would change the function's behavior based on the object on the left-hand side (though NumPy broadcasting seems to manage it somehow). Alternatively, one could add a new MeshRecordComponent method returning an object with a __getitem__ method, so that we could use syntax like this:

a = np.zeros((2, 2, 1728, 100), dtype=np.float32)
a = np.ascontiguousarray(a)
some_mrc.get_buffer_loader(buffer=a[0, :, :, :])[240:242, :, 100:200]
other_mrc.get_buffer_loader(buffer=a[1, :, :, :])[240:242, :, 100:200]
series.flush()

Or maybe there already exists some class exposed in Python that can convert the slicing syntax into an offset and extent? Something like: some_mrc.load_chunk(a[0, :, :, :], *io.offset_and_extent_getter[240:242, :, 100:200])

I think that your last suggestion should be relatively straightforward to implement, given the helper functions that we already have in the Python bindings. I would prefer a solution that helps us avoid the above trap, though.

A third workaround suggestion: add a helper class io.LoadToArray or something similar such that:

io.LoadToArray(array)[:, :, :] = mrc[:, :, :]
# ... would resolve to:
mrc.load_chunk(array, [0, 0, 0], […])
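For this resolution to work, mrc[:, :, :] would have to hand back a lazy placeholder rather than a NumPy array (which, as the next comment notes, it currently does not). A hypothetical sketch of the shape such a helper could take, with LazyChunk and LoadToArray as illustrative names only:

class LazyChunk:
    # Placeholder describing a pending read: which component, which region.
    def __init__(self, mrc, offset, extent):
        self.mrc, self.offset, self.extent = mrc, offset, extent

class LoadToArray:
    # Wraps a preallocated target array; assigning a LazyChunk to a slice
    # of the wrapper redirects the deferred read into that buffer.
    def __init__(self, array):
        self.array = array

    def __setitem__(self, slices, chunk):
        chunk.mrc.load_chunk(self.array[slices],
                             offset=chunk.offset, extent=chunk.extent)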

franzpoeschel commented 1 year ago

Thinking about it further, I think it's a somewhat unfortunate API choice that something like mrc[:, :, :] directly returns a NumPy array at all. It would be much better if it returned a special type that makes it explicit that the buffer might not be ready yet. Returning the array directly means that we yield control over something that we are not done with yet.

Compare our implementation of the Span API which has a similar intermediate stage to make it explicit that the underlying pointer can change: https://github.com/openPMD/openPMD-api/blob/dev/include/openPMD/Span.hpp

A while ago, @ax3l suggested that we add a synchronous (~safe) mode for the Python API. This mode would require no explicit flushing, but would instead flush upon every operation. I would suggest a solution to this issue built on such a mode.
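The comment does not spell the proposal out further; purely as an editorial illustration of the idea, a synchronous read built on top of the existing bindings could look roughly like this (load_now is a hypothetical helper, not existing API):

import numpy as np

def load_now(series, mrc, offset, extent):
    # Allocate, read, and flush in one step, so the returned buffer is
    # valid immediately and the deferred-read trap cannot occur.
    out = np.empty(extent, dtype=mrc.dtype)
    mrc.load_chunk(out, offset=offset, extent=extent)
    series.flush()
    return out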

pordyna commented 1 year ago

I like the idea of a synchronous default in Python. Most of the time that I was helping someone with the API, they were either not flushing or flushing at the wrong position. A missing series.flush() was also the reason for my first issue in this repository ^^