openPMD / openPMD-api

:floppy_disk: C++ & Python API for Scientific I/O
https://openpmd-api.readthedocs.io
GNU Lesser General Public License v3.0

Using preallocated buffers in Python #1466

Open pordyna opened 1 year ago

pordyna commented 1 year ago

Problem

Hi! I was often running into situations where I wanted to do something like this:

import numpy as np

# "series" is an open openpmd_api Series; some_mrc and other_mrc are
# record components of that series.
a = np.zeros((2, 2, 1728, 100), dtype=np.float32)
a = np.ascontiguousarray(a)  # np.zeros already returns a contiguous array
a[0, :, :, :] = some_mrc[240:242, :, 100:200]
a[1, :, :, :] = other_mrc[240:242, :, 100:200]
series.flush()

So, reading different chunks into different parts of a preallocated NumPy array. But this fails with: ValueError: vector::_M_default_append

There seems to exist an, as far as I know, undocumented workaround that I just found:

a = np.zeros((2, 2, 1728, 100), dtype=np.float32)
a = np.ascontiguousarray(a)
some_mrc.load_chunk(a[0, :, :, :], offset=[240, 0, 100], extent=[2, 1728, 100])
other_mrc.load_chunk(a[1, :, :, :], offset=[240, 0, 100], extent=[2, 1728, 100])
series.flush()

The MeshRecordComponent.load_chunk method can take an object supporting the Python buffer protocol as an extra argument. But this is not as convenient to use as the slicing syntax with the [] operator.

Possible solution

It would be fantastic if something like a[0, :, :, :] = example_mrc[240:242, :, 100:200] could work, but I'm not sure how one would change the function's behavior based on the object on the left-hand side (though NumPy broadcasting seems to manage it somehow). Alternatively, one could add a new MeshRecordComponent method returning an object with a __getitem__ method, so that we could use syntax like this:

a = np.zeros((2, 2, 1728, 100), dtype=np.float32)
a = np.ascontiguousarray(a)
some_mrc.get_buffer_loader(buffer=a[0, :, :, :])[240:242, :, 100:200]
other_mrc.get_buffer_loader(buffer=a[1, :, :, :])[240:242, :, 100:200]
series.flush()

Or maybe there already exists some class exposed in Python that can convert the slicing syntax into an offset and extent? Something like: some_mrc.load_chunk(a[0, :, :, :], *io.offset_and_extent_getter[240:242, :, 100:200])
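Both proposals boil down to translating a tuple of slices into the offset/extent arguments that load_chunk expects. As a rough sketch of the idea only (offset_and_extent and BufferLoader are illustrative names, not existing openpmd_api API):

def offset_and_extent(slices, shape):
    # Translate a tuple of slices (one per dimension, step 1) into the
    # offset/extent lists that load_chunk expects.
    offset, extent = [], []
    for sl, dim in zip(slices, shape):
        start, stop, step = sl.indices(dim)
        assert step == 1, "only contiguous slices map to offset/extent"
        offset.append(start)
        extent.append(stop - start)
    return offset, extent

class BufferLoader:
    # Wraps a record component and a preallocated target buffer;
    # indexing the wrapper schedules a load_chunk into that buffer.
    def __init__(self, mrc, buffer):
        self.mrc = mrc
        self.buffer = buffer

    def __getitem__(self, slices):
        offset, extent = offset_and_extent(slices, self.mrc.shape)
        self.mrc.load_chunk(self.buffer, offset=offset, extent=extent)

# Usage, mirroring the hypothetical get_buffer_loader idea above;
# note that series.flush() is still needed before the data is valid:
BufferLoader(some_mrc, a[0, :, :, :])[240:242, :, 100:200]
series.flush()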

I don't know much about how the Python bindings work; otherwise, I would implement it myself.

pordyna commented 1 year ago

Btw, @ax3l, do you have any idea how this works with dask? Is it even possible to read only part of the array with dask? to_dask_array seems to define the chunk over the whole volume.

franzpoeschel commented 1 year ago

Problem

Hi! I was often running into situations where I wanted to do something like this:

a = np.zeros((2, 2, 1728, 100), dtype=np.float32)
a = np.ascontiguousarray(a)
a[0, :, :, :] = some_mrc[240:242, :, 100:200]
a[1, :, :, :] = other_mrc[240:242, :, 100:200]
series.flush()

So, reading different chunks into different parts of a preallocated NumPy array. But this fails with: ValueError: vector::_M_default_append

This seems weird. The above code should usually run into an even more subtle error. The underlying problem is that a line such as a[0, :, :, :] = some_mrc[240:242, :, 100:200] is equivalent to a call of np.ndarray.__setitem__(…), which will read from a buffer that has not yet been written to (series.flush() is only called later). This is an extremely nasty trap to fall into. I think that it is at least possible to give a warning/error in that situation (implemented by checking the buffer refcount during flush), but I don't know if it is possible to extend np.ndarray.__setitem__(…) so that it works as expected. Basically, this invokes the assignment operations on two NumPy arrays, which we have no control over.
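To make the evaluation order of this trap explicit, here is an annotated expansion of the failing line (an editorial sketch, assuming the same series and some_mrc as above):

chunk = some_mrc[240:242, :, 100:200]  # returns a NumPy array whose contents
                                       # are only filled in by series.flush()
a[0, :, :, :] = chunk                  # ndarray.__setitem__ copies *now*,
                                       # before the data has actually been read
series.flush()                         # fills "chunk" too late: "a" already
                                       # holds the stale copy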

I really don't know what causes the ValueError that you see, though.

There seems to exist an, as far as I know, undocumented workaround that I just found:

The documentation status of our Python bindings is a bit unfortunate, yes (#1328). But that is the intended API call for this purpose (so far). Since this entire situation is a nasty trap, there really should be documentation on it.

a = np.zeros((2, 2, 1728, 100), dtype=np.float32)
a = np.ascontiguousarray(a)
some_mrc.load_chunk(a[0, :, :, :], offset=[240, 0, 100], extent=[2, 1728, 100])
other_mrc.load_chunk(a[1, :, :, :], offset=[240, 0, 100], extent=[2, 1728, 100])
series.flush()

The MeshRecordComponent.load_chunk method can take an object supporting the Python buffer protocol as an extra argument. But this is not as convenient to use as the slicing syntax with the [] operator.

Possible solution

It would be fantastic if something like a[0, :, :, :] = example_mrc[240:242, :, 100:200] could work, but I'm not sure how one would change the function's behavior based on the object on the left-hand side (though NumPy broadcasting seems to manage it somehow). Alternatively, one could add a new MeshRecordComponent method returning an object with a __getitem__ method, so that we could use syntax like this:

a = np.zeros((2, 2, 1728, 100), dtype=np.float32)
a = np.ascontiguousarray(a)
some_mrc.get_buffer_loader(buffer=a[0, :, :, :])[240:242, :, 100:200]
other_mrc.get_buffer_loader(buffer=a[1, :, :, :])[240:242, :, 100:200]
series.flush()

Or maybe there already exists some class exposed in Python that can convert the slicing syntax into an offset and extent? Something like: some_mrc.load_chunk(a[0, :, :, :], *io.offset_and_extent_getter[240:242, :, 100:200])

I think that your last suggestion should be relatively straightforward to implement, given the helper functions that we already have in the Python bindings. I would prefer a solution that helps us avoid the above trap, though.

A third workaround suggestion: add a helper class io.LoadToArray or something similar such that:

io.LoadToArray(array)[:, :, :] = mrc[:, :, :]
# ... would resolve to:
mrc.load_chunk(array, [0, 0, 0], […])
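For this resolution to work, mrc[:, :, :] would have to hand back a lazy placeholder rather than a NumPy array (which, as the next comment notes, it currently does not). A hypothetical sketch of the shape such a helper could take, with LazyChunk and LoadToArray as illustrative names only:

class LazyChunk:
    # Placeholder describing a pending read: which component, which region.
    def __init__(self, mrc, offset, extent):
        self.mrc, self.offset, self.extent = mrc, offset, extent

class LoadToArray:
    # Wraps a preallocated target array; assigning a LazyChunk to a slice
    # of the wrapper redirects the deferred read into that buffer.
    def __init__(self, array):
        self.array = array

    def __setitem__(self, slices, chunk):
        chunk.mrc.load_chunk(self.array[slices],
                             offset=chunk.offset, extent=chunk.extent)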

franzpoeschel commented 1 year ago

Thinking about it further, I think it's a somewhat unfortunate API choice that something like mrc[:, :, :] directly returns a NumPy array at all. It would be much better if it returned a special type that makes it explicit that the buffer might not be ready yet. Returning the array directly means that we yield control over something that we are not done with yet.

Compare our implementation of the Span API which has a similar intermediate stage to make it explicit that the underlying pointer can change: https://github.com/openPMD/openPMD-api/blob/dev/include/openPMD/Span.hpp

A while ago, @ax3l suggested that we add a synchronous (~safe) mode for the Python API. This mode would require no explicit flushing, but would instead flush upon every operation. I would suggest a solution to this issue built on such a mode.
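The comment does not spell the proposal out further; purely as an editorial illustration of the idea, a synchronous read built on top of the existing bindings could look roughly like this (load_now is a hypothetical helper, not existing API):

import numpy as np

def load_now(series, mrc, offset, extent):
    # Allocate, read, and flush in one step, so the returned buffer is
    # valid immediately and the deferred-read trap cannot occur.
    out = np.empty(extent, dtype=mrc.dtype)
    mrc.load_chunk(out, offset=offset, extent=extent)
    series.flush()
    return out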

pordyna commented 1 year ago

I like the idea of a synchronous default in Python. Most of the time that I was helping someone with the API, they were either not flushing or flushing at the wrong position. A missing series.flush() was also the reason for my first issue in this repository ^^