mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License
28 stars 11 forks source link

deficiency in pybind11 bindings with numpy #474

Closed pavlis closed 7 months ago

pavlis commented 7 months ago

I just discovered pybind11's stock bindings we use to map a C++ std::vector into an array like object have a mismatch with numpy. Recall:

  1. We use the buffer protocol to make a TimeSeries and Seismogram data arrays more numpy like. They aren't numpy arrays but they can work as efficiently as numpy arrays and can substitute for numpy arrays in most contexts.
  2. The actual type pybind11's bindings define for our data arrays is a DoubleVector for TimeSeries and a dmatrix (a custom type) for Seismogram.

The issue is that we've used many examples before of passing a TimeSeries data vector to a numpy function. That works fine. Most (maybe all) numpy processing functions return an altered numpy vector with the result. The problem is the inverse mapping does not work. This tiny example illustrates the point:

import numpy as np
from mspasspy.ccore.seismic import TimeSeries
d=TimeSeries(100)
dnp = np.ones(100)
d.data=dnp
print(d.data)

This will throw an exception that generates this message:

TypeError: (): incompatible function arguments. The following argument types are supported:
    1. (self: mspasspy.ccore.seismic._CoreTimeSeries, arg0: mspasspy.ccore.seismic.DoubleVector) -> None

Noting the opposite case shown below will work correctly:

# load d with 2.0 to make sure this isn't an initialization issue
for i in range(d.npts):
    d.data[i]=2.0
y=np.array(d.data)
print(y)

which for me shows an array with all 2.0 values. i.e. that direction works.

I do not know for sure what happens with Seismogram's data array. Suspect that is worse as the numpy matrices use the "shape" concept to define generic multidimensional arrays.

This requires a fix in the pybind11 code I'm not quite sure how to do. Why this is important is that discussions we are having about adapting pytorch and/or tensorflow will hit this problem quickly as my understanding is both want input as numpy arrays.

wangyinz commented 7 months ago

hmmm…. I don’t think there is a problem with the implementation. Note that in the first example, it should work with d.data=DoubleVector(dnp). I think we use a lot of constructs like that in the code.

wangyinz commented 7 months ago

Just searched for examples and I am now pretty sure that is the case: https://github.com/mspass-team/mspass/blob/e6eeb2511c52f0ef6b5bcefa5a36682a9a951426/python/mspasspy/io/distributed.py#L671

I guess I should close this issue now.