mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

how to smoothly get python and C code to interact #17

Open pavlis opened 5 years ago

pavlis commented 5 years ago

Have been mucking around with some things on a separate git branch. A few things emerged that I could use another head to help me sort out.

First a couple successes:

  1. I have revised the seismic data objects. I have renamed the class previously called TimeSeries to CoreTimeSeries, and Seismogram has become CoreSeismogram. That was just some editing.
  2. I have written include files (headers only) that define an incomplete API for TimeSeries and Seismogram that will include hooks for anything related to mspass we think are appropriate. At the moment it includes only two things: (1) an ErrorLogger object to centralize error handling, and (2) a stub for ObjectID, which I think is the minimum required index to uniquely define a single document in Mongo.

Problems that surfaced:

  1. I tried to add the boost::serialization stuff I had used in the antelope implementation. I discovered with some annoyance (in retrospect I shouldn't have been surprised) that boost::any data cannot currently be serialized with the boost code. The reason is that C++ is strongly typed, and this apparently collides in some obscure way with the way boost's serialization code works. At least the chat I read on the web suggested there was little hope here. Not a big deal, as I don't think we want or need to go there anyway for mspass. I was only going to try that to make it easier to build a test program to read some data and save it in an external format. Scratch that idea.
  2. The main reason for creating this new issue is that it exposed the problem we are going to face in interchanging C++ and python code. It came up instantly because the new TimeSeries and Seismogram classes would use an ObjectID in the private area. Hence, the C++ code will need to be compiled with at least include and link paths to the MongoDB libraries. I am not sure how much of a problem this will create. ObjectID itself is a simple thing that could easily be passed around, as it is represented by a hex string (see the sketch after this list).
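
As a minimal sketch of the hex-string idea, assuming pymongo/bson are available and using hypothetical database/collection names, the round trip could look like this:

    from bson.objectid import ObjectId
    from pymongo import MongoClient

    client = MongoClient()            # local MongoDB instance
    col = client.mspass_test.wf       # hypothetical database/collection names

    # Python side creates the document; Mongo assigns the ObjectID.
    oid = col.insert_one({"sta": "AAK", "nsamp": 1000}).inserted_id

    # Only the 24-character hex string needs to cross into the C++ object.
    hexid = str(oid)

    # Coming back from C++, rebuild the ObjectID from the hex string for any query.
    doc = col.find_one({"_id": ObjectId(hexid)})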

I'm kind of stuck on how to start. I think I need a playground where I can launch an instance of mongo and experiment. The UITS machines are problematic as a playground because they put things in weird places and I can't do hack fixes to avoid getting stuck. My mac is the other option, but debugging code on macs has been a long-term problem for me. I have an old machine I've been meaning to get running anyway that already has Ubuntu installed. I think I'll try to use it as my playground. Any suggestions for an alternative?

Other points I think we should address in this thread are:

  1. What are the things we want to pass back and forth between C code and python? As always, we want numerically intensive algorithms to be in C/C++ to make them run faster.
  2. What converters do we need in the short term to maximize speed of implementation without putting us down a dark alley? It seems we need a seamless way to convert obspy python objects into something my C code can digest and vice versa. I think I know ways to do that, but what should be done in python and what in C is bewildering.

There will be others, but I'm stopping here as this is already too long.

wangyinz commented 5 years ago

Well, I actually don't know the answer to most of the points you listed here. Let me provide some thoughts.

> The main reason for creating this new issue is that it exposed the problem we are going to face in interchanging C++ and python code. It came up instantly because the new TimeSeries and Seismogram classes would use an ObjectID in the private area. Hence, the C++ code will need to be compiled with at least include and link paths to the MongoDB libraries. I am not sure how much of a problem this will create. ObjectID itself is a simple thing that could easily be passed around, as it is represented by a hex string.

I am concerned about linking the MongoDB libraries into the C++ code, as that could potentially increase the complexity of the whole project. Currently I think we should only let the Python code interact with MongoDB. Our Python code might call the C++ code internally, but the ObjectID should only live within the scope of the Python code. I guess we need the ObjectID because of the ErrorLogger. To keep the design consistent, maybe that means part of the functionality of ErrorLogger should be implemented in Python instead of C++. It is not clear to me how exactly we can do this, but I do think the general guideline is to limit the database-related code to Python only.

> I'm kind of stuck on how to start. I think I need a playground where I can launch an instance of mongo and experiment. The UITS machines are problematic as a playground because they put things in weird places and I can't do hack fixes to avoid getting stuck. My mac is the other option, but debugging code on macs has been a long-term problem for me. I have an old machine I've been meaning to get running anyway that already has Ubuntu installed. I think I'll try to use it as my playground. Any suggestions for an alternative?

I think you should be able to achieve all of this in a Docker container. I have the container set up to run everything, and you should be able to debug the code there.

> What are the things we want to pass back and forth between C code and python? As always, we want numerically intensive algorithms to be in C/C++ to make them run faster.

This is really the core of the problem here. I think anything that relates to the database should be in python. I need to get some more done in Python before I can answer this question. I think the model will be similar to what we're going to do with obspy - extend from the existing objects.

> What converters do we need in the short term to maximize speed of implementation without putting us down a dark alley? It seems we need a seamless way to convert obspy python objects into something my C code can digest and vice versa. I think I know ways to do that, but what should be done in python and what in C is bewildering.

Maybe that means we should design the python interface carefully so that it provides a uniform interface to the users, with the underlying obspy and our C++ library interchangeable. This shouldn't be difficult in Python once we have a clean C++ interface.
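
As a rough illustration of that idea (not a proposed API - the module and function names here are hypothetical), a thin Python layer could dispatch to either backend:

    import numpy as np

    def detrend(d, backend="obspy"):
        """Uniform entry point for a detrend operation on a Trace-like object d.

        backend="obspy" uses the obspy Trace.detrend method; backend="native"
        stands in for a wrapped C++ routine (hypothetical mspasspy.ccore.detrend).
        """
        if backend == "obspy":
            d.detrend("linear")
            return d
        if backend == "native":
            import mspasspy.ccore as ccore   # hypothetical binding module
            d.data = np.asarray(ccore.detrend(d.data))
            return d
        raise ValueError("unknown backend: " + backend)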

pavlis commented 5 years ago

I think we discussed some of this last week too, and it makes sense that we keep all db interactions in python. That is a helpful starting principle that reduces the confusion. A couple of followups though:

  1. I don't think ObjectID is an issue if we assume all db operations are in python. I think we can store the ObjectID in the C++ classes for now as a hex string that can be passed back and forth as easily as anything. If the python writer is asked to do an update for a piece of data, it can ask for the ObjectID, do a find, and then run the update methods. For writing, I think the C++ api is well suited to such an update. Creation isn't an issue since a new ObjectID will be automatically generated. Reads are more the issue for creating a C++ object. If the C++ code can't cleanly get a handle consistent with the one python owns, I think some kind of internal transfer format will be needed, which will create a serious inefficiency. We face the same issue with obspy really, as we'll need a simple way to read data already stored in the db - a core functionality. A challenge is to make the obspy reader and the C++ reader consistent.
  2. I think building a consistent interface for obspy and my C++ code is pretty easy for TimeSeries data but problematic for 3C data. obspy Trace objects basically use a dict to store header/Metadata attributes and a numpy array to hold the data. Both map pretty closely to the TimeSeries class (now CoreTimeSeries). I think we could, at least initially, use the obspy model and just write a python translator to take a TimeSeries and make it an obspy Trace object (see the sketch after this list). Then writers we make for obspy should work just as well, although with some overhead, for TimeSeries objects. obspy constructors we produce can be passed to a converter in either C or python to go the other way. 3C data are a much bigger problem, as obspy's concept does not mesh with the concepts in my library. A definitive example I see that uses 3C data is their polarization module. It seems they just require 3 parallel numpy arrays, which in my class is the transpose of the way I store things - a typical C-FORTRAN mismatch. I guess I would urge we give some thought to how we will store 3C data. I know I'm biased, but I think the way I do 3C seismograms is completely superior to the way obspy does it. Antelope's programming library has the same issue because they focus on channel as a key concept. So much 3C processing gets ugly if you have to deal with channels and mapping. Best to assemble the data into a 3C form up front and get rid of the baggage. I have a real example in dbxcor. It takes many seconds to read a 2 minute block of USArray data using the antelope db and all the baggage of assembling all the channels into a working gather. Prebuild the ensemble and read with something like serialization, and the same data can be read in a fraction of the time. I am going to try to write a prototype for discussion.
  3. Note ErrorLogger is not the issue. It should be easy, once the wrappers work, to write a python function that will check the contents of the ErrorLogger and act in a consistent way.
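
To make item 2 concrete, here is a minimal sketch of the TimeSeries-to-Trace direction, assuming the wrapped TimeSeries exposes a sample vector, a sample interval, a start time, and a string getter; every attribute name here is an assumption, not a settled schema:

    import numpy as np
    from obspy import Trace, UTCDateTime

    def timeseries_to_trace(ts):
        """Convert a (hypothetical) wrapped TimeSeries to an obspy Trace.

        Assumes ts.s is the sample vector, ts.dt the sample interval in seconds,
        ts.t0 the start epoch time, and ts.get_string() a Metadata getter.
        """
        tr = Trace(data=np.asarray(ts.s, dtype=np.float64))
        tr.stats.delta = ts.dt
        tr.stats.starttime = UTCDateTime(ts.t0)
        tr.stats.station = ts.get_string("sta")
        return tr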

So my next step to clean this up, I think, is to be more concrete about how we build this api in C++. Give me a few days.

wangyinz commented 5 years ago

I am not exactly sure why an internal transfer is needed between the Python and C++ code. I was thinking of making something that works like this:

[diagram: proposed architecture, with the Python component sitting between the C++ library and ObsPy; a direct C++ to ObsPy connection is drawn with a dashed line]

It seems to me that we don't actually need to make the connection between C++ and ObsPy (marked by the dashed line) work. The Python component should be able to hide the complexity underneath it. Also, the Python component here should be more than just a bunch of database CRUD routines. It should have its own definitions of TimeSeries and 3C Seismogram that extend the ones in the C++ Python bindings. It should also be able to utilize the routines available in ObsPy through wrappers. Then, the 3C data there should follow your design rather than ObsPy's. I do think your design is better for handling the data in a workflow. Also, once it is in Python, we can use pickle for serialization easily. In this design, the C++ component is more like a seispp library 2.0 - it could be a standalone library for C++, and the Python code extends its abilities through added features like pickle and MongoDB. In the meantime, we can still implement serialization with boost.serialization, but that won't be used by Python.

pavlis commented 5 years ago

MsPASS Core I/O API

1. Design assumptions:

  1. System will be using MongoDB
  2. All db interactions will be written (at least initially) in python to avoid the suspected pitfalls of inconsistent handles for C++ and python.
  3. Speed is important. Utilize C++ where possible for compute intensive tasks.
  4. Each seismic object (3C or scalar seismograms – should be treated uniformly) should define a single document in a collection.
  5. Ensembles should be defined by groups of documents – at least initially – defined by a group (and optional sort) clause. I think the method is db.who_is_range_aggregate, but that is an implementation detail we need to hide from the average user. i.e. the handle for an ensemble needs to provide a simple way to specify the keys that define the grouping.

2. Organization: This design document is organized in sections around the fundamental “CRUD” (Create Read Update Delete) database operations. A useful python reference randomly found on the internet that provides the foundations of this: https://medium.com/@MicroPyramid/mongodb-crud-operations-with-python-pymongo-a26883af4d09

2.1. C-reate

The obvious core tool for this is collection col; col.insert_one( …). I think a comparable tool for an ensemble is insert_many if we accept assumption 4, but I suspect it will be easier to do with a loop over ensemble members, saving each member with a variant of insert_one().

I’ll initially define a set of procedures that might form the API. These are cast here as methods that assume “this” has entities (e.g. a collection object) that define where the data would be written. If we make it purely procedural each function below would need an argument defining the db handle (probably a collection object as I read it).

I know this isn’t proper python syntax, but I hope it gets the idea across:

    // saves an obspy Trace object. Some attributes are required in obspy, some are optional.
    // The optional list should name things that are not core that should be saved. It will need a default.
    def dbsave(Trace t, Stats s, optional list of keys of values to save)

    // save a TimeSeries object. The savelist contains a list of attributes to be written to the db.
    // This will require some features similar to the obspy version above, but with some variations.
    def dbsave(TimeSeries d, MetadataList savelist)

    // similar for 3C data
    def dbsave(Seismogram d, MetadataList savelist)

    // Ensembles are just collections of one of the above. The simplest approach is to have the
    // writer just be a loop with calls to dbsave for the member type. May not be the fastest,
    // but remember the axiom: make it work before you make it fast.
    // Ensemble metadata should always be saved and the algorithm should make sure ensemble
    // values override any values in the members if they are present in both places.
    def dbsave(TimeSeriesEnsemble d, MetadataList savelist_ensemble, MetadataList savelist_members)
    def dbsave(ThreeComponentEnsemble d, MetadataList savelist_ensemble, MetadataList savelist_members)

Now I think the algorithm for all of these is roughly the same, but the implementation for obspy is rather different. Let me start with a generic algorithm for the C++ objects that use metadata.

1) For each s in MetadataList:
   a. Switch on type
   b. Call the proper get method for the type (handle errors gracefully)
   c. Push the result to a python dict
2) Call a generic function to write the dict to mongo
3) Call a function to write the data array of sample values with fwrite or a gridfs
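
A minimal pymongo sketch of that algorithm, assuming a wrapped object d with typed Metadata getters and a sample vector d.s (the getter names, the wf collection, and the gridfs_id key are all placeholders, not the final API):

    import numpy as np
    import gridfs

    def dbsave_timeseries(db, d, savelist):
        """Sketch of the generic writer: metadata -> document, samples -> gridfs.

        db is a pymongo Database, d a hypothetical wrapped TimeSeries, and
        savelist a list of (key, type) pairs to copy from d's Metadata.
        """
        doc = {}
        for key, mdtype in savelist:
            try:
                if mdtype is float:
                    doc[key] = d.get_double(key)    # hypothetical typed getters
                elif mdtype is int:
                    doc[key] = d.get_int(key)
                else:
                    doc[key] = d.get_string(key)
            except Exception as err:
                print("dbsave warning: could not fetch", key, ":", err)
        # store the sample array in gridfs and keep the file id in the document
        fs = gridfs.GridFS(db)
        doc["gridfs_id"] = fs.put(np.asarray(d.s, dtype=np.float64).tobytes())
        return db.wf.insert_one(doc).inserted_id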

For obspy:

1) Write the header dict data
2) Call the same function as above to write the data array of sample values

Obspy doesn’t have ensembles per se, but their Stream object is nearly identical in concept to a TimeSeriesEnsemble. It just doesn’t have a group header. A writer for one of these would be just a loop over the list of Trace objects calling the above function.

For the C++ code the ensemble metadata creates a complication that could be very useful, but which we will have to document carefully if we go this route. I would recommend the ensemble metadata ALWAYS be definitive over any member data AND be required to be stored in each member's document. That will require writing a (quite trivial) C++ function with this prototype:

    template <class Tdata> void clone_ensemble_metadata(Ensemble<Tdata>& d);

with a python wrapper. I think it might be possible to implement it with the += operator for Metadata, which we can wrap. This thing should not need an exception handler, as the algorithm is a loop over members calling put methods for each ensemble metadata attribute on each member. Once that function is called, the writer for any ensemble is just a loop over the members, just like the algorithm for an obspy Stream.
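
If the wrapped Metadata ends up behaving like a python dict, the same idea can be sketched on the python side (the member and md attribute names are illustrative only):

    def clone_ensemble_metadata(ensemble):
        """Copy every ensemble-level attribute into each member's metadata.

        Assumes ensemble.md and member.md behave like python dicts; ensemble
        values are definitive and overwrite any member value with the same key.
        """
        for member in ensemble.member:
            for key, value in ensemble.md.items():
                member.md[key] = value
        return ensemble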

2.2. R-Read

This is the hardest problem. I don't see a way to do this consistently in obspy and the C++ library. I'll start with the C++ library since that is the one I've been thinking about most and know best.

I think the basic loader could have this signature:

    // return type T – I think in python this requires coding each supported T separately
    def dbread_T(ObjectID oid, (optional) MetadataList mdl)

I think the python script would call find to locate the right document. Then it would need a loop to go through mdl, fetching each key:value pair in the list and posting it to a dict. So one base procedure/method would be the case where T is a python dict. Seismic data readers would then take a form like this pseudocode:

1) Header = dbread_Metadata(oid, mdl)
2) D = dbread_array(oid)   // This would load a float or double array D with values
3) Build T from Header and D (different for different class types)
4) Return T
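
A minimal pymongo sketch of steps 1 and 2, mirroring the writer sketch above (the wf collection, the gridfs_id key, and the typed metadata list are all assumptions):

    import numpy as np
    import gridfs

    def dbread_metadata(db, oid, mdl):
        """Fetch the document for oid and copy only the keys named in mdl into a dict."""
        doc = db.wf.find_one({"_id": oid})
        return {key: doc[key] for key, _ in mdl if key in doc}

    def dbread_array(db, oid):
        """Load the sample array stored in gridfs by the writer sketch above."""
        doc = db.wf.find_one({"_id": oid}, {"gridfs_id": 1})
        fs = gridfs.GridFS(db)
        return np.frombuffer(fs.get(doc["gridfs_id"]).read(), dtype=np.float64)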

I think the only variation for obspy is that the mdl would have more importance and have some frozen keys. More on that below.

2.3. U-Update

Updates are a variant of the create algorithm. The primary difference for C++ objects is that the algorithm needs to fetch the ObjectID and call a find method before calling the update method. However, a feature I have in this version of Metadata makes the algorithm slightly more complicated but can likely make a significant difference in performance. That is, it has a private list of keys for the values that were changed since construction. (Note: I just noticed a hole in that api – there should be a method to retrieve that list of keys since it is not public.) The algorithm then should just take that list of keys and update only those.
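
With pymongo, the update restricted to the changed keys might look like this sketch (modified_keys() is the hypothetical accessor noted above, and the other names are placeholders):

    from bson.objectid import ObjectId

    def dbupdate(db, d):
        """Update only the metadata attributes changed since d was constructed.

        Assumes d.get_objectid_string() returns the hex ObjectID stored at read
        time, d.modified_keys() the keys changed since construction, and d.md
        a dict-like view of the Metadata.
        """
        oid = ObjectId(d.get_objectid_string())
        changes = {key: d.md[key] for key in d.modified_keys()}
        if changes:
            db.wf.update_one({"_id": oid}, {"$set": changes})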

For obspy I think the only thing we can do is call the update method instead of the one that creates a new document. We will need to put the ObjectID in the stats dict.

2.4. D-Delete

This is simple provided we ensure all objects, including the obspy Stats dict, carry the ObjectID from which each data member was constructed.
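
The corresponding pymongo call is essentially a one-liner (same hypothetical wf collection):

    from bson.objectid import ObjectId

    def dbdelete(db, hexid):
        """Delete the document an object was constructed from, given its hex ObjectID."""
        db.wf.delete_one({"_id": ObjectId(hexid)})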

3.0 Defining common attribute keys

Both the python dict and Metadata are completely agnostic about what the key for any attribute should be – they use generic containers. In many ways the “schema” for MongoDB is little more than a specification of the keys and the value types for those keys. This gets into something commonly called an ontology, a topic that can become nauseating if handled by a committee. We only need part of that topic, namely the namespace, but the term is worth knowing when speaking with some audiences.

Some points we need to consider in this design:

  1. We probably need to implement a concept of required metadata versus optional metadata. All constructors should throw an exception if a required attribute is missing, but optional attributes should only create a log complaint when missing (see the sketch after this list).
  2. We probably need to group attributes by concept following the CSS3.0 relations. The obvious ones are: source attributes, receiver attributes, catalog (e.g. arrival time) attributes, waveform attributes, and processing attributes. The latter might ultimately need to be forked into subsets.
  3. We may want to define some core supported types that can be handled more efficiently than generic types. That is, all languages I know can handle the following faster than anything generic: real, integer, string, and boolean (the only things the old Metadata implementation supported directly). We might want to handle more complicated things, like an error log entry, differently. Not sure about this though.
  4. Related is the fact that type is a point where python and C++ collide most strongly. C++ is strongly typed, although not the most dogmatic language I've seen, while python is as loosey-goosey about type as any language I've seen. The consequence is that any C++ implementation will need to be careful about handling types sent to it through a python wrapper.
  5. We can and should use the AttributeCrossReference object to handle name collisions. The most effective way to do that is to be determined, but the most likely use of this is some translation code that would translate one set of names to another. An example might be Seismic Unix names to our standard names. I have an existing C++ program to translate SU format data to the seispp framework that could easily be converted to mspass. Might be a useful early import module, but definitely not a starting point.
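
A sketch of how item 1 might look on the python side, assuming a MetadataDefinitions-like object with a type(key) method and a caller-supplied list of required keys (every name here is illustrative, not a settled API):

    def validate_metadata(doc, mdef, required_keys):
        """Raise for missing required attributes; only complain about the rest.

        doc is the dict read from MongoDB, mdef a hypothetical definitions object
        whose type(key) method returns the expected python type for a key, and
        required_keys the list of attributes a constructor cannot live without.
        """
        for key in required_keys:
            if key not in doc:
                raise KeyError("required metadata attribute missing: " + key)
        complaints = []
        for key, value in doc.items():
            try:
                expected = mdef.type(key)
            except Exception:
                complaints.append("attribute " + key + " is not defined in the schema")
                continue
            if not isinstance(value, expected):
                complaints.append(key + " has unexpected type " + type(value).__name__)
        return complaints   # caller would post these to the ErrorLogger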

I think our first step, as stated in your original proposal, is to define the required attributes for waveform data. These should give us the experience we need to go further.

pavlis commented 5 years ago

Ian, I looked at your tests on the experimental branch for python. I have also spent quite a few hours trying to crack the boost documentation on installing and testing boost python. I am very frustrated by the documentation for boost python. I am pretty certain it has not been updated to reflect some major changes in the software. I'm hoping you can comment on the following points to confirm or deny I have this right and address questions:

  1. It seems what all the web sources called "bjam" have become "b2". Did you find this to be true?
  2. Building their test suite requires using bjam/b2. Did you ever get their test suite to build and run? I have struggled to fill in the obvious holes in the documentation like the user initialization file and how you hack their configure script. Next point describes why I quit trying.
  3. When I looked back at your experimental branch that built wrappers with boost python for mspass, I realized that I think you figured out how to use cmake to replace bjam/b2. Do I understand that right? If so, I know how to proceed. I'll just hack your cmake configuration.

No matter how we do the build, I am definitely going to write the wrapper code manually for the C++ interface. We do not necessarily want to expose all C++ class methods and functions to python. Some are better used only as C++ internals.

wangyinz commented 5 years ago

Actually, I didn't even start with the boost documentation, so I never tried the bjam related stuff. Instead, I think I found an example of using cmake with boost.python somewhere, and that's how I got the code to compile. I guess you should probably just use my cmake setup as the template. It is pretty straightforward anyway.

pavlis commented 5 years ago

Ok, I'll use the cmake configuration you created. Note the boost version is anything but straightforward as bjam/b2 is yet another build system.

pavlis commented 5 years ago

Working through this. I realized I had to make some changes to the cmake configuration files. Learned some useful things there, but I'm stuck on a configuration question I have no easy way to answer. I think you can answer it quickly though.

What I'd like to know is whether the version of boost you installed has a boost python library of some kind created by the install process for boost python. The issue is what is supposed to be done by this set of lines in your cmake configuration file in the cxx directory of the python branch:

    FIND_PACKAGE(Boost COMPONENTS python)
    if (NOT Boost_PYTHON_FOUND)
      FIND_PACKAGE(Boost REQUIRED COMPONENTS python${PYTHON_VERSION_MAJOR}${PYTHON_VERSION_MINOR})

Is there a library this is supposed to set that defines this line?

    message(STATUS "Boost_LIBRARIES = ${Boost_LIBRARIES}")

Or is that supposed to be set in the probe for boost? Anyway, if you can tell me what Boost_LIBRARIES resolves to when you get this to work it would be helpful.

wangyinz commented 5 years ago

So, there is a difference between the FIND_PACKAGE(Boost COMPONENTS python) line and the FIND_PACKAGE(Boost) line earlier in the CMakeLists.txt. The latter will only look for the standard boost headers (which only defines Boost_INCLUDE_DIRS), while the former will find the boost.python library component in addition to the headers, and it will define the Boost_LIBRARIES variable (in the docker image, which has Ubuntu Bionic, it is defined as Boost_LIBRARIES = /usr/lib/x86_64-linux-gnu/libboost_python3.so). Note that depending on the version of the Boost library and the version of Python, the library may have different names (e.g. libboost_python.so, libboost_python2.7.so, libboost_python3.6.so, etc.). That's why I have the if branches, which try to capture all the different combinations.

These lines make the earlier FIND_PACKAGE(Boost) line redundant, and the current setup for finding boost.python will fail instead of installing our own version when it is not found. I do think I should revise this behavior, but I plan to do it later when all the python stuff is more solid.

Since you are working on an Ubuntu system, I think you should probably look at the Dockerfile, which has all the packages and setups based on Ubuntu. You should be able to get everything working following exactly the same recipe.

pavlis commented 5 years ago

Picking this thread up here after initial testing with boost python, pybind11, and cython. From that work (discussed in "boost python setup #19 ") I have these important conclusions for implementing MsPASS:

  1. boost python is out.
  2. pybind11 is a new improved version we should use as a replacement for boost python. It appears to be a more sustainable solution for wrapping C++ classes. It is very difficult to work with, however, and should be used only internally. i.e. we should NEVER expect any user to dive into that monstrosity.
  3. cython seems to be a key tool for us. As noted in another post, a key concept is that the development cycle for an algorithm with cython in a research environment is rational:
     a. Write a prototype in python for single threading.
     b. If necessary, test and adapt for Spark.
     c. Convert the python to cython to produce a compiled version for speed when necessary.
     d. As always, use libraries for speed, but if necessary develop new library code in C/C++ to further improve performance.

Only in the last step would a C/C++ wrapper with pybind11 be needed.

A residual issue is that I am still struggling to refine the line between C/C++ code and cython. A nasty issue with my seismic code is that the C++ code uses multiple inheritance. cython only recently seems to have added support for multiple inheritance. It is a hard problem because python doesn't support multiple inheritance the way C++ does. There doesn't seem to be a ton of information on the web on this topic, and the standard documentation doesn't address it at all. I have been unable to even make a wrapper for a simple test class work, but I still may have something wrong in this implementation, which is almost as complicated as pybind11. This morning I had a different idea I am going to pursue that might make this a lot simpler. That is, with cython I realized I don't always need to actually wrap the C++ classes. A better model may be to write C++ procedures that call C++ library routines and use cython to write the wrappers for those functions - much, much easier than wrapping a complicated class like TimeSeries that uses a lot of advanced C++ features.
I'm going to start by trying to write an obspy Trace to TimeSeries converter that can be compiled with cython. There will be some new challenges there, but it would yield a useful core function anyway.

pavlis commented 5 years ago

Continued work with cython was not encouraging for using it as the vehicle for building our C++ to python api. I wrote a simple toy problem with multiple inheritance and ran into a long string of problems. Like pybind11 it was easy to build an interface for simple procedures and simple C++ classes. I did not crack the inheritance problem and kept hitting some odd issues that revealed some of the warts in cython.
In a fit of frustration I returned to pybind11 and tried the simple fix you suggested of changing m=mspasspy.Metadata to m=mspasspy.Metadata(). Happily this worked perfectly. I could put and get double, int, and string variables without any issues. I did, however, discover a mismatch between the way the two languages define a boolean variable that is going to take some interfacing to clean up. Fortunately, booleans are not a top priority, as not many data formats explicitly include them; most (e.g. SEGY) use integers, with 0 defining false and anything else true.

So that experience is changing my perspective. I suggest we may need to deal with both cython and pybind11. Here is what I would now advise:

  1. pybind11 should be viewed as a C++ developer's tool only. i.e. unless I hit another huge gotcha, my plan is to use pybind11 to implement the api between python and my C++ libraries.
  2. I think the model suggested above for research code development that stresses cython should be the main externally advertised approach. For sure, we'd best have a working example for a tutorial before we do a release so that advice makes sense, but I think we can assume that would be the right development path.

So, I'm going to now move to flesh out the rest of the wrappers for TimeSeries and Seismogram objects using pybind11. Wish me luck

pavlis commented 4 years ago

A brief update on this issue. I've made a lot of progress on building wrappers for the C++ code to python. All the metadata related stuff has passed my current test program. I have CoreTimeSeries working as well. It took me a bit to understand an excessively terse description in the documentation on the proper way to map a std::vector container. It turns out you have to call a macro to bind a specific instantiation like vector<double> to a python name. Once I figured that out, the vector container looks just like a python list of doubles.

I have only two residual hurdles to get over to finish the core code bindings:

  1. I cannot make enums work on the python side. It is confusing because I can ask for a value that is bound by the wrapper to a name, and it will show that name but not work in an if conditional. This is definitely an easy fix once I figure out the right incantation.
  2. The dmatrix usage in the Seismogram object is going to be a tougher nut to crack (I think). I think I found a solution in the documentation that provides a way to map the data in a dmatrix object to a numpy array object. They have a clean method for manipulating numpy arrays across the boundary. Can't tell yet how close I am to fixing that. I think their example provides a guide I can adapt.

Overall this has been a lesson in something that really requires immersion into the documentation to be possible. The pybind11 documentation is terse, and you have to know both python and C++ (especially) to comprehend what they are saying. They leave way too many holes that can only be filled by looking at their examples. The problem with many of their examples is that they are full of complicated template constructs that I find really hard to take apart and comprehend.

Anyway, I guess the key point for the record is I'm going to be the one to need to maintain these wrappers for as long as I'm around and capable. The wrapper code itself has tricks that make it a specialization that you don't want to learn other than adapting what we'll have here.

pavlis commented 4 years ago

As you noticed, I checked in a new branch with the pybind11 wrappers and it fails to compile on all the pieces travis runs. I think I know the reason, but I don't know enough about cmake to fix it. The top level cmake file has this section to configure pybind11:

    if (PYTHONINTERP_FOUND)
      if (PYTHON_VERSION_MAJOR EQUAL 3)
        find_package(pybind11 REQUIRED)
        if(NOT pybind11_FOUND)
          message("pybind11 not found")
        endif()
      endif()
    else()
      message("Python not found")
    endif()

That is failing, and the key line, I think, is this one:

    -- Found PythonInterp: /opt/pyenv/shims/python (found version "2.7.15")

The logic of the above must be wrong, as it was supposed to exit if the interpreter is not python3. pybind11 only works with python3. It works on my ubuntu system because python3 is set as the default there.

The other thing, which is your call, is what to do with the installation of pybind11. I don't know if the find_package line should download and install pybind11, but that is what is failing. On my system it works because I installed pybind11 in the default location of /usr/local.

pavlis commented 4 years ago

I am hacking to get this to work on IU's carbonate system. For the record:

  1. I had to use their "module" feature to enable python3, which pybind11 requires and which the configuration files I pulled from pybind11 properly enforce.
  2. I then got this error from cmake, which is fairly informative:

    CMake Error at CMakeLists.txt:66 (find_package):
      By not providing "Findpybind11.cmake" in CMAKE_MODULE_PATH this project has
      asked CMake to find a package configuration file provided by "pybind11",
      but CMake did not find one.

      Could not find a package configuration file provided by "pybind11" with any
      of the following names:

        pybind11Config.cmake
        pybind11-config.cmake

      Add the installation prefix of "pybind11" to CMAKE_PREFIX_PATH or set
      "pybind11_DIR" to a directory containing one of the above files.  If
      "pybind11" provides a separate development package or SDK, be sure it has
      been installed.

    -- Configuring incomplete, errors occurred!
    See also "/N/u/pavlis/Carbonate/src/mspass/cxx/build/CMakeFiles/CMakeOutput.log".
    See also "/N/u/pavlis/Carbonate/src/mspass/cxx/build/CMakeFiles/CMakeError.log".
    [pavlis@i8 build]$

Not sure which of the suggestions to follow. If pybind11 has a "development package or SDK" that would be a good solution. I am assuming this worked for me on my ubuntu machine because I was able to install pybind11 in the stock location of /usr/local.

pavlis commented 4 years ago

This is your decision, Ian, but I think the experimental branch implementing pybind11 is ready to be integrated into the master branch for this repository. pybind11 makes boost python obsolete, and I think we now know the role of cython is to allow users to speed up python scripts they develop by compiling them. It remains to be seen whether pybind11-wrapped C++ code will play well with cython, but the chatter on the pybind11 github site and the existence of test code with cython suggest those developers are on top of that issue and we should let them worry about it.

Before you merge the experimental branch, however, I would suggest you test that you can at least build the experimental branch on your mac. Would suggest you also run the simplified test I created this morning. I have a larger one that tested the seismogram and timeseries wrappers on the inaccessible machine at home. Suggest we can use the simplified test below as the core of a formal test later or you can build on this one and create your own.

This script tests only the new MetadataDefinitions implementation:

    import mspasspy as msp
    mdf=msp.MDDefFormat.PF
    m=msp.MetadataDefinitions('test.pf',mdf)
    s=m.concept('dt')
    print(s)
    a=m.keys()
    print(a)
    b=m.aliases('t0')
    print(b)

You will need this file in the same directory where you run the above script. Copy and paste the following into 'test.pf':

    dt &Arr{
        type real
        concept &Tbl{
            Sample interval in seconds.
        }
    }
    nsamp &Arr{
        type int
        concept &Tbl{
            Number of samples in a waveform segment.
        }
    }
    sta &Arr{
        type string
        concept &Tbl{
            Seismic station name code.
        }
    }
    t0 &Arr{
        type double
        concept &Tbl{
            Start time of a signal.
        }
    }

    aliases &Tbl{
        dt delta
        t0 starttime
        sta station
    }

Also note I have already done the clerical work of creating a pf file like this for the full obspy Trace object stats content, but again that is stranded on my machine at home that is currently off and behind a home router.

This example raises an implementation issue: it would be more rational to store the MetadataDefinitions data as json/bson. We should probably have a json-based constructor in C++ to start, and if we can figure out how to pass a MongoDB handle to C++ we could have a db constructor. Anyway, THE KEY point is that we definitely ought to support json. I think you have some standard libraries for this in C++ or know how to get them. I found some things in some old code of yours once, but I don't think I have that on this laptop. What do you suggest for this?

wangyinz commented 4 years ago

I think you are right. Our proposed workflow for pybind11 and cython should be fine for now.

I agree that the branch is ready to be merged. You may not have realized that I already fixed the docker and cmake build files while you were working on fixing other issues. You can see that the whole package builds fine in Travis (example). The build shows as failed solely because the metadata test got a seg fault in it. I need to take a look to see what's going on, but I think it should be something minor.

I will test out the python script and figure out how to integrate that into the CTest we have right now.

I agree that we need to at least support json as the configuration format. I think I once wrote a version of json-style metadata in your seispp library. I will see if I can still find it. It should be easy to adapt that to the current framework.

pavlis commented 4 years ago

Send me your code when you find it or, better yet, modify it to mesh with bson and MongoDB. In particular, I think bson may support some types not supported by stock json, but I am not sure of that.

I see you are actively messing with this. I may have broken Metadata, as I tweaked a couple of things in building the wrappers. I'll look at the history logs to see if I can get a hint why it is seg faulting. Could be an api change that somehow didn't get propagated through.

wangyinz commented 4 years ago

I already figured it out. It is not really a seg fault. It is just the error log test, which throws some errors deliberately. The real issue is that this test is not written to comply with CTest. CTest determines whether a test is successful or not based on the exit code of a command. Anyway, not something to really worry about for now.

wangyinz commented 4 years ago

Finally found the code here. I can't believe this was done 5 years ago. Let me see how to make it work with bson.

wangyinz commented 4 years ago

I was reading through the pybind11 code, trying to learn it so that I can at least understand how to change it if something comes up down the road. I realized that the get method of Metadata is not bound yet, and I wonder if that is due to the use of boost::any. I came across this piece of code that seems to serve as a good example of how to make pybind11 work with boost::any. What do you think?

pavlis commented 4 years ago

That code looks very interesting. I was planning a model where the MetadataDefinitions would be used to enforce type attached to any valid metadata. This could be more flexible, but I can't guess the relative cost in computational effort.

The reason I hadn't done the wrappers for the templated get is that I didn't see a way to do it. I think it would be possible to do something like define a python-only method using the get template. For example, I think we could define a get_float that is a call to the get template instantiated for float. Unless I missed it, there is no automatic way to translate templates of anything to python except as specific instantiations.

wangyinz commented 4 years ago

Now I understand the issue much better. For documentation purposes: the get template cannot be overloaded through pybind11 because the return type check happens at runtime within our own implementation of the method, so there is no way for pybind11 to tell the correct type before the method is called.

I think we definitely don't want to impose that kind of limit on data types on python programmers, as that will be very confusing in a duck-typing language. Potential solutions are implementing either a python wrapper around all the get_* methods or another instance of the get method with boost::any as the return type, using pybind11's custom type caster to convert that type. The former will be easier to implement, and the latter will be cleaner to build and maintain. Either will need some effort to learn how to make it work with what we have here.

pavlis commented 4 years ago

Not sure I agree with that, but my view is probably colored by a career working with languages where type was always a concern and hence just something to automatically consider. MongoDB will enforce type, as all data are stored with explicit type information one way or the other. I think we invite chaos if attributes can be stored without any way of enforcing the type associated with a specific key. That is at least true for any common attributes we want to define as a form of schema. So, I'm not sure we aren't inviting future problems by being too loose about type for attributes.

wangyinz commented 4 years ago

Well, I probably expressed this in the wrong way. Actually, I totally agree that we need to have a schema for the database. The point above is really only about the get method of Metadata, which is functional in C++ but not in Python. I was just exploring how to resolve that issue - how to translate the returned template type from C++ to Python. This will not bring chaos, as the type a user can get is already restricted by the keyword being used. This template is really serving the purpose of simplifying the code.

pavlis commented 4 years ago

If we want to extend the capabilities for types beyond the current double, int, string, and bool, pybind11 does provide a way to use templated functions under a unique name. e.g. if we want to provide support for a templated get, it would be easy (I think) to define a wrapper that would allow the python programmer to call get_float(key) and fetch a real32 value.

That example brings up a complexity here. There are at least two different type issues we should think about. First are simple size issues. e.g. in terms of concept, float, double, double double, real32, real64, and real128 (using C definitions) are the same thing. The same is true for int, long int, long, short, short int, int16, int32, and int64. bool is bool in any language as a concept, but how a bool is defined in a computer word structure is language dependent. Strings are similar. FORTRAN and C treat them differently, and C++ defines a string object which is more or less, but not exactly, the same as a char *. The point is, getting these low level entities ought to be seamless. I think we can take advantage of python there and do this at the db level in mongo. i.e. we might write a load_real, load_integer, load_boolean, and load_string in python that do all the checking against expected types, keys, and size specifications and hide that from the user. An approximate example prototype would have this signature: def load_real(dbhandle, key, mdef), where dbhandle is some object that interacts with mongo, key is the string key for the fetch, and mdef is a MetadataDefinitions object. load_real would be loose about size and simply return a standard double.
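
A rough sketch of what load_real could look like with pymongo, assuming a MetadataDefinitions-like object that knows the expected type for each key; the extra oid argument, the type(key) method, and the wf collection are all assumptions for illustration:

    def load_real(dbhandle, oid, key, mdef):
        """Fetch key from the document with _id oid and return it as a python float.

        dbhandle is assumed to be a pymongo Database, mdef a hypothetical
        MetadataDefinitions object whose type(key) method returns the expected
        python type for that key.
        """
        if mdef.type(key) is not float:
            raise TypeError(key + " is not defined as a real-valued attribute")
        doc = dbhandle.wf.find_one({"_id": oid}, {key: 1})
        if doc is None or key not in doc:
            raise KeyError("attribute " + key + " not found for this document")
        # be loose about size: ints and 32-bit floats are silently promoted to double
        return float(doc[key])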

For this kind of attribute I think we would need a similar save_real function that would similarly handle conversions. It would have to be a bit cautious about things like int truncation, but that is a detail.

The second kind of attribute we need to think about are the other "types" (aka objects) that MongoDB supports directly. For these I think we would want to be rigid about type in and out. It has been a couple of months since I reviewed the list of supported types, but it is fairly extensive. Anything not on that list would require serialization and a different kind of wrapper. These are likely best done as pure python entities anyway, as they do not map directly to the C++ interface.

That got MUCH longer than I had originally thought, but these ideas came to me as I was writing this response. In summary:

  1. Stock low level types (integers, real numbers, booleans, and strings) can be handled at the C++ interface level by extending the wrappers a bit in a fairly simple way. To keep things sane, I think we will want the main type enforcement to occur at the database load and save level.
  2. Objects should be treated differently. In all cases, the abstraction of the idea here is that anything with a common concept should be saved seamlessly when possible, but low level types have the secondary issue of size mismatches that may or may not be important.

wangyinz commented 4 years ago

I thought pybind11's support for templates is limited to functions with the template type as an argument instead of the return type. That's why we currently don't have the T Metadata::get(string key) implemented in the Python interface. I think a wrapper in C++ should be able to resolve this issue, and I was thinking of using the custom type caster to implement that. Since you know pybind11 much better, I guess you probably know some better way of doing that.

I agree the type support is going to be something we should carefully design, since C++, Python, and MongoDB all have different supported types. Since the core here is the interface with the database, I guess we should make our Python and C++ APIs adopt what MongoDB has. I think the low level types that MongoDB defines are 32-bit integer, 64-bit integer, 64-bit floating point, boolean, and UTF-8 string. The boolean and string have corresponding types in both Python and C++. The numbers are where the complexity comes in. For the Python API, this is still straightforward. (Note that there is a subtle difference between Python2 and Python3, but we only need to support Python3 in this project.)

In Python, the int is unlimited in length, and the float is always 64-bit, so we can easily map them to the 64-bit integer and 64-bit floating point in MongoDB. We can optionally implement a 32-bit integer in Python, but I feel that is not really necessary.

In C++, we have some more types to deal with. For floating point, it is still straightforward, since any 32-bit float automatically converts to a 64-bit double. We only need to make sure anything that is loaded from the database will not get truncated accidentally. The integer types are messier, since the length is platform dependent and the C++ standard only defines the lower limit of a type. However, we can pretty much assume that our code will only be running on a platform that supports 64-bit. Therefore, to ensure that the integers read from the database won't get truncated, we should use long int for 32-bit integers and long long int for 64-bit integers. It is actually OK to use int and long int instead on a 64-bit Unix system according to here, but that could get us into trouble in some edge cases. It is also worth noting that there is no unsigned integer type in either Python or MongoDB, so any use of that in C++ needs to be handled carefully.

pavlis commented 4 years ago

After reading what I wrote this morning and your response a few hours ago, I am wondering if we are worried about a problem that may not actually exist except on import and export. Generally size issues only arise with data read from a foreign source in some standardized format (e.g. SEGY is full of obnoxious, archaic 16 bit ints). I think your summary about internal usage is pretty solid and we should probably stick to it: (1) all ints are 64-bit, (2) all reals are double (real64), and (3) strings are UTF-8. The last might be more limiting than necessary, as there are lots of seamless wchar implementations these days. You would know better than me for sure, since I presume that is the standard for Chinese characters.

wangyinz commented 4 years ago

I agree, let's just stick to 64-bit int and double. I can definitely look up the wchar stuff, but don't really have to consider that for now I guess.