mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

Type definition interface #20

Open pavlis opened 4 years ago

pavlis commented 4 years ago

In working on the C++ wrapper problem I have learned that pybind11 creates a seamless binding between the C++ map container and the python dictionary. This is very good news as it makes the interface between the C++ code, python, and mongodb very clear. python can do the database transactions and easily get and put attributes to the C++ data objects.

The wrinkle in all that is that python is agnostic about type while C++ is "strongly typed". To make the interface work we need a way to quickly and efficiently sort that out. I propose the C++ class here be used for that purpose. Writing the wrappers for this to python will be easy now that I have this mastered:

///////////////////////////////////////////////////////
class MetadataDefinitions
{
public:
  /*! Default constructor - defaults to mspass namespace. */
  MetadataDefinitions();
  /*! \brief Construct using a specified attribute namespace.

  This constructor uses an alternative definition to the default namespace.
  It is an implementation detail (to be worked out) on where the data for
  any attribute namespace lives.

  \param mdname is a name tag used to define the attribute namespace.
  */
  MetadataDefinitions(const std::string mdname);
  /*! Standard copy constructor. */
  MetadataDefinitions(const MetadataDefinitions& parent);
  /*! Return a description of the concept this attribute defines.

  \param key is the name that defines the attribute of interest

  \return a string with a terse description of the concept this attribute defines.
  */
  std::string concept(const std::string key) const;
  /*! Get the type of an attribute.

  \param key is the name that defines the attribute of interest

  \return MDtype enum that can be used to establish the proper type.
  */
  mspass::MDtype type(const std::string key) const;
  /*! Basic putter.

  Use to add a new entry to the definitions.   Note that because this is
  implemented with C++ map containers if data for the defined key is
  already present it will be silently replaced.

  \param key is the key for indexing this attribute
  \param concept_ is brief description saved as the concept for this key
  \param mdt defines the type to be defined for this key.
  */
  void put(const std::string key, const std::string concept_, const MDtype mdt);
  /*! \brief Methods to handle aliases.

  Sometimes it is helpful to have alias keys to define a common concept.
  For instance, if an attribute is loaded from a relational db one might
  want to use alias names of the form table.attribute as an alias to attribute.
  has_alias should be called first to establish if a name has an alias.
  To get a list of aliases call the aliases method.
  */
  bool has_alias(const std::string key) const;
  std::list<std::string> aliases(const std::string key) const;
  /*! Add an alias for key.

  \param key is the main key for which an alias is to be defined
  \param aliasname is the alternative name to define.
  */
  int add_alias(const std::string key, const std::string aliasname);
  /*! Standard assignment operator. */
  MetadataDefinitions& operator=(const MetadataDefinitions& parent);
  /*! Accumulate additional definitions.   Appends other to current.
  Note that because we use the map container any duplicate keys in other
  will replace those in this.
  */
  MetadataDefinitions& operator+=(const MetadataDefinitions& other);
private:
  std::map<std::string,MDtype> tmap;
  std::map<std::string,std::string> cmap;
  std::multimap<std::string,std::string> aliasmap;
};

/////////////

Once wrapped a typical usage would be something like this:

import mspasspy as msp
md=msp.MetadataDefinitions()   # defaults to the mspass namespace
keys=['sta','chan','source_lat','source_lon']
d=msp.Seismogram()
load_metadata(d,keys,dbhandle,md)  # python procedure to load list through dbhandle

The load_metadata script would have lines like this (not necessarily valid python - I'm still a bit rusty), also omitting any error handlers:

t=md.type(key)
if t==MDreal:  # not the right syntax for a bound enum, but shows idea
   rval = (call db getter for a real)
   d.put(key,rval)
elif t==MDint:
   ival = (call db getter for an int)
   d.put(key,ival)
etc.

I know that is real pidgin python, but I hope you get the basic idea. We will need to handle type carefully to avoid chaos in the Metadata definitions. It is part of what I think would be called the schema for mongodb anyway.
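A minimal Python sketch of the type dispatch described above. Everything here is an assumption made for illustration: the `MDtype` enum values, the plain dicts standing in for the `Seismogram` metadata and for a document fetched through `dbhandle`, and the `type_of` callable standing in for `md.type`. None of it is the wrapped mspasspy API.

```python
from enum import Enum


class MDtype(Enum):
    """Hypothetical Python mirror of the C++ MDtype enum."""
    REAL = "real"
    INT = "int"
    STRING = "string"
    BOOL = "bool"


# Converter applied to the raw database value for each expected type.
CONVERTERS = {
    MDtype.REAL: float,
    MDtype.INT: int,
    MDtype.STRING: str,
    MDtype.BOOL: bool,
}


def load_metadata(d, keys, record, type_of):
    """Copy keys from record into d, coercing each to its defined type.

    d       -- dict standing in for a Seismogram's Metadata
    record  -- dict standing in for one document fetched through dbhandle
    type_of -- callable key -> MDtype, standing in for md.type
    """
    for key in keys:
        convert = CONVERTERS[type_of(key)]
        d[key] = convert(record[key])
    return d


definitions = {
    "sta": MDtype.STRING,
    "chan": MDtype.STRING,
    "source_lat": MDtype.REAL,
    "nsamp": MDtype.INT,
}
record = {"sta": "AAK", "chan": "BHZ", "source_lat": "42.5", "nsamp": 1000.0}
loaded = load_metadata({}, list(definitions), record, definitions.__getitem__)
```

A lookup table of converters replaces the if/else chain, so supporting another type means adding one table entry.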

Lot here - want to hear a reaction before I try to implement this. I had an AttributeMap object that turned out to not be right here. It had relational db concepts embedded in it.

wangyinz commented 4 years ago

The design is clearly presented and I think this should work as intended. The only thing I am not sure I understand is the handling of aliases. I see the aliasmap contains the main key as the key and the aliases as the values in a multimap. We can then easily retrieve a list of the aliases of a given key. What I don't know is how to do the reverse: given an alias, get the main key. It seems this will be a more frequent operation, so that the keys=['sta','chan','source_lat','source_lon'] in the second code snippet can be equivalent to something like keys=['wfdisc.sta','sitechan.chan','lat','lon']. Then, we probably also need to make sure that the aliases of different main keys won't collide. Not sure how to implement that efficiently in this design, but I do think it is doable.

Another improvement we should probably implement is default parameters for certain methods. For example, we can probably set the default MDtype to string in the put method. This follows the simple design principle of most Python libraries, and the string type should be sufficient to cover any user-defined keys. We should probably also enforce a minimum set of defaults following the attribute map file, and that should be the schema for MongoDB.

pavlis commented 4 years ago

Good point about aliases. I cloned that from AttributeMap from which this was derived. There it was a kind of antelope thing where one can be lazy and ask for something like sta if the only table in the view had a sta field (e.g. a subset of wfdisc).

My thought on the use of an alias here was to provide an alternative namespace for the same concept. e.g. someone might want a sac namespace that would allow people to use names wired into their brains. This should not be discouraged as the use I would see would typically be this pseudocode for python getters:

1) if(key defined)
   i) get type expected
   ii) call getter for expected type
   iii) return that value
2) else
   i) foreach name in aliaslist
      a) call this same getter recursively
      b) if return valid value return it break
   ii) end foreach
   iii) if a valid value was found return it

Used this way the alias list would be customized as a convenience for a user. e.g. for the SAC example one might alias KSTA to sta. A sanity check would be needed to make sure the expected type of KSTA and sta were the same. In fact, any constructor should ensure that any key in the alias table has an entry in the cross references. When I consider this, the multimap as defined above is not the right container. This needs some thought. Thanks for pointing out this weakness.
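The getter-with-fallback idea and the type sanity check might look like the sketch below. The dict layouts (`aliases` as main key to list of names, `types` as name to Python type) and both function names are hypothetical, and a flat loop over the alias list is used in place of the recursive call in the pseudocode, so that aliases pointing at each other cannot recurse forever.

```python
def get_with_aliases(record, key, aliases):
    """Return record[key], falling back to any alias of key.

    record  -- dict of attribute name -> value (stand-in for Metadata)
    aliases -- dict of main key -> list of alias names (hypothetical layout)
    """
    if key in record:
        return record[key]
    for name in aliases.get(key, []):
        if name in record:
            return record[name]
    raise KeyError(f"neither {key!r} nor any of its aliases is set")


def check_alias_types(aliases, types):
    """Sanity check: every alias must be defined with the type of its main key.

    Returns the list of (main key, alias) pairs that fail the check.
    """
    problems = []
    for key, names in aliases.items():
        for name in names:
            if types.get(name) != types.get(key):
                problems.append((key, name))
    return problems


types = {"sta": str, "KSTA": str, "nsamp": int, "NPTS": float}
aliases = {"sta": ["KSTA"], "nsamp": ["NPTS"]}
sac_header = {"KSTA": "AAK", "NPTS": 1000.0}

station = get_with_aliases(sac_header, "sta", aliases)
mismatched = check_alias_types(aliases, types)
```

Here `NPTS` is deliberately given a different type from `nsamp` to show what the check reports.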

pavlis commented 4 years ago

Ian, I have a related question that I think is much simpler: What format should we use for this? As a keyed container with two attributes (concept and type) we need something like json or an antelope pf structure. json would map directly into mongo and we could then easily store the data in the database. Otherwise, we'll need some data directory as I put into the repository.
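As a rough illustration of the JSON option, the sketch below parses a small definitions file into the two per-key maps and into one document per key. The layout (an object keyed by attribute name, each entry holding `concept` and `type`) and the type tags are assumptions, not a decided format.

```python
import json

# Hypothetical layout: one entry per key, each carrying concept and type.
DEFINITIONS_JSON = """
{
  "sta":        {"concept": "station code", "type": "string"},
  "source_lat": {"concept": "source latitude in degrees", "type": "real"},
  "nsamp":      {"concept": "number of samples", "type": "int"}
}
"""

definitions = json.loads(DEFINITIONS_JSON)

# Split into the two per-key maps the C++ class keeps (tmap and cmap).
tmap = {key: entry["type"] for key, entry in definitions.items()}
cmap = {key: entry["concept"] for key, entry in definitions.items()}

# One dict per key, shaped like a document that could be stored in MongoDB.
documents = [{"key": key, **entry} for key, entry in definitions.items()]
```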

pavlis commented 4 years ago

Another gap I just noticed reading back through what I just added. The class needs a method with this signature: bool defined(const string key);

Needed as a first, bombproof sanity check on the key.

A different but related problem will occur on db loads (getters). Loads should be more bombproof loading what is possible and telling the user what it didn't find. Not sure how to do that without more knowledge of mongodb - a decision to kick down the road.
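One way to picture a load that takes what it can and reports the rest is sketched below. The function name, the three-part return value, and the use of a plain dict for the database record are all hypothetical; `is_defined` stands in for the proposed `defined` method.

```python
def load_available(record, keys, is_defined):
    """Load what can be loaded and report the rest.

    Returns (loaded, undefined, missing):
      loaded    -- dict of key -> value for keys that were found
      undefined -- keys rejected by the is_defined test
      missing   -- defined keys that the record did not contain
    """
    loaded, undefined, missing = {}, [], []
    for key in keys:
        if not is_defined(key):
            undefined.append(key)
        elif key not in record:
            missing.append(key)
        else:
            loaded[key] = record[key]
    return loaded, undefined, missing


known = {"sta", "chan", "source_lat"}
record = {"sta": "AAK", "chan": "BHZ"}
loaded, undefined, missing = load_available(
    record, ["sta", "chan", "source_lat", "bogus"], known.__contains__
)
```

Separating undefined keys from missing values lets the caller tell a typo in the key list apart from a gap in the database.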

wangyinz commented 4 years ago

For the alias handling, I think the multimap is probably fine. We can just define another map that contains the key-value pair of all aliases and their main key for the reverse look up. So the pseudocode of getter becomes:

class MetadataDefinitions
{
  ...
  private:
    map<std::string,MDtype> tmap;
    map<std::string,string> cmap;
    multimap<std::string,std::string> aliasmap;
    map<std::string,string> reversemap;
};

if (key defined in aliasmap)
  get type expected from tmap
  call getter for expected type
  return that value
elseif (key defined in reversemap)
  get the main key from reversemap
  get type expected from tmap with the main key
  call getter for expected type
  return that value
else
  exception "key undefined"
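The reverse lookup can be sketched in Python as below. The class and method names are hypothetical mirrors of the C++ members. One deliberate difference from the pseudocode: the first test checks the type map rather than aliasmap, because a main key that has no aliases would never appear in aliasmap. The collision check is likewise an assumption about how alias clashes could be rejected.

```python
class AliasTable:
    """Hypothetical Python mirror of tmap, aliasmap and reversemap."""

    def __init__(self, types):
        self.tmap = dict(types)   # main key -> type name
        self.aliasmap = {}        # main key -> list of aliases
        self.reversemap = {}      # alias -> main key

    def add_alias(self, key, aliasname):
        """Register aliasname for key, refusing names that would collide."""
        if key not in self.tmap:
            raise KeyError(f"main key {key!r} is undefined")
        if aliasname in self.tmap:
            raise ValueError(f"{aliasname!r} is already a main key")
        owner = self.reversemap.get(aliasname, key)
        if owner != key:
            raise ValueError(f"{aliasname!r} is already an alias of {owner!r}")
        if aliasname not in self.reversemap:
            self.aliasmap.setdefault(key, []).append(aliasname)
            self.reversemap[aliasname] = key

    def unique_name(self, key):
        """Resolve a main key or an alias to its main key."""
        if key in self.tmap:
            return key
        if key in self.reversemap:
            return self.reversemap[key]
        raise KeyError(f"key {key!r} is undefined")

    def type(self, key):
        """Return the type name for a main key or any of its aliases."""
        return self.tmap[self.unique_name(key)]


table = AliasTable({"sta": "string", "source_lat": "real"})
table.add_alias("sta", "wfdisc.sta")
table.add_alias("source_lat", "lat")
resolved = [table.unique_name(k) for k in ["wfdisc.sta", "sta", "lat"]]
```

Keeping both maps in step inside `add_alias` means the forward and reverse views cannot drift apart.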

For the format, I definitely think we should go with something more modern. I think JSON is a great candidate, but it has the drawback of not being very friendly for humans to read or type. I think the alternative worth considering is YAML. It is a more powerful and cleaner format, and there are converters available to turn YAML into JSON, so we should be able to make MongoDB work with it.

pavlis commented 4 years ago

Makes sense. I'll work on it and check this in. Probably will use the pybind11 branch for now as building a wrapper for that class will be critical for it to be usable.

pavlis commented 4 years ago

Made some progress on this issue in the process of writing a function called obspy2mspass. That function is designed to convert data stored in an obspy Trace object to a C++ TimeSeries. At the same time I was running tests on some of the other wrappers. In the process I learned a few key things:

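  1. Any integer data is readily converted to a native python int using the int() function that is core in python. The int function converts any integer type input to a python int, which maps to an int64 (long) on the C++ side as long as the value fits; a Python 3 int itself has arbitrary precision.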
  2. Similarly the float() function takes any real type input and promotes it to a double (real64). I found this works even with things like x=np.float32(3.222) where np is Numpy.
  3. I found an interesting behavior for the comparable function str(). Oddly, you can enter something like this:
    x=2.334890 xs=str(x) and you get '2.33489' (the trailing zero is dropped). Similar for ints. Makes for a nice bombproof approach for storing string data.
  4. boolean is a bit weirder. There is a bool() function that has a C-like behavior. That is, the data it handles is purely numeric and anything that isn't 0 turns true, i.e. bool(0) returns false, bool(2.344) returns true, bool(4) returns true, and bool('xyz') returns false (not sure what would happen with bool('\0'), which I think is ascii 0).

Similarly I found bool returns or arguments passed to C are always treated as an int that behaves as noted (0 is false and anything else is true). Sensible since that is how C behaves, but it is a kind of weird disconnect with python, where a bool is a class 'bool'. Also python has a weird limitation: it will accept a=True but fail with a=true.

In short, the model that meshes well with python is that in Metadata (and the stats dict like entity in obspy) we should demand all int are int64 and all floats are double. strings and booleans are pretty simple if we just remember a few basic rules noted above.
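The rule in that summary could be applied with a small helper like the one sketched here. The function name `to_native` and the sample dict are hypothetical; the helper relies on the standard `numbers` abstract base classes, and `Fraction` is used only as a stand-in for a non-native real type.

```python
import numbers
from fractions import Fraction


def to_native(value):
    """Coerce a value to a native Python bool, int, float, or str."""
    if isinstance(value, bool):      # bool is a subclass of int; test it first
        return value
    if isinstance(value, numbers.Integral):
        return int(value)
    if isinstance(value, numbers.Real):
        return float(value)
    return str(value)


# Dict standing in for an obspy-style stats container.
stats = {"npts": 1000, "delta": Fraction(1, 40), "station": "AAK", "calibrated": True}
clean = {key: to_native(value) for key, value in stats.items()}
```

Testing for bool before the integer case keeps True and False from being flattened to 1 and 0.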

wangyinz commented 4 years ago

That's great! Just a note here bool('xyz') should return True. Actually, according to this tutorial, the only things that evaluate to False are empty values, such as (), [], {}, "", the number 0, the value None, and a class with a __len__ function that returns 0 or False.
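A quick check of those truthiness rules, using only built-in behavior. The `EmptyContainer` class is a made-up example of an object whose `__len__` returns 0. The case of '\0' raised earlier is included: it is a non-empty string, so it evaluates to True.

```python
class EmptyContainer:
    """An object whose __len__ returns 0 is falsy."""

    def __len__(self):
        return 0


falsy = [(), [], {}, "", 0, 0.0, None, EmptyContainer()]
truthy = ["xyz", "\0", "False", 2.344, 4, [0]]

all_falsy = all(bool(v) is False for v in falsy)
all_truthy = all(bool(v) is True for v in truthy)
```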