Open pavlis opened 4 years ago
The design is clearly presented and I think this should work as intended. The only thing I am not sure I understand is the handling of aliases. I see that aliasmap contains the main key as the key and the aliases as the values in a multimap, so we can easily retrieve the list of aliases of a given key. What I don't know is how to do the reverse: given an alias, get the main key. It seems this will be the more frequent operation, so that keys=['sta','chan','source_lat','source_lon'] in the second code snippet can be equivalent to something like keys=['wfdisc.sta','sitechan.chan','lat','lon']. We probably also need to make sure that the aliases of different main keys don't collide. I'm not sure how to implement that efficiently in this design, but I do think it is doable.
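To make the reverse lookup and the collision concern concrete, here is a minimal Python sketch. It assumes the alias table can be represented as a plain dict of main key to list of aliases; `build_reverse_map` is a hypothetical helper name, not part of any existing class, and the alias entries are just the names used in the example above.

```python
def build_reverse_map(aliasmap):
    """Invert {main_key: [aliases]} into {alias: main_key}, rejecting collisions."""
    reverse = {}
    for main_key, aliases in aliasmap.items():
        for alias in aliases:
            # An alias must not shadow another main key.
            if alias in aliasmap:
                raise ValueError(f"alias '{alias}' is also a main key")
            # An alias must not be claimed by two different main keys.
            owner = reverse.get(alias)
            if owner is not None and owner != main_key:
                raise ValueError(
                    f"alias '{alias}' is claimed by both '{owner}' and '{main_key}'"
                )
            reverse[alias] = main_key
    return reverse


# Hypothetical alias table built from the key names in this discussion.
aliasmap = {
    "sta": ["wfdisc.sta"],
    "chan": ["sitechan.chan"],
    "source_lat": ["lat"],
    "source_lon": ["lon"],
}
reverse = build_reverse_map(aliasmap)

# Translate a list of aliases back to main keys; unknown names pass through.
keys = [reverse.get(k, k) for k in ["wfdisc.sta", "sitechan.chan", "lat", "lon"]]
```

Because the check runs once when the table is built, each later lookup is a single dict access.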
Another improvement we should probably implement is to have some sort of default parameter for certain methods. One example is that we could set the default MDtype to string in the put method. This follows the simplicity-first design principle of most Python libraries, and the string type should be sufficient to cover any user-defined keys. We should probably also have a minimum set of defaults enforced following the attribute map file, and that should be the schema for MongoDB.
Good point about aliases. I cloned that from AttributeMap from which this was derived. There it was a kind of antelope thing where one can be lazy and ask for something like sta if the only table in the view had a sta field (e.g. a subset of wfdisc).
My thought on the use of an alias here was to provide an alternative namespace for the same concept, e.g. someone might want a SAC namespace that would allow people to use names wired into their brains. This should not be discouraged, as the use I would see would typically be this pseudocode for python getters:
1) if (key defined)
   i) get type expected
   ii) call getter for expected type
   iii) return that value
2) else
   i) foreach name in aliaslist
      a) call this same getter recursively
      b) if return valid value, return it (break)
   ii) end foreach
   iii) if a valid value was found return it
Used this way the alias list would be customized as a convenience for a user, e.g. for the SAC example one might alias KSTA to sta. A sanity check would be needed to make sure the expected types of KSTA and sta are the same. In fact, any constructor should ensure that any key in the alias table has an entry in the cross references. When I consider this, the multimap as defined above is not the right container. This needs some thought. Thanks for pointing out this weakness.
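A rough Python sketch of the getter pseudocode and the type sanity check described above. It assumes Metadata can be stood in for by a plain dict, that Python types stand in for MDtype, and that aliases do not form cycles; `check_alias_types` and `get_with_aliases` are hypothetical names.

```python
def check_alias_types(tmap, aliasmap):
    """Return a list of problems; an empty list means the alias table is consistent."""
    problems = []
    for key, aliases in aliasmap.items():
        if key not in tmap:
            problems.append(f"'{key}' has aliases but no type entry")
            continue
        for alias in aliases:
            if alias not in tmap:
                problems.append(f"alias '{alias}' has no type entry")
            elif tmap[alias] is not tmap[key]:
                problems.append(f"alias '{alias}' and '{key}' have different types")
    return problems


def get_with_aliases(md, key, tmap, aliasmap):
    """Fetch key from md; if absent, try each alias of key with the same getter."""
    if key in md:
        value = md[key]
        expected = tmap[key]
        if not isinstance(value, expected):
            raise TypeError(f"'{key}' should be {expected.__name__}")
        return value
    for alias in aliasmap.get(key, []):
        try:
            return get_with_aliases(md, alias, tmap, aliasmap)
        except KeyError:
            continue  # this alias is not present either; try the next one
    raise KeyError(key)


# The SAC example: data stored under KSTA, requested as sta.
tmap = {"sta": str, "KSTA": str}
aliasmap = {"sta": ["KSTA"]}
md = {"KSTA": "AAK"}

problems = check_alias_types(tmap, aliasmap)
station = get_with_aliases(md, "sta", tmap, aliasmap)
```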
Ian, I have a related question that I think is much simpler: What format should we use for this? As a keyed container with two attributes (concept and type) we need something like json or an antelope pf structure. json would map directly into mongo and we could then easily store the data in the database. Otherwise, we'll need some data directory as I put into the repository.
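For illustration only, this is roughly what a JSON form of the definitions could look like and how Python might read it into the two maps. The key names, concept strings, and type labels are made up for the example, and `load_definitions` is a hypothetical helper; none of this is a settled format.

```python
import json

# Hypothetical definitions file content: one entry per key, two attributes each.
DEFINITIONS_JSON = """
{
  "sta":        {"type": "string", "concept": "station code"},
  "source_lat": {"type": "double", "concept": "source latitude in degrees"},
  "nsamp":      {"type": "int64",  "concept": "number of samples"}
}
"""

# Assumed mapping from type labels in the file to Python types.
TYPE_NAMES = {"string": str, "double": float, "int64": int, "boolean": bool}


def load_definitions(text):
    """Parse the JSON text into a type map and a concept map."""
    raw = json.loads(text)
    tmap, cmap = {}, {}
    for key, entry in raw.items():
        tmap[key] = TYPE_NAMES[entry["type"]]
        cmap[key] = entry["concept"]
    return tmap, cmap


tmap, cmap = load_definitions(DEFINITIONS_JSON)
```

The parsed structure is an ordinary dict, so the same content could be stored as a MongoDB document without further conversion.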
Another gap I just noticed reading back through what I just added. The class needs a method with this signature: bool defined(const string key);
It is needed as a first, bombproof sanity check on the key.
A different but related problem will occur on db loads (getters). Loads should be more bombproof, loading what is possible and telling the user what was not found. Not sure how to do that without more knowledge of mongodb, so this is a decision to kick down the road.
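One possible shape for such a tolerant load, sketched in Python. It assumes the database record arrives as a plain dict (which is how a MongoDB document typically appears in Python) and uses a Python stand-in for the `defined` test; `load_available` is a hypothetical name.

```python
def defined(tmap, key):
    """Python stand-in for the proposed defined() test on the definitions."""
    return key in tmap


def load_available(doc, wanted, tmap):
    """Load what is possible from doc and report what could not be loaded."""
    loaded, not_found, undefined = {}, [], []
    for key in wanted:
        if not defined(tmap, key):
            undefined.append(key)   # key is not known to the definitions at all
        elif key in doc:
            loaded[key] = doc[key]
        else:
            not_found.append(key)   # valid key, but absent from this record
    return loaded, not_found, undefined


tmap = {"sta": str, "chan": str, "source_lat": float}
doc = {"sta": "AAK", "source_lat": 42.6}  # stand-in for a database record
loaded, not_found, undefined = load_available(
    doc, ["sta", "chan", "source_lat", "bogus"], tmap
)
```

Returning the two report lists instead of raising lets the caller decide whether a missing attribute is fatal.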
For the alias handling, I think the multimap is probably fine. We can just define another map that contains the key-value pair of all aliases and their main key for the reverse look up. So the pseudocode of getter becomes:
class MetadataDefinitions
{
    ...
private:
    std::map<std::string,MDtype> tmap;
    std::map<std::string,std::string> cmap;
    std::multimap<std::string,std::string> aliasmap;
    std::map<std::string,std::string> reversemap;
};
if (key defined in aliasmap)
    get type expected from tmap
    call getter for expected type
    return that value
else if (key defined in reversemap)
    get the main key from reversemap
    get type expected from tmap with the main key
    call getter for expected type
    return that value
else
    exception "key undefined"
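The lookup above could be sketched in Python as follows, with Python types standing in for MDtype and dicts standing in for the C++ maps. One assumption differs from the pseudocode: the first branch tests the key against the type map rather than aliasmap, since a main key with no aliases would not appear in aliasmap. The method names `unique_name` and `expected_type` are hypothetical.

```python
class MetadataDefinitions:
    """Python stand-in for the C++ class, holding only what the lookup needs."""

    def __init__(self, tmap, aliasmap):
        self.tmap = dict(tmap)
        self.aliasmap = {k: list(v) for k, v in aliasmap.items()}
        # alias -> main key, for the reverse lookup
        self.reversemap = {
            alias: key for key, aliases in self.aliasmap.items() for alias in aliases
        }

    def unique_name(self, key):
        """Resolve a main key or an alias to the main key."""
        if key in self.tmap:
            return key
        if key in self.reversemap:
            return self.reversemap[key]
        raise KeyError(f"key undefined: {key}")

    def expected_type(self, key):
        """Return the type registered for a main key or any of its aliases."""
        return self.tmap[self.unique_name(key)]


defs = MetadataDefinitions(
    tmap={"sta": str, "source_lat": float},
    aliasmap={"sta": ["wfdisc.sta"], "source_lat": ["lat"]},
)
lat_type = defs.expected_type("lat")
sta_name = defs.unique_name("wfdisc.sta")
```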
For the format, I definitely think we should go with something more modern. JSON is a great candidate, but it has the drawback of not being very friendly for humans to read or type. I think the alternative worth considering is YAML. It is a more powerful and cleaner format, and there are converters available to turn YAML into JSON, so we should be able to make MongoDB work with it.
Makes sense. I'll work on it and check this in. Probably will use the pybind11 branch for now as building a wrapper for that class will be critical for it to be usable.
Made some progress on this issue in the process of writing a function called obspy2mspass. That function is designed to convert data stored in an obspy Trace object to a C++ TimeSeries. At the same time I was running tests on some of the other wrappers. In the process I learned about a few key issues:
Similarly, I found that bool returns or arguments passed to C are always treated as int and behave as noted (0 is false and anything else is true). Sensible, since that is how C behaves, but it is a kind of weird disconnect with python, where a bool is a class 'bool'. Also, python has a weird limitation: it will accept a=True but fail with a=true.
In short, the model that meshes well with python is that in Metadata (and the stats dict-like entity in obspy) we should demand that all ints are int64 and all floats are double. Strings and booleans are pretty simple if we just remember the few basic rules noted above.
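A small Python sketch of how a value could be classified into those four types before it is handed to C++. The type labels and the name `md_type` are assumptions for illustration. Two details matter: bool has to be tested before int because bool is a subclass of int in Python, and Python ints are unbounded, so the int64 range needs an explicit check.

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def md_type(value):
    """Classify a Python value as 'boolean', 'int64', 'double' or 'string'."""
    # bool must come first: isinstance(True, int) is True in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        if not INT64_MIN <= value <= INT64_MAX:
            raise OverflowError(f"{value} does not fit in int64")
        return "int64"
    if isinstance(value, float):
        return "double"  # a CPython float is stored as a C double
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported type: {type(value).__name__}")


kinds = [md_type(v) for v in (True, 42, 3.5, "AAK")]
```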
That's great! Just a note here: bool('xyz') should return True. Actually, according to this tutorial, the only things that evaluate to False are empty values, such as (), [], {}, "", the number 0, the value None, and a class with a __len__ function that returns 0 or False.
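A quick way to check these rules in an interpreter. The class `Empty` is made up for the example.

```python
class Empty:
    """A container-like class whose length is always zero."""

    def __len__(self):
        return 0


falsy_values = [(), [], {}, "", 0, None, Empty()]
all_falsy = not any(bool(v) for v in falsy_values)

# Any non-empty string is truthy, including the text "False".
nonempty_string = bool("xyz")
string_false = bool("False")
```

The last line is the one to watch when booleans are read from text: the string "False" is still a non-empty string, so it evaluates to True.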
In working on the C++ wrapper problem I have learned that pybind11 creates a seamless binding between the C++ map container and the python dictionary. This is very good news as it makes the interface between the C++ code, python, and mongodb very clear. python can do the database transactions and easily get and put attributes to the C++ data objects.
The wrinkle in all that is that python is agnostic about type while C++ is "strongly typed". To make the interface work we need a way to quickly and efficiently sort that out. I propose the C++ class here be used for that purpose. Writing the wrappers for this to python will be easy now that I have this mastered:
Once wrapped a typical usage would be something like this:
The load_metadata script would have lines like this (not necessarily valid python, as I'm still a bit rusty), also omitting any error handlers
I know that is real pidgin python, but I hope you get the basic idea. We will need to handle type carefully to avoid chaos in the Metadata definitions. It is part of what I think would be called the schema for mongodb anyway.
Lot here - want to hear a reaction before I try to implement this. I had an AttributeMap object that turned out to not be right here. It had relational db concepts embedded in it.