Consistency on the use of "id"s

mspass-team / mspass

Massive Parallel Analysis System for Seismologists

https://mspass.org

BSD 3-Clause "New" or "Revised" License

31 stars 12 forks source link

Consistency on the use of "id"s #55

Open wangyinz opened 4 years ago

wangyinz commented 4 years ago

When re-organizing the code, I got a chance to carefully look at how each functions are implemented, and I noticed that the use of the concept "id" may be a bit confusing. We actually have a mix of two different kinds of "id"s being used. The first one is the ObjectId which is the default index of any documents in the MongoDB. Then, we also inherited the "id" concept from Antelope (or CSS3.0 to be more general), which is an integer to index entries in a table, and the largest number usually matches the total number of entries in the table.

For the wfid, we completely ignored the Antelope convention and used ObjectId (with the default name _id in the wf collection) as the only index. They were referenced in the code as well as the elog collection with the key wfid_string.

For job_id, integer is used due to the ErrorLogger implemented in C++ knowing nothing about the ObjectId type. The db._save_elog method will return a list of ObjectId after error log being inserted into database. However, these ObjectId are not currently used anywhere in the code.

For site_id and source_id, they are both integers directly created by importing from Antelope into the wf collection. After normalized and created the site and source collections, they are retained in the new collections with the keys of oid_site and source_oidstring referring to the objectid of the new documents added to the wf collection.

There may be more of it, but I think I have all the use of IDs covered here. I think we need to standardize the use of all these IDs. For wf_id, site_id and source_id, we should probably use them for the ObjectId only. The imported site_id and source_id from Antelope should be aliased to something like site_id_antelope and source_id_antelope. Then, when using ObjectIds in the code, we should always convert them into string type so that we don't need to use the confusing keys like source_oidstring.

The job_id seems to be a problem as that is implemented in C++ and we need to use the integer type to make it compatible. Maybe we should just left it as is as an exception. What do you think?

pavlis commented 4 years ago

REALLY good point and the right time to address it! It will already be a chore to clean this up, but not that difficult with search tools.

An issue you didn't point out that we should consider, however, is this. ObjectId as defined by MongoDB should probably be viewed as an "implementation detail". Given it is impossible to predict fully the future development path of MongoDB or database systems in general, we should think some about the long term stability of this software. An historical example is a system developed by Crotwell at South Carolina the was a predecessor or web services. They based the service on a technology that withered on the vine called CORBA. When CORBA died so did most of that development, although I think a strong case can be made the experience helped make IRIS's web services better than it might have otherwise. THE POINT, however, is strive to make the design center on concept where alternative technologies could be plugged in replace a potentially unstable component. There are two of them here: MongoDB and SPARK. The container technology is largely orthogonal to anything we do. Others like cmake create major headaches if they went away, but would be more easy to isolate. I'm sure the point is understood so no reason to continue the list of dependencies.

Because of the above I would urge we not depend heavily on anything related to ObjectIds, or at least hid any such detail behind the user interface. For example, for the site and source "normalize" python functions ObjectIds make sense to use as the link between the wf and site or source collection. The entire link/join operation is mongo specific, so using objectid is sensible for efficiency alone. On the other hand, having alternative indices to link source and site collections to other data are more portable with something like an integer key we could call something like site_id_antelope.

So, I think here are some things we need to follow up on this important insight:

We need to carefully sort out what things we call "ids" are to be used in what context and try to make sure we don't have blinders on from limited experience. This may be one of our first topics where we need help from a broader community. This is timely for my current development of web service imports using obspy as I need to regularize how ids are handled for the site and source collections anyway. The previous "normalization" routines were driven by the challenge of loading data assembled through antelope. The model with web services is quite different so they need to be unified anyway. So, let me take that on and I will add to this discussion as I think this through.
I think we should draft a page or section for the documentation stating in clear terms how we utilize id attributes of all types. That starts with what an id is, what its purpose is, the numerical types that can represent it, and where we use them and where we avoid them. The later is important because a hidden axiom of Antelope I know only because I was so connected with the group when it was originally developed is this: Dan Quinlan hated the idea of integer ids and designed Datascope (later a component of Antelope) to minimize their use. He was right in that maintaining integer ids for the kind of transactions that happen building earthquake catalogs make integer ids very hard to maintain at times. I think in the MsPASS context that is not true. If the primary objective is assembling a clean "data set" that can be processed in parallel, all the ids live in one place and are maintained by a central workflow manager (I think our model audience is a single researcher or group).
When we agree on our own design principles about what ids are for and how they should be used, we should systematically set up the names to make the differences in usage clear. That is straightforward, but tedious. Does, however, need to be done soon as it will only get more challenging as the namespace grows.

I just had the final thought that this would also be an important time for you to review the entire namespace in mspass.yaml for consistency of naming. As the one who put the original list together it is hard to have an independent perspective. We might also want to ask Shawn to look at this list and suggest changes.

wangyinz commented 4 years ago

I absolutely agree with all these points. Hiding the implementation details and separate it from the actual concept is really the key here. So the concept here is very clear: a unique identifier for cross-referencing among the metadata. Then, the implementation detail can actually be further considered as two different questions: what it is, and what it appears to the user is. So the design needs to make the second one works no matter what the answer to the first question is.

We now have two answers to the first question: a numeric id (i.e. Antelope), or the ObjectId (i.e. MongoDB). Actually, I would argue that the ObjectId should be treated as an implementation of a more general case of a string id since it can be easily converted. Also, similar concept of id is implemented in many other NoSQL databases. The advantage of numeric id is probably the convenience, and the advantage of a string id is that it is easier to maintain its uniqueness. I suggested earlier to use string all the time in our Python code when handling ObjectIds. Considering that it is actually straightforward to store a numeric id in string, we might just want to make all the use of ids with the type of string. When writing to MongoDB, we will need to catch the case where the id is not a valid ObjectId and regenerate one. This is probably OK since if we write to Antelope, we will likely need to regenerate the numeric ids, too, unless it is a empty database. Definitely need to think carefully how to actually implement this.

wangyinz commented 4 years ago

I was gonna push this topics off till I finish the unit testing setup for the Python code, then I realize this schema change will actually affect a lot of the code we have, so I decided to review the schema first and also try to come up a consistent strategy on the use of various ids.

This is the modification I would suggest: https://github.com/wangyinz/mspass/pull/56. I basically followed the idea of using string as the type for those ids. The ids include: site_id, chan_id, source_id, wf_id, and gridfs_wf_id. Note that the naming follows "collection name" + "_id" pattern. The only exception is the chan_id. It would have been called sitechan_id, but that name would be really misleading. Or, we could rename the sitechan collection to channel collection, which would then be used as a combination of the sitechan and sensor tables (and probably instrument table too) in terms of CSS3.0 schema. Not sure how much we want our schema to be compatible with CSS3.0, but I do think it is not a bad idea to simplify the schema.

The most important thing to discuss here is that I chose the string id as oppose to the numeric one. I prefers this option based on the following reasons.

The numeric id idea is really inherited from Antelope (or CSS3.0), however, it appears to me, because it is designed as a relational database, the use of "id" in that schema is actually avoided when possible. Not sure if that is true, but that is my interpretation on the mixing use of sta, chan, and a bunch of ids as the primary keys. The issue here is that the index of some tables are ids but others are actual keys like sta, chan. ID is not something essential to a relational database. However, in MongoDB (or most of the NoSQL database), ID is actually more important, because every record has a implicit id assigned, and it is used as the default index. Even if we create a different index, the default _id would still be there, and the new index will likely be less efficient or even redundant. That's why I think we better make use of these ids whenever possible.

Also, I don't think going this route will make MsPASS more incompatible with CSS3.0 standard. I noticed that in our original schema, we already introduced similar incompatibility. The first one is the use of wfid, where we already introduced the concept of ObjectId. Then, we introduced the site_id which is not originally defined in CSS3.0. We would have had to rearrange the database when importing and exporting to Antelope anyway.

I also think it is reasonable to make MongoDB (or any similar NoSQL database) as the core component of MsPASS, at least at this stage of the project. So the use of string as id should work the best. Also, even if we want to swap MongoDB with some other SQL database, I think the string id will still be valid. Note that id is just a unique identifier, and I am pretty sure most of the popular SQL database will have no problem indexing with strings. Actually, we will only be incompatible with database of predefined schema such as Antelope. Since I am pretty sure we won't be using Antelope as the underlying database engine in this project, this is actually not a problem at all.

Based on all these, I think it should be pretty save to use string type for all the ids. Please let me know what do you think.

btw, I retained evid and orid as integer types in the schema. This provides some compatibility when importing data from Antelope namespace, but this may be redundant in a core schema definition. Maybe we should have those in some schema extensions.

pavlis commented 4 years ago

Well stated. I am glad you took control of this and set this down as a clear guideline. An important next step is to modify the documentation file on the schema adding a (much shorter) description of what an id is and how they relate to the database structure. A few specifics:

I have no issue at all about making the keys strings. I presume that does not, however, mean all keys are ObjectIds. Oids are not very readable and there are good reasons to allow independent keys to oids to avoid obnoxious problems like Antelope's api being inconsistent about duplicate key problems. i.e. in mongo _id is always a unique key, but keys like source_id should be something more readable. Linking keys like gridfs_wf_id are appropriately expressed as ObjectId strings, but things humans need to reference (e.g. source_id) would be cleaner as readable numbers. The good thing about strings is keys like source_id can be something even more readable. e.g. people could use some naming convention like source_id="2015:255:10:11" or "nuke1" and "nuke22". So, strongly suggest the ids that need to be ObjectID strings be tagged somehow in the "concept" field of the yaml file.
Let me clarify that Antelope's use of things like sta:chan:time as keys is not really part of the CSS3.0 schema. The two Dans that were founders of BRTT (Dan Quinlan and Danny Harvey) had a thing against integer ids. The reason is the very thing I said above that an integer id is not something most humans can deal with easily and keep track of cleanly. Computers like integer ids, but people can't maintain them without a robot to assist them. In reality CSS3.0 implementations with a system like Oracle would commonly use integer keys like wfid as the primary key and build somehing like sta:chan:time as a secondary hash index.
I don't think we have anything linked to a channel oriented collection at this time and I don't think we want it. I think the site collection will be sufficient because of the way quakeml is set up and utilized by obspy. Look at this page and you see that their Inventory object is the top level of a heirarchy: Inventory->Network->Station->Channel. The way I implemented the translation of this to mspass is I pretty much ignore Inventory and Network but map each Station object into a document on MongoDB. The documents should have a unique combination of network, station, location, and a time period. As usual they are actually keyed by ObjectId. The current implementation uses an integer site_id, but based on your guidance I think we can drop that and use site_id as as objectid string used only in the wf collection as a cross reference. A channel collection in this context would be one more nasty thing to maintain that is not necessary. I am 100% sure that to get channel information the best solution (at this point) is to use pickle to create Station objects from the desired document and extract the required channel metadata with obspy methods like get_orientation and response functions. This also maps very nicely to treating 3C data as atomic objects because 3C objects can normally ignore channel information.
There is no problem with having names like evid and orid as integers flying around as alternate ways to index data. Something we need to constantly advertise and make sure retain as a capability is to note make the metadata namespace completely rigid.

wangyinz commented 4 years ago

That's exactly the reason I made them string instead of ObjectID - a string id could be anything, including integers, so we have the best flexibility. I am not exactly sure how to implement the API to distinguish the difference you brought up in, for example, gridfs_wf_id and source_id. Well, first of all, my answer to the question that which of the ids need to be ObjectID is actually that it depends. MongoDB doesn't care whether those ids are ObjectID or not as it always uses _id internally. To my understanding, MongoDB uses ObjectID to ensure uniqueness of the documents, so the content of any of those id fields does not matter until we are putting the data into a database. Only by then, we will need to ensure the ids are unique. If there is no database and the size of the data is small enough, a customized naming scheme (for all those ids) would work perfectly for data management (e.g. that's how people uses SAC filename). Database is the tool to help manage large data set where arbitrary naming scheme no longer works. In this case, the uniqueness of the ids is actually ensured and maintained by the database instead of human brain, and it makes more sense to use an unreadable ObjectID there. By making the ids string, our system can actually adapt to both scenario just fine. The question here is actually not which id needs to be ObjectID but when do we need to use ObjectID, and how to ensure the uniqueness. We will need to think about it more carefully. For now, my thought is that we might need to come up with a mechanism to check the uniqueness and update related documents when a new record is written to the database. I do need to formulate the details, which I don't really know yet.
Thanks for that background. No wonder it always feels inconsistent when using Antelope's schema. It surely is convenient for what is used for though.
I completely agree. It appears to me that we need to remove the sitechan collection used here: https://github.com/wangyinz/mspass/blob/44ec3546e76f56cfeffdc812943dfbe5e16e4bcd/data/yaml/mspass.yaml#L91-L123, and probably put those keys in a sub-document within a site document if needed. If we are going to pickle the Station object, it seems unnecessary to keep those channel information in the database, and we won't need the chan_id anymore.
Yes, we can keep those names. I am just saying that it is more appropriate to be in a schema extension. Remember we discussed before about this idea of providing a mechanism for users to create yaml files as schema extension. Since we are not implementing that feature yet, I am not going to simplify the core schema now - not something of great importance now.

wangyinz commented 4 years ago

Just found another issue with current schema. We use source_id as the unique id that replaces the concept of evid in antelope. However, there is also the orid, which we retained in our schema definition for compatibility. This is confusing in two different ways. First, we need to make sure our users understand the difference between evid (source_id) and orid. Then, because an evid can have multiple orid, isn't that gonna be a problem if a dataset imported has multiple orid for a evid? Because they share the same evid, which has to be unique in our schema, we need to figure out a way to handle such scenario. I think in our schema design, we implicitly assumed that a prefor is chosen for all events. We probably, at least, need to make that clear to the user.

pavlis commented 4 years ago

Good point.

First, I don't think we should try to support the idea of multiple orids for a common evid. That is a detail network operators need to be concerned about, but it is a detail handled well by Antelope and any system using the CSS3.0 schema. I suggest we take the attitude that the user should treat that as a data assembly problem and assure the dataset to be handled is "clean" in the sense that there is one and only one location estimate associated with a given waveform that is to be processed. That way our schema can ignore the whole idea of prefor and just allow evid and orid as potentially useful integer tags.

That is my opinion anyway.

wangyinz commented 4 years ago

I totally agree. Since we are using source_id as the unique id for each event, I think that assumption is baked in our schema design anyway. I guess that is another reason to have some of those keys put in a schema extension as oppose to what should be considered the core schema. Both evid and orid belongs there.