Open wangyinz opened 4 years ago
REALLY good point and the right time to address it! It will already be a chore to clean this up, but not that difficult with search tools.
An issue you didn't point out that we should consider, however, is this. ObjectId as defined by MongoDB should probably be viewed as an "implementation detail". Given that it is impossible to fully predict the future development path of MongoDB or database systems in general, we should think about the long-term stability of this software. A historical example is a system developed by Crotwell at South Carolina that was a predecessor of web services. They based the service on CORBA, a technology that withered on the vine. When CORBA died so did most of that development, although I think a strong case can be made that the experience helped make IRIS's web services better than they might otherwise have been. THE POINT, however, is that we should strive to center the design on concepts, so that an alternative technology could be plugged in to replace a potentially unstable component. There are two of them here: MongoDB and Spark. The container technology is largely orthogonal to anything we do. Others, like cmake, would create major headaches if they went away, but would be easier to isolate. I'm sure the point is understood, so there is no reason to continue the list of dependencies.
Because of the above I would urge that we not depend heavily on anything related to ObjectIds, or at least hide any such detail behind the user interface. For example, for the site and source "normalize" python functions, ObjectIds make sense as the link between the wf and the site or source collection. The entire link/join operation is mongo specific, so using the ObjectId is sensible for efficiency alone. On the other hand, an alternative index linking the source and site collections to other data is more portable if it is something like an integer key, which we could call something like site_id_antelope.
So, here are some things I think we need to follow up on from this important insight:
I just had the final thought that this would also be an important time for you to review the entire namespace in mspass.yaml for consistency of naming. As the one who put the original list together it is hard to have an independent perspective. We might also want to ask Shawn to look at this list and suggest changes.
I absolutely agree with all these points. Hiding the implementation details and separating them from the actual concept is really the key here. The concept is very clear: a unique identifier for cross-referencing among the metadata. The implementation detail can then be split into two different questions: what the id is, and how it appears to the user. The design needs to make the second one work no matter what the answer to the first question is.
We now have two answers to the first question: a numeric id (as in Antelope) or the ObjectId (as in MongoDB). Actually, I would argue that the ObjectId should be treated as an implementation of the more general case of a string id, since it can easily be converted to a string. A similar concept of id is also implemented in many other NoSQL databases. The advantage of a numeric id is probably convenience, and the advantage of a string id is that it is easier to maintain its uniqueness. I suggested earlier that we use strings all the time in our Python code when handling ObjectIds. Considering that it is straightforward to store a numeric id in a string, we might just want to make all ids strings. When writing to MongoDB, we will need to catch the case where the id is not a valid ObjectId and regenerate one. This is probably OK since, if we write to Antelope, we will likely need to regenerate the numeric ids too, unless it is an empty database. We definitely need to think carefully about how to actually implement this.
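As a rough sketch of the "catch and regenerate" step, assuming ids are always handled as strings: the helper names below (looks_like_objectid, new_placeholder_id, id_for_mongodb) are made up for illustration and are not part of MsPASS. To stay runnable without pymongo, the sketch only checks the 24-hex-character form and fakes the new id; real code would presumably use bson.ObjectId.is_valid and str(ObjectId()) instead.

```python
import re
import secrets

# A MongoDB ObjectId renders as 24 hexadecimal characters (12 bytes).
_OBJECTID_HEX = re.compile(r"[0-9a-fA-F]{24}")


def looks_like_objectid(id_string):
    """Return True if id_string has the 24-hex-character ObjectId form."""
    return isinstance(id_string, str) and _OBJECTID_HEX.fullmatch(id_string) is not None


def new_placeholder_id():
    """Stand-in for str(bson.ObjectId()): 12 random bytes as hex.

    A real ObjectId also encodes a timestamp and a counter; this only
    reproduces the string format so the sketch runs without pymongo.
    """
    return secrets.token_hex(12)


def id_for_mongodb(id_string):
    """Hypothetical helper: pick the id to use when saving to MongoDB.

    Returns (id_to_store, original_id). original_id is None when the
    incoming id was already usable; otherwise it is the replaced value, so
    the caller can keep it (e.g. under an alias key) and fix references.
    """
    if looks_like_objectid(id_string):
        return id_string.lower(), None
    return new_placeholder_id(), id_string


kept, old = id_for_mongodb("0123456789abcdef01234567")
regenerated, replaced = id_for_mongodb("1234")  # Antelope-style numeric id stored as a string
```

Returning the replaced value matters because any document that referenced the old id has to be updated to the regenerated one, which is the part that still needs careful design.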
I was gonna push this topic off till I finished the unit testing setup for the Python code, but then I realized this schema change will actually affect a lot of the code we have, so I decided to review the schema first and also try to come up with a consistent strategy for the use of the various ids.
This is the modification I would suggest: https://github.com/wangyinz/mspass/pull/56. I basically followed the idea of using string as the type for those ids. The ids include: site_id, chan_id, source_id, wf_id, and gridfs_wf_id. Note that the naming follows "collection name" + "_id" pattern. The only exception is the chan_id. It would have been called sitechan_id, but that name would be really misleading. Or, we could rename the sitechan collection to channel collection, which would then be used as a combination of the sitechan and sensor tables (and probably instrument table too) in terms of CSS3.0 schema. Not sure how much we want our schema to be compatible with CSS3.0, but I do think it is not a bad idea to simplify the schema.
The most important thing to discuss here is that I chose the string id as opposed to the numeric one. I prefer this option for the following reasons.
The numeric id idea is really inherited from Antelope (or CSS3.0). However, it appears to me that, because it is designed as a relational database, the use of an "id" in that schema is actually avoided when possible. I am not sure that is true, but that is my interpretation of the mixed use of sta, chan, and a bunch of ids as the primary keys. The issue here is that some tables are indexed by ids while others are indexed by actual keys like sta and chan. An id is not essential to a relational database. In MongoDB (or most NoSQL databases), however, the id is more important, because every record has an implicit id assigned, and it is used as the default index. Even if we create a different index, the default _id will still be there, and the new index will likely be less efficient or even redundant. That's why I think we had better make use of these ids whenever possible.
Also, I don't think going this route will make MsPASS more incompatible with the CSS3.0 standard. I noticed that our original schema already introduced similar incompatibilities. The first one is the use of wfid, where we already introduced the concept of ObjectId. Then, we introduced the site_id, which is not originally defined in CSS3.0. We would have had to rearrange the database when importing from and exporting to Antelope anyway.
I also think it is reasonable to treat MongoDB (or any similar NoSQL database) as the core component of MsPASS, at least at this stage of the project, so the use of strings as ids should work best. Also, even if we want to swap MongoDB for some SQL database, I think the string id will still be valid. Note that an id is just a unique identifier, and I am pretty sure most of the popular SQL databases will have no problem indexing on strings. Actually, we will only be incompatible with databases that have a predefined schema, such as Antelope. Since I am pretty sure we won't be using Antelope as the underlying database engine in this project, this is not a problem at all.
Based on all of this, I think it should be pretty safe to use the string type for all the ids. Please let me know what you think.
btw, I retained evid and orid as integer types in the schema. This provides some compatibility when importing data from Antelope namespace, but this may be redundant in a core schema definition. Maybe we should have those in some schema extensions.
Well stated. I am glad you took control of this and set this down as a clear guideline. An important next step is to modify the documentation file on the schema adding a (much shorter) description of what an id is and how they relate to the database structure. A few specifics:
gridfs_wf_id and source_id.
Well, first of all, my answer to the question of which ids need to be ObjectIDs is that it depends. MongoDB doesn't care whether those ids are ObjectIDs or not, as it always uses _id internally. To my understanding, MongoDB uses the ObjectID to ensure uniqueness of the documents, so the content of any of those id fields does not matter until we put the data into a database. Only then do we need to ensure the ids are unique. If there is no database and the data set is small enough, a customized naming scheme (for all those ids) would work perfectly well for data management (e.g. that's how people use SAC filenames). A database is the tool that helps manage large data sets where an arbitrary naming scheme no longer works. In this case, the uniqueness of the ids is ensured and maintained by the database instead of the human brain, and it makes more sense to use an unreadable ObjectID there. By making the ids strings, our system can adapt to both scenarios just fine. The question here is actually not which id needs to be an ObjectID but when we need to use an ObjectID, and how to ensure the uniqueness. We will need to think about it more carefully. For now, my thought is that we might need to come up with a mechanism to check the uniqueness and update related documents when a new record is written to the database. I still need to formulate the details, which I don't really know yet.
Station object, it seems unnecessary to keep that channel information in the database, and we won't need the chan_id anymore.
Just found another issue with the current schema. We use source_id as the unique id that replaces the concept of evid in Antelope. However, there is also the orid, which we retained in our schema definition for compatibility. This is confusing in two different ways. First, we need to make sure our users understand the difference between evid (source_id) and orid. Then, because an evid can have multiple orids, isn't that gonna be a problem if an imported dataset has multiple orids for an evid? Because they share the same evid, which has to be unique in our schema, we need to figure out a way to handle such a scenario. I think in our schema design, we implicitly assumed that a prefor is chosen for all events. We probably need, at least, to make that clear to the user.
Good point.
First, I don't think we should try to support the idea of multiple orids for a common evid. That is a detail network operators need to be concerned about, but it is a detail handled well by Antelope and any system using the CSS3.0 schema. I suggest we take the attitude that the user should treat that as a data assembly problem and ensure that the dataset to be handled is "clean" in the sense that there is one and only one location estimate associated with a given waveform that is to be processed. That way our schema can ignore the whole idea of prefor and just allow evid and orid as potentially useful integer tags.
That is my opinion anyway.
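To make the "clean dataset" idea concrete, here is a minimal sketch of a pre-import check a user could run. The function name evids_with_multiple_orids and the sample rows are made up for illustration; the only assumption is that each record is a dict that may carry integer evid and orid keys.

```python
from collections import defaultdict


def evids_with_multiple_orids(records):
    """Return {evid: sorted orids} for every evid tied to more than one orid.

    records is any iterable of dicts carrying integer "evid" and "orid"
    keys; records missing either key are skipped.
    """
    orids_by_evid = defaultdict(set)
    for rec in records:
        if "evid" in rec and "orid" in rec:
            orids_by_evid[rec["evid"]].add(rec["orid"])
    return {
        evid: sorted(orids)
        for evid, orids in orids_by_evid.items()
        if len(orids) > 1
    }


# Made-up rows for illustration only.
rows = [
    {"evid": 1, "orid": 10},
    {"evid": 1, "orid": 11},  # second origin for the same event
    {"evid": 2, "orid": 20},
    {"evid": 2, "orid": 20},  # same origin repeated on another waveform: fine
]
conflicts = evids_with_multiple_orids(rows)
```

An empty result would mean the dataset already has one location estimate per event, so evid and orid can be carried along as plain integer tags.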
I totally agree. Since we are using source_id as the unique id for each event, I think that assumption is baked into our schema design anyway. I guess that is another reason to put some of those keys in a schema extension as opposed to what should be considered the core schema. Both evid and orid belong there.
When re-organizing the code, I got a chance to look carefully at how each function is implemented, and I noticed that the use of the concept "id" may be a bit confusing. We actually have a mix of two different kinds of "id" in use. The first one is the ObjectId, which is the default index of any document in MongoDB. Then, we also inherited the "id" concept from Antelope (or CSS3.0 to be more general), which is an integer that indexes entries in a table, and the largest number usually matches the total number of entries in the table.
For the wfid, we completely ignored the Antelope convention and used the ObjectId (with the default name _id in the wf collection) as the only index. These ids are referenced in the code as well as in the elog collection with the key wfid_string.
For job_id, an integer is used because the ErrorLogger implemented in C++ knows nothing about the ObjectId type. The db._save_elog method returns a list of ObjectIds after the error log is inserted into the database. However, these ObjectIds are not currently used anywhere in the code.
For site_id and source_id, both are integers created directly by importing from Antelope into the wf collection. After normalizing and creating the site and source collections, they are retained in the new collections, with the keys oid_site and source_oidstring, referring to the ObjectIds of the new documents, added to the wf collection.
There may be more, but I think I have covered all the uses of IDs here. I think we need to standardize the use of all these IDs. For wf_id, site_id and source_id, we should probably use them for the ObjectId only. The site_id and source_id imported from Antelope should be aliased to something like site_id_antelope and source_id_antelope. Then, when using ObjectIds in the code, we should always convert them into string type so that we don't need confusing keys like source_oidstring.
The job_id seems to be a problem, as it is implemented in C++ and we need to use the integer type to make it compatible. Maybe we should just leave it as is, as an exception. What do you think?
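A rough sketch of what the proposed convention could look like on a single wf document. The helpers alias_antelope_ids and attach_id, the alias table, and the sample document are all hypothetical and only illustrate the renaming; plain dicts stand in for MongoDB documents so the snippet runs without a database.

```python
# Hypothetical alias table: imported Antelope key -> renamed key.
ANTELOPE_ALIASES = {
    "site_id": "site_id_antelope",
    "source_id": "source_id_antelope",
}


def alias_antelope_ids(doc, aliases=ANTELOPE_ALIASES):
    """Return a copy of doc with imported integer ids moved to alias keys."""
    out = {}
    for key, value in doc.items():
        out[aliases.get(key, key)] = value
    return out


def attach_id(doc, key, object_id):
    """Return a copy of doc with object_id stored under key as a string.

    object_id can be anything whose str() is the id text; a bson ObjectId
    behaves that way, which is what makes a key like source_oidstring
    unnecessary.
    """
    out = dict(doc)
    out[key] = str(object_id)
    return out


# Made-up wf document as it might look right after an Antelope import.
wf_doc = {"sta": "AAK", "site_id": 12, "source_id": 345}
wf_doc = alias_antelope_ids(wf_doc)
wf_doc = attach_id(wf_doc, "site_id", "0123456789abcdef01234567")
```

After these two steps site_id holds only the string form of the ObjectId, while the original integer survives under site_id_antelope for round trips back to Antelope.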