This is the latest in a series of posts here on changes I am making for what will be our next numbered release of MsPASS. The main change is in the Database class readers and writers. In the process of getting the revisions to pass our test suite I uncovered an issue with how we currently handle saving ProcessingHistory. I would like to note up front that changes we make here will be pretty much internal since we haven't really advertised nor fully supported the object-level history up to now anyway. Hence, I don't think what I propose here will have any impact on any user.
The issue I have with the current implementation is it mixes up two different "id" concepts: (1) documents in MongoDB use an ObjectId as a unique tag for each document that always has the key "_id" and (2) in the object-level history container each processing step is given save tags that give it a unique identification including a guaranteed id tag referenced with the id method of ProcessingHistory. The problem is v1 of Database interweave these two id concepts by, in some cases, using the str() representation of an "_id" value to get an id value. At best that causes a confusion and at worst it could cause a collision of id values if reused improperly.
Here is the current schema definition of the history_object collection in the mspass schema:
history_object:
schema:
_id:
type: string
concept: UUID used to define an unique entry in history object collection.
constraint: required
processing_history:
type: bytes
concept: serialized content of ProcessingHistory.
constraint: required
wf_Seismogram_id:
reference: wf_Seismogram
constraint: xref_key
wf_TimeSeries_id:
reference: wf_TimeSeries
constraint: xref_key
alg_id:
concept: ObjectId used to define a unique algorithm(alg_name + parameters)
reference: history_global
constraint: xref_key
alg_name:
type: string
concept: the name of the algorithm
constraint: required
First, effectively overloading "_id" like that to redefine it as a string seems alike a bad idea to me. "_id" is so deeply wired into MongoDB that is just not a good idea. My proposed revision is this:
history_object:
schema:
_id:
type: ObjectID
concept: ObjectID used to define a unique entry in this collection (normally generated by MongoDB)
constraint: required
save_uuid:
type: string
concept: UUID used to define an unique entry in history object collection. Should be the uuid of the save function step of processing.
constraint: required
save_stage:
type: int
concept: Processing stage of save process that generated this document.
constraint: required
processing_history:
type: bytes
concept: serialized content of ProcessingHistory.
constraint: required
wf_Seismogram_id:
reference: wf_Seismogram
constraint: xref_key
wf_TimeSeries_id:
reference: wf_TimeSeries
constraint: xref_key
alg_id:
concept: ObjectId used to define a unique algorithm(alg_name + parameters)
reference: history_global
constraint: xref_key
alg_name:
type: string
concept: the name of the algorithm
constraint: required
job_name:
type: string
concept: Name tag assigned to a specific workflow (job).
constraint: optional
job_id:
type: string
concept: Instance of running a specific workflow (job). Used when same basic workflow is run with small perturbations (e.g. multiple runs for a parameter test)
constraint: optional
Noting this makes "_id" the normal use and adds: save_uuid, save_stage, job_name, and job_id. In my new implementation job_name and job_id are only saved if the they were defined in the workflow consistent with making them "optional".
This is the latest in a series of posts here on changes I am making for what will be our next numbered release of MsPASS. The main change is in the
Database
class readers and writers. In the process of getting the revisions to pass our test suite I uncovered an issue with how we currently handle savingProcessingHistory
. I would like to note up front that changes we make here will be pretty much internal since we haven't really advertised nor fully supported the object-level history up to now anyway. Hence, I don't think what I propose here will have any impact on any user.The issue I have with the current implementation is it mixes up two different "id" concepts: (1) documents in MongoDB use an ObjectId as a unique tag for each document that always has the key "_id" and (2) in the object-level history container each processing step is given save tags that give it a unique identification including a guaranteed id tag referenced with the
id
method ofProcessingHistory
. The problem is v1 ofDatabase
interweave these two id concepts by, in some cases, using the str() representation of an "_id" value to get an id value. At best that causes a confusion and at worst it could cause a collision of id values if reused improperly.Here is the current schema definition of the
history_object
collection in the mspass schema:First, effectively overloading "_id" like that to redefine it as a string seems alike a bad idea to me. "_id" is so deeply wired into MongoDB that is just not a good idea. My proposed revision is this:
Noting this makes "_id" the normal use and adds:
save_uuid, save_stage, job_name
, andjob_id
. In my new implementationjob_name
andjob_id
are only saved if the they were defined in the workflow consistent with making them "optional".