mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License
28 stars 11 forks source link

history_object collection change suggestion #456

Open pavlis opened 10 months ago

pavlis commented 10 months ago

This is the latest in a series of posts here on changes I am making for what will be our next numbered release of MsPASS. The main change is in the Database class readers and writers. In the process of getting the revisions to pass our test suite I uncovered an issue with how we currently handle saving ProcessingHistory. I would like to note up front that changes we make here will be pretty much internal since we haven't really advertised nor fully supported the object-level history up to now anyway. Hence, I don't think what I propose here will have any impact on any user.

The issue I have with the current implementation is it mixes up two different "id" concepts: (1) documents in MongoDB use an ObjectId as a unique tag for each document that always has the key "_id" and (2) in the object-level history container each processing step is given save tags that give it a unique identification including a guaranteed id tag referenced with the id method of ProcessingHistory. The problem is v1 of Database interweave these two id concepts by, in some cases, using the str() representation of an "_id" value to get an id value. At best that causes a confusion and at worst it could cause a collision of id values if reused improperly.

Here is the current schema definition of the history_object collection in the mspass schema:

  history_object:
    schema:
      _id:
        type: string
        concept: UUID used to define an unique entry in history object collection.
        constraint: required
      processing_history:
        type: bytes
        concept: serialized content of ProcessingHistory.
        constraint: required
      wf_Seismogram_id:
        reference: wf_Seismogram
        constraint: xref_key
      wf_TimeSeries_id:
        reference: wf_TimeSeries
        constraint: xref_key
      alg_id:
        concept: ObjectId used to define a unique algorithm(alg_name + parameters)
        reference: history_global
        constraint: xref_key
      alg_name:
        type: string
        concept: the name of the algorithm
        constraint: required

First, effectively overloading "_id" like that to redefine it as a string seems alike a bad idea to me. "_id" is so deeply wired into MongoDB that is just not a good idea. My proposed revision is this:

  history_object:
    schema:
       _id:
          type: ObjectID
          concept: ObjectID used to define a unique entry in this collection (normally generated by MongoDB)
          constraint: required
      save_uuid:
        type: string
        concept: UUID used to define an unique entry in history object collection.  Should be the uuid of the save function step of processing.
        constraint: required
      save_stage:
        type:  int
        concept:  Processing stage of save process that generated this document.  
        constraint:  required
      processing_history:
        type: bytes
        concept: serialized content of ProcessingHistory.
        constraint: required
      wf_Seismogram_id:
        reference: wf_Seismogram
        constraint: xref_key
      wf_TimeSeries_id:
        reference: wf_TimeSeries
        constraint: xref_key
      alg_id:
        concept: ObjectId used to define a unique algorithm(alg_name + parameters)
        reference: history_global
        constraint: xref_key
      alg_name:
        type: string
        concept: the name of the algorithm
        constraint: required
     job_name:
       type: string
       concept:   Name tag assigned to a specific workflow (job).  
       constraint:  optional
    job_id:
       type:  string
       concept:  Instance of running a specific workflow (job).  Used when same basic workflow is run with small perturbations (e.g. multiple runs for a parameter test)
       constraint:  optional

Noting this makes "_id" the normal use and adds: save_uuid, save_stage, job_name, and job_id. In my new implementation job_name and job_id are only saved if the they were defined in the workflow consistent with making them "optional".

pavlis commented 10 months ago

PS: Forgot to mention with the above change my new implementation no longer fails running the test suite in the way that led me to post this issue.