mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

bug in save_data #433

Open pavlis opened 1 year ago

pavlis commented 1 year ago

@wangyinz and I had exchanged some email about this problem, but I now know its cause. I am leaving this entry here for the record. This bug urgently needs a fix because it creates wf_TimeSeries entries that are pretty much useless. The problem turned up while revising the getting_started tutorial. The current version of that tutorial contains this block of code:

# Note: cursor, keepers, start_offset, end_offset, client, and db are all
# defined in earlier cells of the tutorial notebook.
from obspy import UTCDateTime
from mspasspy.util.converter import Trace2TimeSeries

count = 0
i = 0
for doc in cursor:
    origin_time = doc['time']
    if i in keepers:
        # We need the ObjectId of the source to provide a cross reference linking our
        # waveform data to the right source.  Better to do this now than later, as
        # association can be a big challenge.
        id = doc['_id']

        print('Starting on event number', i, ' with origin time=', UTCDateTime(origin_time))
        stime = origin_time + start_offset
        etime = origin_time + end_offset
        strm = client.get_waveforms(
            starttime=UTCDateTime(stime),
            endtime=UTCDateTime(etime),
            network='TA',
            station='*',   # station is a required argument of get_waveforms
            channel='BH?',
            location='*',
        )
        # The output of get_waveforms is an obspy Stream object.  Stream objects are
        # iterable, so we work through the group, converting each Trace to a MsPASS
        # TimeSeries and then saving the result to the database.
        import pdb; pdb.set_trace()
        for d in strm:
            d_mspass = Trace2TimeSeries(d)
            # Here is where we save the id linked to the source collection.
            # We use the MsPASS convention that such metadata have the name collection_id.
            d_mspass['source_id'] = id
            d_mspass = db.save_data(d_mspass)
            count += 1
    else:
        print('Skipping event number ', i, ' with origin time=', UTCDateTime(origin_time))
    i += 1

Note this job fetches waveform data from IRIS with obspy. All it then does is convert the waveforms to MsPASS data objects and save them to the database with the default behavior of the db.save_data method.

The bug is that this context creates incomplete wf_TimeSeries documents. The symptom I saw was that the stock SEED station codes "net", "sta", and "chan" were not being saved.
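
For concreteness, one way to see the symptom is to pull a freshly saved document with standard pymongo calls (the MsPASS Database handle extends pymongo's Database class, so the usual collection access works):

    # Inspect one document saved by the loop above; with this bug the SEED
    # codes are missing from the saved document.
    doc = db.wf_TimeSeries.find_one()
    print([k for k in ("net", "sta", "chan") if k not in doc])  # expect ['net', 'sta', 'chan']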

My code block above shows how I enabled pdb to find this error while running a notebook. I preserve that here because it is a very helpful trick for debugging a notebook. The trick actually goes way back, and we should add something about that usage in a new user manual section we need to write on debugging a workflow.
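
For anyone who hasn't seen the trick: dropping a set_trace call into a notebook cell stops execution at that point and opens an interactive debugger prompt right below the cell. A minimal, self-contained illustration:

    import pdb

    def buggy(x):
        y = x * 2
        # Execution stops here when the cell runs; at the prompt you can
        # inspect variables (p y), step (n), or continue (c).
        pdb.set_trace()
        return y + 1

    buggy(21)

After an uncaught exception, the IPython %debug magic gives a similar post-mortem prompt.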

With pdb I was able to figure out what happened. The logic of the following section of the save_data method is flawed:

    927             for k in copied_metadata.keys():
    928                 if save_schema.is_defined(k):
    929                     if save_schema.readonly(k):
    930                         if k in changed_key_list:
    931                             newkey = "READONLYERROR_" + k
    932                             copied_metadata.change_key(k, newkey)
    933                             mspass_object.elog.log_error(
    934                                 "Database.save_data",
    935                                 "readonly attribute with key="
    936                                 + k
    937                                 + " was improperly modified.  Saved changed value with key="
    938                                 + newkey,
    939                                 ErrorSeverity.Complaint,
    940                             )
    941                         else:
    942                             copied_metadata.erase(k)

where I pasted the section from pdb created with the list command. The problem is that, in the context of the workflow above, changed_key_list has this value:

ipdb>  print(changed_key_list)
{'time_standard', 'utc_convertible', 'starttime', 'source_id', 'npts'}

But note this line:

ipdb>  print(save_schema.readonly("sta"))
True

I get the same result for "net" and "chan". Because "net", "sta", and "chan" are not in changed_key_list, the logic above erases them from the data inserted to form the new document for the data being saved.
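
To make the failure mode concrete, here is a small standalone mimic of the loop quoted above (plain dicts and sets stand in for the real Metadata and schema objects; the key sets come from the pdb session):

    # Standalone mimic of the flawed screening loop; not the real MsPASS classes.
    readonly_keys = {"net", "sta", "chan"}            # per save_schema.readonly()
    changed_key_list = {"time_standard", "utc_convertible", "starttime",
                        "source_id", "npts"}          # from the pdb session above
    copied_metadata = {"net": "TA", "sta": "S22A", "chan": "BHZ",
                       "starttime": 1234567890.0, "source_id": "abc123"}

    for k in list(copied_metadata):
        if k in readonly_keys:
            if k in changed_key_list:
                copied_metadata["READONLYERROR_" + k] = copied_metadata.pop(k)
            else:
                # This branch fires for net, sta, and chan: readonly but never
                # "changed", so they are silently erased before the insert.
                del copied_metadata[k]

    print(copied_metadata)  # {'starttime': 1234567890.0, 'source_id': 'abc123'}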

I'm not quite sure how to fix this without breaking something else, or I would blunder forward. This logic exists to handle the "readonly" data problem, but it fails in this context because "net", "sta", and "chan" weren't actually changed: they were defined during construction of the data object, so they were created, not modified.

This urgently needs to be fixed, but as I said, I'm not sure what the right solution is that won't create a different problem. One thing I propose is that we consider making "net", "sta", "chan", and "loc" special attributes. Those four SEED codes are so deeply locked into seismology that they may need special designation.

Looking a bit deeper, I think the root of this problem is that the Database class treats document adds and updates almost identically. That is, I believe all metadata traffic in Database goes through the update_metadata method. It seems we ought to have a cleaner way to handle metadata when the operation is an "add" rather than an "update".
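
A minimal sketch of that separation (hypothetical helper, dicts standing in for Metadata; not current MsPASS code): apply the readonly screening only when the operation is an update, and let construction-time values pass through on an add:

    # Hypothetical illustration of add-vs-update separation, not the MsPASS API.
    def screen_for_save(metadata, readonly_keys, changed_key_list, operation):
        md = dict(metadata)
        if operation == "add":
            # A brand-new document: construction-time keys like net/sta/chan
            # were created, not changed, so everything is kept.
            return md
        # operation == "update": the current readonly screening applies.
        for k in list(md):
            if k in readonly_keys:
                if k in changed_key_list:
                    md["READONLYERROR_" + k] = md.pop(k)
                else:
                    del md[k]
        return md

    print(screen_for_save({"net": "TA", "npts": 1000}, {"net"}, {"npts"}, "add"))
    # {'net': 'TA', 'npts': 1000}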

wangyinz commented 1 year ago

I think we've been through this issue before. The core of the problem is deciding, for a given workflow, what should be readonly and what shouldn't be. I remember our conclusion was that data download and data processing workflows are distinctly different, and the problem here is using the wrong schema for the workflow. I thought we got this resolved by having different sets of schemas, and mspass_lite.yaml should not have this problem.

I just checked and can see that it does not have the problematic readonly definition: https://github.com/mspass-team/mspass/blob/44aa03372470bd3f8a18eb6636698c247de1c0e2/data/yaml/mspass_lite.yaml#L209-L247 If you use that schema, I think it should save the miniseed codes correctly.
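
If that diagnosis is right, a quick way to confirm which schema is in play is to compare the readonly flag directly (using the MetadataSchema class shown later in this thread; I am assuming here that it exposes per-type definitions through a TimeSeries attribute):

    from mspasspy.db.schema import MetadataSchema

    # The default schema flags "sta" readonly (True per the pdb session above);
    # mspass_lite should not.
    print(MetadataSchema('mspass.yaml').TimeSeries.readonly('sta'))
    print(MetadataSchema('mspass_lite.yaml').TimeSeries.readonly('sta'))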

pavlis commented 1 year ago

I'm testing this further after the @wangyinz comment above. I think I did make an error in the debugging above: I meant to use "mspass_lite" but got the "mspass" schema instead. That mistake may have been fortunate, however, because it exposed this error, and this particular error is really bad. If a user innocently did this (remember, mspass is the default schema), they would create a database that was unusable. The reason is that when the wf_TimeSeries collection is created in the above context, the documents contain no values of net, sta, or chan. Since I did no normalization to define a "channel_id" attribute, there is no way the waveforms indexed by the wf_TimeSeries collection can be associated with the correct channel metadata. I might be able to make this tutorial work by using mspass_lite (I'm running that test as I compose this), but even if that works it is just a temporary workaround.

I looked at the code for the save_data method and conclude the fundamental problem is that the function has been through too many modifications patching individual issues without rethinking the overall design. I see a few things we should fix:

  1. When we run in "promiscuous" mode, all metadata contents should be saved without edits of any kind; i.e., everything is just saved without any tests. A slight modification: it might still be prudent to test for type mismatches or nulls.
  2. We need to encapsulate all the metadata manipulation code into one or more private methods that any save function can call. The recently added save_ensemble_binary_file method, for example, largely cloned this flawed code block. A private method for this common task would regularize that behavior.
  3. Because seed/miniseed data is so universal for mspass, we should probably rethink the default schema. In particular, I think it would be appropriate to allow net, sta, chan, and loc as plain value keys in wf_TimeSeries and wf_Seismogram; a sketch of how a save could produce this follows this list. By that I mean a TimeSeries or Seismogram object would allow content like the following (in python dict format for clarity):
    {"net" : "TA",
    "channel_net" : "TA",
    "sta" : "S22A",
    "channel_sta" : "S22A",
    "chan" : "BHZ",
    "channel_chan" : "BHZ"
    }

    Allowing item 3 would fix this bug. Items 1 and 2 would just clean up the code base.
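
Here is the sketch promised in item 3 of how a save could populate the dual keys (hypothetical helper; the names follow the example dict above, not current MsPASS code):

    # Hypothetical: keep both the plain SEED keys and channel_-prefixed copies
    # instead of erasing the plain ones.
    SEED_KEYS = ("net", "sta", "chan", "loc")

    def add_channel_copies(md):
        out = dict(md)
        for k in SEED_KEYS:
            if k in out:
                out["channel_" + k] = out[k]
        return out

    print(add_channel_copies({"net": "TA", "sta": "S22A", "chan": "BHZ"}))
    # {'net': 'TA', 'sta': 'S22A', 'chan': 'BHZ', 'channel_net': 'TA',
    #  'channel_sta': 'S22A', 'channel_chan': 'BHZ'}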

pavlis commented 1 year ago

As noted, I reran the offending notebook with this change to define the database handle:

from mspasspy.db.database import Database
import mspasspy.client as msc
dbclient=msc.Client(schema='mspass_lite.yaml')
db = dbclient.get_database("getting_started")

I get the same behavior. That is very mysterious, because I even looked at mspass_lite in the data/yaml directory and it does not have net, sta, chan, and loc set readonly. I am not sure why this is happening. @wangyinz, you should be able to recreate this; I'm working with the current version of the "getting_started" tutorial in the notebooks directory of the tutorial repository.

wangyinz commented 1 year ago

Oh, I see the problem. I don't think there is any such thing as mspasspy.client.Client(schema='mspass_lite.yaml'); there is no schema option in the client. This is indeed a design flaw, as we have nothing other than the default Database constructor for specifying the schema. Currently, the schema should be set with the set_metadata_schema and set_database_schema methods, i.e.,

from mspasspy.db.schema import DatabaseSchema, MetadataSchema
from mspasspy.db.database import Database
import mspasspy.client 
dbclient = mspasspy.client.Client()
db = dbclient.get_database("getting_started")
db.set_metadata_schema(MetadataSchema('mspass_lite.yaml'))
db.set_database_schema(DatabaseSchema('mspass_lite.yaml'))
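
A quick sanity check before re-running the save (assuming, as I believe is the case, that the Database object stores the schema set this way in a metadata_schema attribute):

    # Confirm the lite schema took effect: "sta" should no longer be readonly.
    print(db.metadata_schema.TimeSeries.readonly("sta"))  # expect False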
pavlis commented 1 year ago

Good, that explains why the workaround failed. What this exposed, however, remains a bug we need to squash eventually. As I suggested above, I think we should use this as a reason to clean up the Database class implementation. We should probably not close this issue until that has been done.