Open pavlis opened 1 year ago
I think we've been through this issue before. The core of the problem is deciding, for a given workflow, what should be readonly and what shouldn't be. I remember our conclusion was that the data download and data processing workflows are distinctly different, and the problem here is using the wrong schema for the wrong workflow. I thought we had resolved this by having different sets of schema, and mspass_lite.yaml should not have this problem.
I just checked and can see that it does not have the problematic readonly definition: https://github.com/mspass-team/mspass/blob/44aa03372470bd3f8a18eb6636698c247de1c0e2/data/yaml/mspass_lite.yaml#L209-L247 If you use this one, it should save the miniseed codes correctly I think.
I'm testing this further after the @wangyinz comment above. I think I did make an error in the above debugging: I meant to use "mspass_lite" but got "mspass" for the schema instead. However, that mistake may have been good because it exposed this error, and this particular error is really bad. If a user innocently did this (remember mspass is the default schema) they would create a database that was unusable. The reason is that when the wf_TimeSeries collection is created in the above context, there are no values of net, sta, chan in the documents. Since I didn't do normalization to define a "channel_id" attribute, when saved that way there is no way the waveforms indexed by the wf_TimeSeries collection can be associated with the correct channel metadata. I might be able to make this tutorial work by using mspass_lite (I'm running that test as I'm composing this), but that is just a temporary workaround if it works.
I looked at the code for the save_data method and conclude that the fundamental problem here is that the function has been through too many modifications to patch issues without rethinking the overall design. I see a couple of things I think we should fix:
The save_ensemble_binary_file method, for example, largely cloned this flawed code block. Creating a private method to handle this common task will regularize that behavior.
{"net" : "TA",
"channel_net" : "TA",
"sta" : "S22A",
"channel_sta" : "S22A",
"chan" : "BHZ",
"channel_chan" : "BHZ"
}
Allowing item 3 would fix this bug. Items 1 and 2 would just clean up the code base.
As noted I reran the offending workbook again with this change to define the database handle:
from mspasspy.db.database import Database
import mspasspy.client as msc
dbclient=msc.Client(schema='mspass_lite.yaml')
db = dbclient.get_database("getting_started")
I get the same behavior. That is very mysterious because I even looked at mspass_lite in the data/yaml directory and it does not have net, sta, chan, and loc set readonly. I am not sure why that is happening. @wangyinz you should be able to recreate this as I'm working with the current instance of the "getting_started" tutorial in the notebooks directory of the tutorial repository.
Oh, I see the problem. I don't think there is such a thing as mspasspy.client.Client(schema='mspass_lite.yaml'). There is no schema option in the client. This is indeed a design flaw, as we don't have anything other than the default database constructor to specify the schema. Currently, the schema should be set by the set_metadata_schema and set_database_schema methods, i.e.,
from mspasspy.db.schema import DatabaseSchema, MetadataSchema
from mspasspy.db.database import Database
import mspasspy.client
dbclient = mspasspy.client.Client()
db = dbclient.get_database("getting_started")
db.set_metadata_schema(MetadataSchema('mspass_lite.yaml'))
db.set_database_schema(DatabaseSchema('mspass_lite.yaml'))
Good, that explains why the workaround failed. What this did, however, remains a bug we need to eventually squash. I think we should, as I suggested above, use this as a reason to clean up the Database class implementation. We should probably not close this issue until that has been done.
@wangyinz and I had exchanged some email about this problem, but I now know the cause of the problem. I am leaving this entry here for the record. This is an urgent bug fix as it creates wf_TimeSeries entries that are pretty much useless. The problem was found in revisions to the getting_started tutorial. Within the current version of that tutorial is this block of code:
Note that this job is fetching waveform data from IRIS with obspy. All it then does is convert the waveforms to mspass data objects and save them to the database with the default use of the db.save_data function. The bug is that this context creates incomplete wf_TimeSeries documents. The symptom I saw was that the stock SEED station codes "net", "sta", and "chan" were not being saved.
My code block above shows how I enabled pdb to find this error while running a notebook. I preserve that here as it is a very helpful trick for debugging a notebook. It actually goes way back, and we need to add something about that usage in a new user manual section we need to produce on debugging a workflow.
With pdb I was able to figure out what happened. The logic of the following section of the save_data method is flawed:
where I pasted the section from pdb created with the list command. The problem is that in the context of the workflow above we have this value for changed_key_list:
But note this line:
I get the same result for "net" and "chan". Because "net", "sta", and "chan" are not in the changed_key_list, the above logic causes them to be erased from the data inserted to form the new document for the data being saved. I'm not quite sure how to fix this without breaking something else or I would blunder forward. This exists because of the need to handle the "readonly" data problem, but the logic here fails because in this context "net", "sta", and "chan" really aren't changed: they were defined during the construction of the data object, so they didn't "change" but were created.
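For the record, here is a minimal sketch of the failure mode as I understand it from the description above. It uses plain dicts and is not the actual save_data code: READONLY_KEYS, build_insert_doc, and the sample changed_keys value are all hypothetical stand-ins chosen only to reproduce the symptom.

```python
# Hypothetical stand-in for the keys the default schema marks readonly.
READONLY_KEYS = {"net", "sta", "chan", "loc"}


def build_insert_doc(metadata, changed_keys):
    """Mimic the flawed filter: a readonly key survives only if it is
    listed as changed. This is an illustration, not the MsPASS code."""
    doc = {}
    for key, value in metadata.items():
        if key in READONLY_KEYS and key not in changed_keys:
            # Treated as untouched normalized data, so it is dropped,
            # even though no stored copy exists yet for a new document.
            continue
        doc[key] = value
    return doc


# Metadata as it might look right after conversion from an obspy Trace.
metadata = {"net": "TA", "sta": "S22A", "chan": "BHZ", "npts": 1000, "delta": 0.025}

# Assumed value: the SEED codes were set at construction, so they are
# not flagged as changed.
changed_keys = {"npts", "delta"}

doc = build_insert_doc(metadata, changed_keys)
print(doc)
```

With these assumed inputs the resulting document has no "net", "sta", or "chan", which matches the symptom seen in wf_TimeSeries.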
This urgently needs to be fixed, but as I said I'm not sure what the right solution is that won't create a different problem. One thing I propose is that we should consider making "net", "sta", "chan", and "loc" special attributes. The four letter word of SEED is so locked into those four attributes they may need special designation.
The other element is that when I look a bit deeper, I think the root of this problem is that the Database class treats updates and adds for documents somewhat similarly. That is, I think all metadata traffic in Database goes through the update_metadata method. It seems we ought to have a way to handle data more cleanly when the operation is "add" instead of "update".
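To make the add-versus-update idea concrete, here is one possible shape for the split. It is only a sketch under my own assumptions: doc_for_add, doc_for_update, and READONLY_KEYS are hypothetical names, not existing Database methods, and the real fix may need to handle more cases (aliases, type checks, and so on).

```python
# Hypothetical stand-in for the keys a schema marks readonly.
READONLY_KEYS = {"net", "sta", "chan", "loc"}


def doc_for_add(metadata):
    """New document: nothing is stored yet, so there is nothing to
    protect and every key is kept."""
    return dict(metadata)


def doc_for_update(metadata, changed_keys):
    """Existing document: send only the changed keys and report any
    attempted change to a readonly key instead of applying it."""
    update = {}
    rejected = []
    for key in sorted(changed_keys):
        if key not in metadata:
            continue
        if key in READONLY_KEYS:
            rejected.append(key)
        else:
            update[key] = metadata[key]
    return update, rejected


metadata = {"net": "TA", "sta": "S22A", "chan": "BHZ", "npts": 1000}

# "add" path: the SEED codes are preserved.
new_doc = doc_for_add(metadata)

# "update" path: npts is updated, the readonly "sta" is flagged.
update, rejected = doc_for_update(metadata, {"npts", "sta"})
print(new_doc, update, rejected)
```

The point of the design is that the readonly test only makes sense when a stored document already exists, so the "add" path never consults changed_keys at all.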