kujaku11 / mth5

Exchangeable and archivable format for magnetotelluric time series to better serve the community through FAIR principles.
https://mth5.readthedocs.io/en/latest/index.html
MIT License

Filters not included for all channels #157

Closed by kkappler 11 months ago

kkappler commented 1 year ago

This relates to aurora issue 252 Wide scale testing on Earthscope.

In an effort to better understand disagreements in TFs between aurora and SPUD, a closer look at station CAS04 was undertaken. CAS04 does result in an incorrect TF, and the reason appears to be that not all channel-runs are associated with filters in the mth5.

I verified that this problem can be reproduced by running the jupyter notebook: mth5/docs/examples/notebooks/make_mth5_driver_v0.2.0.ipynb

I modified the notebook on the fc branch, adding two cells. Right after the channel_summary_df is created, an iterrows operation counts the filters for each row of the channel summary and records the count in a new column, "n_filters".
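The filter count can be sketched with plain pandas; the DataFrame below is a toy stand-in for the real channel summary (in the notebook each row would be looked up in the open mth5 and its attached filters counted; mth5 attribute names are omitted here since they vary by version):

```python
import pandas as pd

# Toy stand-in for channel_summary_df; the "filters" column simulates the
# filter lists that would be read from the mth5 for each channel-run.
channel_summary_df = pd.DataFrame(
    {
        "run": ["a", "b", "c", "d", "d"],
        "component": ["ex", "ey", "ey", "ex", "ey"],
        "filters": [["f1", "f2"], ["f1", "f2"], [], [], []],
    }
)

# Count filters row by row, as in the two added cells
channel_summary_df["n_filters"] = [
    len(row["filters"]) for _, row in channel_summary_df.iterrows()
]

# Channel-runs with no filters stand out immediately
missing = channel_summary_df[channel_summary_df.n_filters == 0]
print(missing[["run", "component"]])
```

With data shaped like CAS04's, the three zero-filter rows surface directly from the summary table.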

Three entries in that table have no filters: channel ey in run c and channels ex, ey in run d. We may have encountered this before in aurora issue 31; however, I believe we worked around it with a custom XML provided by Tim. The metadata at IRIS look OK now, though.

Using IRIS data services, we get this XML; I manually checked it, and filters do appear to be associated with every channel.

So, something is likely going wrong in the mth5 build. As an additional hint, when I build the "dataless" mth5 for CAS04, rather than having the expected number of rows (20, 5ch x 4 runs) there are 17 rows, and the missing rows correspond exactly to the rows that have no filters in the mth5.

kkappler commented 1 year ago

With this change, the CAS04 tests yield TFs in good agreement with SPUD, and no rows in the channel_summary have zero filters, so I think we want to keep it.

@laura-iris : Could you please also take a look at the logic in get_inventory_from_df and see if it can be simplified somewhat? Originally I had hoped to just iterate a single function over rows of the request_df, facilitating parallelization. It may be that simplifying the request_df before it hits get_inventory_from_df is appropriate (e.g. converting wildcards to explicit network-station-channel lists, and maybe other simplifications).
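As a sketch of one such simplification, a channel wildcard could be expanded into explicit codes before the request reaches get_inventory_from_df; the available-channel list and expansion logic below are invented for illustration, not the mth5 API:

```python
import pandas as pd
from fnmatch import fnmatch

# Invented example: channel codes assumed available at the data center
available_channels = ["LFE", "LFN", "LFZ", "LQE", "LQN"]

# A request row using a wildcard channel pattern
request_df = pd.DataFrame(
    [{"network": "ZU", "station": "CAS04", "channel": "LF*"}]
)

# Expand each wildcard row into one row per matching explicit channel code
rows = []
for _, row in request_df.iterrows():
    for code in available_channels:
        if fnmatch(code, row["channel"]):
            rows.append({**row.to_dict(), "channel": code})

explicit_df = pd.DataFrame(rows)
print(explicit_df.channel.tolist())
```

After expansion, each row is a single explicit network-station-channel request, which is the shape needed to map one function over the rows in parallel.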

Some notes are in issue #157 and aurora issue 277

laura-iris commented 1 year ago

This reminds me a lot of some work I did last year, and when I was looking through my code and notes I see that what appears to be the exact same function used to live in make_mth5.py instead of fdsn.py. I had worked on looping over the epochs and simplifying some of the logic at that point. It seems like the changes weren't translated over to fdsn.py.

@kkappler can you take a look at this old merge request and let me know if it also looks like the same code? https://github.com/kujaku11/mth5/pull/106

kkappler commented 1 year ago

@laura-iris This does look like the same code (already cleaned up and fixed!). It makes sense to try to replace the existing get_inventory_from_df method in fdsn.py on the fc branch with the fix you previously made in make_mth5.py, which had been merged into master.

I don't completely understand it, but here is what it looks like happened:

In an effort to generalize make_mth5.py so that it can get data from clients other than earthscope/IRIS, @kujaku11 modified make_mth5.py to use a "client plugin model". In the current implementation there are two client plugins, fdsn.FDSN and geomag.GeomagClient (for Intermagnet observatories). In future, other clients can follow this pattern.
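A minimal sketch of that plugin pattern, with simplified stand-ins for the real classes (FDSN and GeomagClient are the actual plugin names from mth5.clients; the wiring and method bodies shown here are hypothetical):

```python
# Each client plugin exposes the same interface; MakeMTH5 just delegates.
class FDSN:
    def make_mth5(self, request_df):
        return "mth5 built from FDSN data"

class GeomagClient:
    def make_mth5(self, request_df):
        return "mth5 built from Intermagnet observatory data"

class MakeMTH5:
    # New clients register here and automatically follow the same pattern
    clients = {"fdsn": FDSN, "geomag": GeomagClient}

    def __init__(self, client="fdsn"):
        self.client = self.clients[client]()

    def build(self, request_df=None):
        return self.client.make_mth5(request_df)

print(MakeMTH5("geomag").build())
```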

The factoring of fdsn.py out of make_mth5.py was started on Feb 4, 2022, on the clients branch of mth5. The PR #82 for the clients branch was closed and merged into master on March 27, 2023.

However, your fix to issue #105 was merged into master on August 19, 2022, while the clients branch was still in development. Since clients had not yet been merged into master, you had no way to know that the code had been factored out of make_mth5.py.

It looks like what happened is that, somewhere during this generalization, an older version of the data-wrangling logic wound up in fdsn.py. It also looks like this had already gone pete tong on master by Nov 14, 2022.

kkappler commented 1 year ago

@laura-iris I took a look at the modifications you made to make_mth5 and blended those into fdsn.py, together with some cleanup of my own.

I kept your structure of self._process_list(), which uses a dict keyed on mth5_version to select self._run_010 or self._run_020, and your idea of a self._loop_stations() method. However, _loop_stations now uses my self.wrangle_runs_into_containers() method, whose logic is simpler than the old _loop_stations.
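The version dispatch described above might look roughly like this sketch (the method names come from the discussion; the class body and return values are stand-ins, not the real fdsn.py implementation):

```python
class FDSNSketch:
    """Stand-in showing the dict-based dispatch on mth5_version."""

    def _run_010(self, df):
        return "built v0.1.0 file"

    def _run_020(self, df):
        return "built v0.2.0 file"

    def _process_list(self, df, mth5_version="0.2.0"):
        # A dict keyed on file version selects the handler method
        handlers = {"0.1.0": self._run_010, "0.2.0": self._run_020}
        return handlers[mth5_version](df)

print(FDSNSketch()._process_list(None, mth5_version="0.2.0"))
```

Adding support for a future file version then only requires one new method and one new dict entry.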

I'll commit to fix_issue_157 and test this weekend.

kkappler commented 1 year ago

See mt_metadata issue 151

There is a bug in mt_metadata/mt_metadata/timeseries/stationxml/xml_channel_mt_channel.py causing multiple newly discovered filters to be given the same name.

kkappler commented 1 year ago

Substantially modified the code (but not the ultimate approach) in mth5/clients/fdsn.py.

Replaced the nested loop logic with an alternative approach.

This should be much easier to debug in the future.

The only remaining complexity is the possibility of namespace clashes with network_id. We handle this (as per tronan's original version) by pairing the network with a timestamp.
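A toy sketch of that disambiguation (the network codes, timestamps, and dict layout are invented; the real build_network_dict works on the request metadata):

```python
# Two instances of the same network code in different epochs would clash if
# keyed on code alone, so key on (code, start_time) instead.
networks = [
    {"code": "ZU", "start": "2020-06-01", "station": "CAS04"},
    {"code": "ZU", "start": "2022-01-15", "station": "NVR08"},
]

network_dict = {}
for net in networks:
    key = (net["code"], net["start"])  # the timestamp breaks the name clash
    network_dict[key] = net

print(sorted(network_dict))
```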

@laura-iris Can you take a look at my latest commit and see if it makes sense to you?

kkappler commented 1 year ago

@laura-iris and I discussed this today. The logic reproduces the old code, BUT it still does not generically support wildcards ... if a network code has multiple instances in different time periods, we only receive the zeroth instance.

This should be handled in a later SOW; the handling logic belongs either in the dataframe validation or in the network handling (build_network_dict), where multiple start times for a network (station) would be handled.
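For illustration, a sketch of what keeping every epoch (rather than only the zeroth) might look like; the data here is invented:

```python
from collections import defaultdict

# Repeated network codes with distinct time periods
instances = [
    ("ZU", "2020-06-01"),
    ("ZU", "2022-01-15"),  # second epoch of the same code, currently dropped
    ("EM", "2019-03-10"),
]

# Group every epoch under its code instead of taking instances-by-code[0]
epochs_by_code = defaultdict(list)
for code, start in instances:
    epochs_by_code[code].append(start)

# The proposed fix would iterate over all (code, start) pairs
all_epochs = [(c, s) for c, starts in epochs_by_code.items() for s in starts]
print(len(all_epochs))
```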

kkappler commented 1 year ago

Filters are now correctly included for all channels, but they are not showing up in the second run:

[screenshot]

kkappler commented 1 year ago

@kujaku11 This seems to be because we need an additional operation in fdsn.py

I am not sure where the best place to put it is ... here is a snapshot of the main flow of make_mth5_from_fdsn_client

[screenshot]

In this case, the inventory has only 5 channels (they never changed):

[screenshot]

but the streams imply two runs:

[screenshot]

experiment = translator.xml_to_mt(inv) is unaware that there are multiple runs, because it sees only the inventory:

[screenshot]

So we need to do one of the following:

I think we can take a tack like (B), with a function called repair_missing_filters(m). This would look for missing filters and, if any are found, see if the same channel has filters on an earlier run.
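A toy sketch of that idea, using plain dicts in place of the mth5 objects (repair_missing_filters here only illustrates the proposed behavior, not the actual implementation):

```python
def repair_missing_filters(runs):
    """runs: ordered mapping run_id -> {component: filter_list}.

    For any channel-run with no filters, borrow the filter list from the
    most recent earlier run of the same channel.
    """
    seen = {}  # component -> last non-empty filter list
    for run_id, channels in runs.items():
        for comp, filters in channels.items():
            if filters:
                seen[comp] = filters
            elif comp in seen:
                channels[comp] = list(seen[comp])  # copy from earlier run
    return runs

runs = {
    "a": {"ex": ["dipole", "v_to_counts"], "ey": ["dipole", "v_to_counts"]},
    "b": {"ex": [], "ey": []},  # second run came back with no filters
}
repair_missing_filters(runs)
print(runs["b"]["ex"])
```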

kkappler commented 1 year ago

The following workaround is being tested, which at least allows me to move forward in the analysis of TF values (it is committed in sandbox/mth5_helpers.py on the earthscope_tests branch of aurora):

[screenshot]

kujaku11 commented 1 year ago

@kkappler have a look at commit afc39f0. I added some logic for the case where more runs are found in the data than in the metadata. It will use metadata from the first run and its channels to populate metadata in the other runs.

kkappler commented 1 year ago

@kujaku11 almost, but both the first and second runs are named 001

[screenshot]

Almost certain we just need to pop the "id" key from the metadata dict before overwriting -- working on a fix now.
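The proposed fix is small; a sketch with plain dicts (keys other than "id" are invented for illustration):

```python
# First-run metadata used as the template for later runs
first_run_meta = {"id": "001", "sample_rate": 1.0, "data_logger": "ZEN"}
second_run_meta = {"id": "002"}

# Pop "id" from a copy of the template so each run keeps its own name
template = dict(first_run_meta)
template.pop("id")
second_run_meta.update(template)

print(second_run_meta["id"])  # stays "002" instead of being overwritten
```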

kujaku11 commented 11 months ago

Fixed with update of FDSN