kujaku11 / mth5

Exchangeable and archivable format for magnetotelluric time series to better serve the community through FAIR principles.
https://mth5.readthedocs.io/en/latest/index.html
MIT License

Earthscope Bartington magnetometers Exceptions when building MTH5 #207

Open kkappler opened 6 months ago

kkappler commented 6 months ago

Network: IU Stations: SSPA, WCI, HRV.

There may be multiple issues here. The first problem is that the instrument "type" in the EarthScope metadata is not being recognized by mt_metadata. This is compounded somewhat by inhomogeneous values for sensor.type.

Around line 96 of xml_channel_to_mt_channel.py:

    mt_channel = self._sensor_to_mt(xml_channel.sensor, mt_channel)

xml_channel is an object of type: obspy.core.inventory.channel.Channel

The _sensor_to_mt method is expecting to find something valid in xml_channel.sensor.type.

We could probably work around this with a _triage_sensor_type() function that looks at the other available information. In the SSPA case, xml_channel.sensor.description has value 'Bartington 3-Axis Fluxgate Sensor', so we should be able to deduce the type from that; for WCI and HRV, however, the sensor.type is simply "Bartington".

Basically, when the sensor type is None or not recognized, we should assign it if we can. This has the potential to get pretty messy, since we have no control over how EarthScope (or another client) chooses to assign these labels, but we could add handlers for this on a case-by-case basis.

Maybe

understood_sensor_types = ["magnetometer", "induction coil", "coil", "dipole", "electrode"]
if sensor.type not in understood_sensor_types:
    logger.warning("detailed message")
    logger.info("Will try to deduce sensor type")
    sensor.type = _deduce_sensor_type(sensor)
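As a starting point, a hypothetical _deduce_sensor_type could fall back on the sensor's description and model when sensor.type is missing or unrecognized. A minimal sketch, in which the keyword table, function name, and signature are all assumptions (not the mt_metadata API):

```python
# Sketch of a fallback sensor-type classifier. The keyword map below is an
# assumption built from the examples in this issue ("Bartington",
# "Bartington 3-Axis Fluxgate Sensor"), not part of mt_metadata.
KNOWN_SENSOR_TYPES = ["magnetometer", "induction coil", "coil", "dipole", "electrode"]

# case-insensitive keyword -> canonical sensor type
KEYWORD_MAP = {
    "fluxgate": "magnetometer",
    "bartington": "magnetometer",
    "induction": "induction coil",
    "electrode": "electrode",
    "dipole": "dipole",
}

def deduce_sensor_type(sensor_type, description="", model=""):
    """Return a recognized sensor type, falling back on description/model."""
    if sensor_type and sensor_type.lower() in KNOWN_SENSOR_TYPES:
        return sensor_type.lower()
    # scan all available text fields for a known keyword
    for text in (sensor_type, description, model):
        if not text:
            continue
        for keyword, canonical in KEYWORD_MAP.items():
            if keyword in text.lower():
                return canonical
    return None  # unrecognized; caller logs a warning and leaves it alone
```

This handles both failure modes above: WCI/HRV's bare "Bartington" type and SSPA's None type with a descriptive sensor.description.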

TODO: add example request dataframe and code example

kkappler commented 6 months ago

The issue is in mt_metadata and I started a branch there.

Once the instrument type is being correctly identified, the next problem is that the time metadata is not correct.

This may be a problem at EarthScope.

At least in one of the cases I encountered, around line 109 in xml_channel_to_mt_channel.py, as the keys are looped over:

xml_key "end_date", which maps to mt_key "time_period.end", has value None (the station is presumably continuously monitoring), but since the value is None we do not update the channel_metadata.

xml_key "start_date", which maps to mt_key "time_period.start", has value 2023-05-13T00:00:00.000000Z.

The result is that the end time remains the default 1980-01-01 while the start time is in 2023.
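The skip-on-None behavior can be illustrated with a toy version of the mapping loop (the dict names here are illustrative, not the actual mt_metadata code):

```python
# Minimal sketch of the key-mapping loop described above: None values are
# skipped, so a None "end_date" leaves the 1980 default end time in place.
xml_to_mt = {"start_date": "time_period.start", "end_date": "time_period.end"}

# defaults as in mt_metadata's time_period
channel_metadata = {"time_period.start": "1980-01-01", "time_period.end": "1980-01-01"}

# what the EarthScope StationXML provides in this case
xml_channel = {"start_date": "2023-05-13T00:00:00", "end_date": None}

for xml_key, mt_key in xml_to_mt.items():
    value = xml_channel[xml_key]
    if value is None:  # skipped -> the 1980 default survives
        continue
    channel_metadata[mt_key] = value

# start is now in 2023, end is still the 1980 default: an inverted time period
```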

Later, around line 118 of xml_inventory_to_mt_experiment.py, this conflicting channel metadata is passed to the mt_run.

This later raises an exception in mth5/groups/run.py in the add_channel method. Note that if the data were passed to add_channel we could maybe avoid the error, but the metadata would still be wrong.

The problem happens around line 404, where the data size is estimated:

    estimate_size = (
        int(
            (
                channel_metadata.time_period._end_dt
                - channel_metadata.time_period._start_dt
            )
            * channel_metadata.sample_rate
        ),

The result of this is a negative value for estimate_size, which is not a valid shape for the dataset that gets created.
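The arithmetic failure is easy to reproduce with plain datetimes (using the mt_metadata default 1980-01-01 end time and the 2023 start time from this station):

```python
from datetime import datetime

# Reproduces the failure mode: an untouched default end time far in the past
# combined with a real 2023 start time yields a negative sample count.
start_dt = datetime(2023, 5, 13)   # from "time_period.start"
end_dt = datetime(1980, 1, 1)      # the default "time_period.end", never updated
sample_rate = 1.0                  # Hz

estimate_size = int((end_dt - start_dt).total_seconds() * sample_rate)
# estimate_size is a large negative number, which is not a valid dataset shape
```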

So the root problem is probably that the end time is 1980-01-01, but it is strange that we have never encountered this before.

The solution may be to assign the default time_period end to a date far in the future rather than far in the past.
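A sketch of that guard, assuming a validation step at the point where the time period is read (the function name and sentinel date are assumptions, not existing mth5/mt_metadata code):

```python
from datetime import datetime

# Hypothetical far-future sentinel for "still recording"
FAR_FUTURE = datetime(2200, 1, 1)

def validated_time_period(start_dt, end_dt):
    """If the end time predates the start time (e.g. an untouched 1980
    default next to a real 2023 start), replace it with a far-future
    sentinel so downstream size estimates stay non-negative."""
    if end_dt <= start_dt:
        return start_dt, FAR_FUTURE
    return start_dt, end_dt
```

With this guard, the estimate_size computation in run.py would at worst over-allocate rather than produce an invalid negative shape.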

kkappler commented 6 months ago

After modifying mt_metadata's xml_channel_mt_channel.py, the non-uniform descriptions of the Bartington are handled OK, at least for network=IU, station=SSPA and network=N4, station=E62A.

IU-SSPA also builds and calibrates (although the DC-levels are removed).

N4-E62A does not build -- the error is in the stream parsing. In this example I requested data from April 8, 2024, for all three components and got these streams:

N4.E62A.40.LF1 | 2024-04-08T00:00:00.000000Z - 2024-04-08T10:59:17.000000Z | 1.0 Hz, 39558 samples
N4.E62A.40.LF1 | 2024-04-08T10:59:23.000000Z - 2024-04-08T18:51:59.000000Z | 1.0 Hz, 28357 samples
N4.E62A.40.LF1 | 2024-04-08T18:59:43.000000Z - 2024-04-09T00:00:00.000000Z | 1.0 Hz, 18018 samples
N4.E62A.40.LF2 | 2024-04-08T00:00:00.000000Z - 2024-04-08T10:47:48.000000Z | 1.0 Hz, 38869 samples
N4.E62A.40.LF2 | 2024-04-08T10:59:23.000000Z - 2024-04-08T18:51:26.000000Z | 1.0 Hz, 28324 samples
N4.E62A.40.LF2 | 2024-04-08T18:59:43.000000Z - 2024-04-09T00:00:00.000000Z | 1.0 Hz, 18018 samples
N4.E62A.40.LFZ | 2024-04-08T00:00:00.000001Z - 2024-04-08T10:55:56.000001Z | 1.0 Hz, 39357 samples
N4.E62A.40.LFZ | 2024-04-08T10:59:23.000000Z - 2024-04-08T18:53:33.000000Z | 1.0 Hz, 28451 samples
N4.E62A.40.LFZ | 2024-04-08T18:59:43.000000Z - 2024-04-09T00:00:00.000000Z | 1.0 Hz, 18018 samples

These start and end times show two gaps in LF1 (about 6 s near 11:00 and about 8 min near 19:00), which may be a problem, but a more serious issue is that the start time for LFZ is a microsecond later than it is for LF1 and LF2.

A general fix for this seems like it may be tricky.

Obviously, the solution in this case is to round the 2024-04-08T00:00:00.000001Z start time to 2024-04-08T00:00:00.000000Z, but how do we do this programmatically?

I suppose we could try to bucket the times that are within a few microseconds of each other ... but that feels hacky. We could, for example,

start_stamps = np.array([pd.Timestamp(x) for x in start_times])
np.diff(start_stamps)
Out[13]: 
array([Timedelta('0 days 00:00:00.000001'),
       Timedelta('0 days 10:59:22.999999'), Timedelta('0 days 08:00:20')],
      dtype=object)

This shows that two of the time stamps are very close together (not expected). I suppose if the difference between the time stamps is less than a sample interval we can drop one of them, but which one? Keeping the one that is a "rounder" number also feels pretty hacky ... but that might be the way to do it for now.
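One less hacky variant of the "rounder number" idea: since the sample interval is known (1 s here), round every start time to the nearest sample interval and then drop exact duplicates. A sketch under that assumption (the function name is hypothetical):

```python
import pandas as pd

def align_start_times(times, sample_interval="1s"):
    """Round each start time to the nearest sample interval, which snaps a
    stray ...000001Z value onto ...000000Z; sub-sample offsets smaller than
    half an interval disappear, and duplicates can then be dropped exactly."""
    rounded = [pd.Timestamp(t).round(sample_interval) for t in times]
    # drop exact duplicates while preserving order
    return list(dict.fromkeys(rounded))
```

This sidesteps the "which one to drop" question: after rounding, the near-coincident LF1/LF2/LFZ starts become identical, so dropping duplicates is unambiguous.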

So, to do that, we would iterate over adjacent pairs of start and end times and drop one of any pair that is very close together.

But things can get worse still ... note that the gaps do not line up across channels, as if the digitizer were not sampling the channels synchronously. We have never encountered this case before.

LF1's first stream is 10h59m17s in duration, but LF2's is 10h47m48s, and LFZ's is 10h55m56s.
To handle this we may need to introduce NaNs, or drop the non-synchronized samples entirely; neither is a nice option. In either case, we would probably need to modify the stream objects, either by packing NaNs into them or by dropping the samples that are not common to all channels.
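The "drop non-common samples" option can be sketched with pandas index intersection, assuming each channel has been converted to a Series indexed by timestamp (this is an illustration, not mth5's internal representation):

```python
import pandas as pd

def common_time_index(channel_series):
    """Keep only samples whose timestamps are present on every channel;
    non-synchronized samples are dropped. The alternative (not shown) is
    reindexing onto the union of the indices and filling gaps with NaN."""
    common = channel_series[0].index
    for s in channel_series[1:]:
        common = common.intersection(s.index)
    return [s.loc[common] for s in channel_series]

# toy example: second channel is missing one sample
idx1 = pd.date_range("2024-04-08", periods=5, freq="1s")
idx2 = idx1[[0, 1, 3, 4]]
s1 = pd.Series(range(5), index=idx1)
s2 = pd.Series(range(4), index=idx2)
a, b = common_time_index([s1, s2])
# a and b now share the same 4-sample index
```

Note this assumes the microsecond offsets have already been snapped to the sample grid; otherwise nominally simultaneous samples would never intersect.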

For the record here is the link to the station XML

kkappler commented 5 months ago

TODO: These incoherent streams should be investigated to determine whether the inconsistencies in the timestamps come directly from the EarthScope source or are introduced by obspy.