simpeg / aurora

software for processing natural source electromagnetic data
MIT License

Merging Runs in Time Domain #152

Open kkappler opened 2 years ago

kkappler commented 2 years ago

Issue #80 proposes a solution to merging runs that is done in the frequency domain. This is a simple and fairly general solution, but it does not allow us to take advantage of potentially longer-period data that could become available if some small gaps were filled. This relates to issue #66.

In general that would require time series processing and the merging of runs in Time Domain. This would probably require a new class MergedRunTS or something like that.

Time Domain Run Merging can be done according to one of two schemes:

  1. Nan-fill (or effective Nan-fill)
  2. Interpolate / replace with numeric data

In all cases, merging runs implies the existence of a gap. The gap will either hold numbers or nans in the time series array, or be designated by an undefined chunk, i.e. a discontinuity in the array.

Nan-fill is a nice, simple solution, that allows generically for numeric overwrite, without a structural modification.
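A minimal sketch of the Nan-fill idea, using plain NumPy arrays rather than the actual run objects. The function name `merge_runs_nan_fill`, its arguments, and the convention that start times are given in seconds are all assumptions made for illustration; nothing here is an existing aurora or MTH5 API.

```python
import numpy as np


def merge_runs_nan_fill(run_a, start_a, run_b, start_b, sample_rate):
    """Place two runs on one regular time axis, with NaN in the gap.

    start_a and start_b are the times (in seconds) of the first sample
    of each run. Hypothetical helper: assumes run_b starts after run_a ends.
    """
    # Index of run_b's first sample, counted from the start of run_a
    offset_b = int(round((start_b - start_a) * sample_rate))
    if offset_b < len(run_a):
        raise ValueError("runs overlap; this sketch only handles gaps")

    # Allocate the merged array as all-NaN, then overwrite with the data
    merged = np.full(offset_b + len(run_b), np.nan)
    merged[: len(run_a)] = run_a
    merged[offset_b:] = run_b
    return merged


run_a = np.array([1.0, 2.0, 3.0])
run_b = np.array([7.0, 8.0])
merged = merge_runs_nan_fill(run_a, 0.0, run_b, 6.0, sample_rate=1.0)
```

Because the gap is only marked with NaN, any later numeric overwrite (interpolation, median fill) is a plain assignment into `merged` and needs no structural change.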

Two things can go wrong with Nan-fill:

A. The gap could be very large. We may then generate an absurdly long time series and possibly cause RAM problems. That could be solved by reading from the MTH5 on an as-needed basis, effectively chunking from one filesystem to another: open a "receiver of FCs" h5 and then read --> process --> write until the job is done.

B. Nan in the time series can cause issues with anti-alias filters (during decimation) or elsewhere in the time series processing. Standard workarounds for this involve replacing the gap (with zeros or an estimate), processing, and then assigning nan to data in STFT-land where there were gaps in TS-land, since FC processing is robust to Nan.
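The workaround in B can be sketched as follows. This is a toy windowed FFT, not aurora's STFT: there is no taper, no overlap, and no decimation, and the name `windowed_fcs_with_gap_mask` is hypothetical. It only illustrates the bookkeeping: zero-fill the gap, transform, then set every window that touched the gap back to NaN.

```python
import numpy as np


def windowed_fcs_with_gap_mask(ts, window_len, step):
    """Zero-fill NaNs, FFT each window, then NaN out windows touching a gap."""
    gap = np.isnan(ts)                    # remember where the gap was
    filled = np.where(gap, 0.0, ts)       # numeric stand-in for processing

    n_windows = 1 + (len(ts) - window_len) // step
    fcs = np.empty((n_windows, window_len // 2 + 1), dtype=complex)
    for i in range(n_windows):
        sl = slice(i * step, i * step + window_len)
        fcs[i] = np.fft.rfft(filled[sl])
        if gap[sl].any():
            # This window saw filled-in samples: discard its FCs
            fcs[i] = np.nan
    return fcs


ts = np.arange(16, dtype=float)
ts[6:8] = np.nan                          # a two-sample gap
fcs = windowed_fcs_with_gap_mask(ts, window_len=4, step=4)
```

In a real pipeline any filter applied before the transform smears the fill values beyond the gap itself, so the set of windows to discard would likely need to be widened by the filter length.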

kujaku11 commented 2 years ago

@kkappler I've tried both ways before and had better estimates when the gap was filled with the median value of both sides of the gap. The FCs from the gap are usually tossed in the robust processing, and it just seems like easier bookkeeping.
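One possible reading of the median fill, as a sketch: take a few samples on each side of the gap and use their combined median as the fill value. The helper name `fill_gap_with_median` and the `n_side` parameter (how many neighbouring samples to use per side) are assumptions; the comment above does not say how many samples were used.

```python
import numpy as np


def fill_gap_with_median(ts, gap_start, gap_stop, n_side=10):
    """Return a copy of ts with ts[gap_start:gap_stop] set to the median
    of up to n_side samples on each side of the gap."""
    before = ts[max(0, gap_start - n_side):gap_start]
    after = ts[gap_stop:gap_stop + n_side]
    neighbours = np.concatenate([before, after])

    filled = ts.copy()
    # nanmedian ignores any stray NaNs among the neighbouring samples
    filled[gap_start:gap_stop] = np.nanmedian(neighbours)
    return filled


ts = np.array([1.0, 2.0, 3.0, np.nan, np.nan, 5.0, 6.0, 7.0])
filled = fill_gap_with_median(ts, gap_start=3, gap_stop=5, n_side=3)
```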

But the program I was using didn't have nan support, so maybe that could be as simple as using a masked array? That could also be useful down the road, when the user is able to mask bad data from the time series viewer.
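A small sketch of the masked-array idea with `numpy.ma`. The sample values and the choice of which sample the user flags are made up for illustration.

```python
import numpy as np

ts = np.array([1.0, 2.0, np.nan, np.nan, 5.0, 6.0])

# Mask the gap automatically: every NaN (or inf) becomes a masked sample
masked = np.ma.masked_invalid(ts)

# Mask an additional sample by hand, e.g. one flagged as bad by a user
masked[4] = np.ma.masked

n_valid = masked.count()          # number of unmasked samples
mean_valid = masked.mean()        # statistics skip masked samples
zero_filled = masked.filled(0.0)  # plain ndarray for code without NaN support
```

The same mask then serves both purposes: gaps from merging and samples rejected by hand are tracked in one place, and `filled` produces a numeric array for routines that cannot handle NaN.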

Xarray natively has fill methods (e.g. `fillna`, or `reindex` with a `fill_value`) and gives you the choice of nan or some other value.

Suggest having a variable for the maximum gap length to support, like 20 seconds or something related to the sample rate or number of samples to minimize absurd padding.
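A sketch of such a guard, assuming the gap is known as a count of missing samples. The function name `gap_is_fillable` and the parameter `max_gap_seconds` are hypothetical; the 20 second default is just the figure floated above.

```python
def gap_is_fillable(n_missing_samples, sample_rate, max_gap_seconds=20.0):
    """Return True if the gap is short enough to pad or fill.

    sample_rate is in samples per second, so the same limit in seconds
    allows more missing samples at higher sample rates.
    """
    gap_seconds = n_missing_samples / sample_rate
    return gap_seconds <= max_gap_seconds


short_gap_ok = gap_is_fillable(10, sample_rate=1.0)
long_gap_ok = gap_is_fillable(100, sample_rate=1.0)
fast_rate_ok = gap_is_fillable(100, sample_rate=8.0)
```

Runs separated by a gap that fails this check would be left unmerged and handled by the frequency-domain approach instead.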