lheagy opened this issue 7 years ago
A general workflow would look something like this:
That's the general idea; there are areas to improve and different things to try out.
Thanks @kujaku11. In terms of workflow, how iterative of a process would this typically be? Do you often iterate on a single step, or do a few steps of time series processing and then perhaps go back and remove a few more outliers?
Here are a couple of thoughts based on the ppt and notes above. There are 3 main classes:

`TimeSeries` --> `Spectral` --> `TransferFunctions`

Each is instantiated with a dataset from the previous (e.g. `spectral = Spectral(time_series)`).
Here are a couple of thoughts on what a `TimeSeries` class could look like (very much a sketch at this point!).
A couple of key points about the sketch below:

- logging the steps that are taken (including saving tuning parameters, etc.). This would allow you to set up a processing workflow on a subset of data, write out the steps, and then re-use / apply that same workflow on the whole data set.

There is likely a better way of doing this than what I have sketched below, but hopefully it gives you a feel for the idea.

```python
class TimeSeries(BaseTimeSeries):  # top level; this might be processing.NSEM.TimeSeries

    def __init__(self, hx=None, hy=None, hz=None, ex=None, ey=None, **kwargs):
        # if large, we can be smart about writing this to disk
        self._source_data = utils.create_data_frame(hx, hy, hz, ex, ey)

    def __getitem__(self, key):
        # if you call self['ex'], this looks at the current data
        return self.current_data[key]

    @property
    def source_data(self):
        """
        Original data set. Remains un-touched.
        """
        return self._source_data

    @property
    def current_data(self):
        if getattr(self, '_current_data', None) is None:
            # if we haven't taken any processing steps yet, this is just the source data
            self._current_data = self._source_data
        return self._current_data

    @property
    def processing_log(self):
        """
        This returns an ordered dict (or similar - needs to be serializable)
        of the processing steps taken, including their args, kwargs
        """
        if getattr(self, '_processing_log', None) is None:
            self._processing_log = {}
        return self._processing_log

    def register_processing_step(self, name, params):
        # record the step name and its tuning parameters in the log
        self.processing_log[name] = params

    def remove_pipeline_noise(self, period):
        self._current_data = utils.remove_pipeline_noise(self.current_data, period)
        self.register_processing_step('remove_pipeline_noise', {'period': period})
        return self.current_data
```
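For example, the re-use idea could look something like the following hedged sketch, where the arrays are placeholders and `remove_pipeline_noise` is the only step the class above actually defines:

```python
# hypothetical sketch: tune the workflow on a subset, then replay it on the full data set
subset = TimeSeries(hx=hx[:10000], hy=hy[:10000], ex=ex[:10000], ey=ey[:10000])
subset.remove_pipeline_noise(period=30.0)

full = TimeSeries(hx=hx, hy=hy, ex=ex, ey=ey)
for step, params in subset.processing_log.items():
    getattr(full, step)(**params)   # replay each logged step with the same parameters
```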
Good outline of the time series data @lheagy.
Probably should also add a `save_data()` function so you can reuse filtered data.
The logging is a good idea. What about something like

```python
return_dict = {'step_01': processing_step_1, 'step_02': processing_step_2}
```

where each processing step is the code snippet that produced that step; not sure how you do that, but it would be handy. You could then have a function that writes those steps out to a script file. The return dictionary could also be its own object with methods like `write_script_file`, `get_step`, etc.
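A hedged sketch of that idea, assuming the log maps each step's method name to its keyword arguments (as in the `TimeSeries` sketch above) rather than storing literal code snippets; `write_script_file` here is a hypothetical helper, not an agreed API:

```python
# hypothetical helper: turn a logged workflow into a re-runnable script
def write_script_file(processing_log, path, data_var="ts"):
    lines = ["# auto-generated processing steps"]
    for step, params in processing_log.items():
        kwargs = ", ".join(f"{k}={v!r}" for k, v in params.items())
        lines.append(f"{data_var}.{step}({kwargs})")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

# e.g. write_script_file(ts.processing_log, "reprocess_station.py")
```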
As for iterative processing on the time series, this is usually a one-off, or trial and error. Most of the time, if you do a pretty good job of removing the obvious noise in the time domain, the frequency domain processing will take care of the rest; that's where most of the hard work is done. Time series processing is usually minimal, unless you have noise like pipeline noise, where there is an obvious periodic structure.
Also, for keeping track of units: https://pint.readthedocs.io/en/0.7.2/
pint and pandas do not mix well.
If you want to use pandas and pint, then maybe the 'read-in' function should convert everything to SI units.
Ah, thanks @ahartikainen, I haven't played around with them together before. In this case, I don't think we will be making heavy use of pint; in a lot of ways, I see it being used more as a property on a class so we can do validation (not necessarily deeply integrating it by attaching it to values or embedding it in a data frame). Agreed, though, that getting the data into SI units on construction of a data class is a good approach.
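For the read-in conversion, a minimal sketch with pint could look like this (the channel name and input units below are assumptions for illustration, not something defined in this thread):

```python
import numpy as np
import pint

ureg = pint.UnitRegistry()

def to_si(values, unit_str):
    """Convert an array recorded in `unit_str` to SI base units."""
    return (np.asarray(values) * ureg(unit_str)).to_base_units().magnitude

# e.g. electric field channels recorded in mV/km become V/m on construction
ex_si = to_si(ex_mv_per_km, "millivolt / kilometer")  # ex_mv_per_km is a placeholder array
```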
Here are a couple of points that came up in the meeting today (see: https://www.youtube.com/watch?v=-X4HbcedBUY). @sgkang and @kujaku11, please correct me where I am off-base, and feel free to add!
- data & processing steps should be independent classes with a well-defined way to interface between them. In this way, reading data in, handling large data sets, plotting, etc. is handled by the data class, and none of that complexity needs to be introduced in the processing steps
- processing steps are small, lightweight classes, much like mappings in SimPEG (http://docs.simpeg.xyz/content/api_core/api_Maps.html). They can be composed (e.g. `remove_pipeline_noise * notch_filter`) and serialized. All of the tuning parameters are properties and can be serialized
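As a rough illustration of that idea (the class names below are hypothetical, not an agreed API), the composition and serialization could look something like:

```python
import scipy.signal


class ProcessingStep:
    """Hypothetical lightweight step; tuning parameters live on the instance."""

    def apply(self, data):
        raise NotImplementedError

    def __mul__(self, other):
        # compose steps: (self * other) applies `other` first, then `self`,
        # mirroring how SimPEG maps compose
        return ComposedStep(self, other)

    def serialize(self):
        # tuning parameters are plain attributes, so they serialize directly
        return {"class": type(self).__name__, "params": dict(self.__dict__)}


class ComposedStep(ProcessingStep):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def apply(self, data):
        return self.first.apply(self.second.apply(data))

    def serialize(self):
        return {"composed": [self.first.serialize(), self.second.serialize()]}


class NotchFilter(ProcessingStep):
    def __init__(self, freq, fs, quality=30.0):
        self.freq, self.fs, self.quality = freq, fs, quality

    def apply(self, data):
        # standard IIR notch around self.freq, applied forward-backward
        b, a = scipy.signal.iirnotch(self.freq, self.quality, fs=self.fs)
        return scipy.signal.filtfilt(b, a, data)
```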
Thanks, @lheagy, @kujaku11 and @ahartikainen for your comments.
I want to add a couple of things to be considered for the design of the library. I like the direction of having things object oriented and leveraging tools in Python to deal with the large amount of data that the time series can be. I also agree with the discussion from today's meeting on separating the processing steps into a SimPEG.Maps kind of structure; I think that would be a good way of sharing processing steps between different stations/locations. It also makes it straightforward to build a custom filter out of multiple predefined filters.
My suggestions on the classes in the module:

- Building up the code having a single `TimeSeries` class as the base element that:
    - stores the data (a `Storage` class? Could have different backends)
    - `TimeSeriesMaps` for the time series filters like powerline (DC/AC), pipe, de-spike, etc. filters
- `Spectral` classes connect with a `TimeSeries` object to calculate the Fourier coefficients:
    - link to the `TimeSeries` object the coefficients are obtained from
    - `SpectralMaps` for spectral filtering
- `TransferFunctions` objects calculate the MT data:
    - `Spectral` and `TimeSeries` objects to use for the calculations
    - use the `properties` module to check the type for the given transfers before calculation of the data
    - maybe build some of the weighting and coefficient removal for the transfer calculation as `TransferFunctionsMaps`?
- Having a `Batch` class/extension where a collection `TimeSeries` <--> `Spectral` <--> `Transfer` can be processed in the same way
A bit of my reasoning for having a single time series as the base: it makes the library much more versatile. As I understand the procedure, until the calculation of the transfer functions, most of the calculations/operations are done individually on each component. So for performance, designing the library this way should make parallelization easier, since we would be dealing with a list of base objects that can be sent out individually for the number crunching. Designing the filters to be SimPEG.Maps-like would make it relatively simple and transparent to share filter steps between time series recorded at the same location. As well, with most data I have worked with, there are several (some non-continuous) time series of different sampling rates for each component, measured simultaneously (for example, ~1000 Hz sampling for 1 min every hour; 100 Hz samples for 10 min every hour; 10 Hz samples for the whole time). Different frequency ranges are then obtained from the different series, so being able to accommodate these types of time series structures would be important.
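Purely as a sketch of that reasoning, with hypothetical filter classes and per-component objects in the spirit of the Maps-like steps above, the number crunching could be farmed out like this:

```python
from concurrent.futures import ProcessPoolExecutor

# one filter chain shared by every component/station (Maps-like composition);
# RemovePipelineNoise and NotchFilter are hypothetical ProcessingStep classes
filter_chain = RemovePipelineNoise(period=30.0) * NotchFilter(freq=60.0, fs=1000.0)

def process(component_data):
    # each single-component series is processed independently
    return filter_chain.apply(component_data)

# component_series would be the list of single-component TimeSeries objects
with ProcessPoolExecutor() as pool:
    cleaned = list(pool.map(process, [ts.current_data for ts in component_series]))
```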
As has been discussed, bookkeeping of the procedure (both what is done and where the transfers are calculated from with respect to the spectral and time series) is important. Being able to identify where in the time series an outlier comes from would be an effective tool to have.
Using `pandas`/`dask` data frames sounds like a great avenue for managing the time, spectral, and transfer series, making the linking between the different objects effective.
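As a small illustration of that avenue (the file pattern, column name, and threshold below are placeholders), a long recording could be handled lazily with dask so it never has to fit in memory:

```python
import dask.dataframe as dd

# lazy, chunked read of a long recording split across many files (hypothetical paths)
ts = dd.read_csv("station01_hx_*.csv")

# a simple threshold de-spike, just to show that the filtering stays out-of-core
clean = ts[ts["hx"].abs() < 1.0e4]

clean.to_parquet("station01_hx_clean")  # write the filtered series back to disk
```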
There are 3 main modules we are considering at the moment: Time Series, Spectral, and Transfer Functions. We will focus on Natural Source EM data to start. Below, I have added notes from the conversation and Google doc with @kujaku11, @grosenkj, @rowanc1, @sgkang, @dougoldenburg. In addition, it would be helpful to outline a sketch of pseudo-code: what should a workflow look like?
Time Series
Spectral Analysis
Transfer Functions
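As one possible starting point for that pseudo-code (every class and method name below is an assumption for illustration, not an agreed API), a workflow spanning the three modules might read:

```python
# hypothetical end-to-end workflow: time series -> spectral -> transfer functions
ts = TimeSeries(hx=hx, hy=hy, hz=hz, ex=ex, ey=ey)   # raw field recordings
ts.remove_pipeline_noise(period=30.0)                 # time-domain cleanup

spectra = Spectral(ts)             # Fourier coefficients from the cleaned series
spectra.remove_outliers()          # frequency-domain editing (hypothetical step)

tfs = TransferFunctions(spectra)   # estimate the MT transfer functions
tfs.save("station01_transfer_functions")   # hypothetical export step
```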