simpeg / processing

software for time-series analysis, spectral analysis and transfer function computation from geophysical data
MIT License

Sketch of the library #1

Open lheagy opened 7 years ago

lheagy commented 7 years ago

There are 3 main modules we are considering at the moment: Time Series, Spectral, and Transfer Functions. We will focus on Natural Source EM data to start. Below I have added notes from the conversation and the Google doc with @kujaku11, @grosenkj, @rowanc1, @sgkang, @dougoldenburg. In addition, it would be helpful to outline a sketch of pseudo-code: what should a workflow look like?

Time Series

Spectral Analysis

Transfer Functions

kujaku11 commented 7 years ago

A general workflow would look something like this:

That's the general idea; there are areas to improve and different things to try out.
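
A minimal pseudo-code sketch of such a workflow (all class and method names here are hypothetical placeholders, not an agreed API) might be:

time_series = TimeSeries(hx=hx, hy=hy, hz=hz, ex=ex, ey=ey)  # raw field data
time_series.remove_outliers()                                # clean obvious noise in the time domain
time_series.remove_pipeline_noise(period=60.0)

spectral = Spectral(time_series)                             # window, taper and FFT the cleaned series
spectral.compute_spectra(window='hamming', samples_per_segment=4096)

transfer = TransferFunctions(spectral)                       # robust regression for the transfer functions
transfer.estimate()
transfer.save('station_01_tf')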

lheagy commented 7 years ago

Thanks @kujaku11. In terms of workflow, how iterative a process would this typically be? Do you often iterate on a single step, or do a few steps of time series processing and then perhaps go back and remove a few more outliers?

lheagy commented 7 years ago

Here are a couple of thoughts based on the ppt and notes above. There are 3 main classes: TimeSeries --> Spectral --> TransferFunctions. Each is instantiated with a dataset from the previous (e.g. spectral = Spectral(time_series)).

Here are a couple of thoughts on what a TimeSeries class could look like (very much a sketch at this point!)

A couple key points about the below:

class TimeSeries(BaseTimeSeries):  # top level this might be processing.NSEM.TimeSeries

    def __init__(self, hx=None, hy=None, hz=None, ex=None, ey=None, **kwargs):

        self._source_data = utils.create_data_frame(hx, hy, hz, ex, ey)  # if large, we can be smart about writing this to disk

    def __getitem__(self, key):
        return self.current_data[key]  # e.g. self['ex'] returns the 'ex' column of the current data

    @property
    def source_data(self):
        """
        Original data set; remains untouched.
        """
        return self._source_data

    @property
    def current_data(self):
        if getattr(self, '_current_data', None) is None:
            self._current_data = self._source_data  # if we haven't taken any processing steps yet, this is just the source data
        return self._current_data

    @property
    def processing_log(self):
        """
        This returns an ordered dict (or similar - needs to be serializable)
        of the processing steps taken, including their args, kwargs
        """
        if getattr(self, '_processing_log', None) is None:
            self._processing_log = {}
        return self._processing_log

    def register_processing_step(self, name, kwargs):
        """
        Record a processing step and its arguments in the processing log
        so the workflow can be serialized and replayed later.
        """
        step = 'step_{:02d}'.format(len(self.processing_log) + 1)
        self.processing_log[step] = {'name': name, 'kwargs': kwargs}

    def remove_pipeline_noise(self, period):
        self._current_data = utils.remove_pipeline_noise(self.current_data, period)
        self.register_processing_step('remove_pipeline_noise', {'period': period})
        return self.current_data
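
For what it's worth, usage of the sketch above might then look something like this (hx, hy, etc. are assumed to be arrays already read from file):

ts = TimeSeries(hx=hx, hy=hy, hz=hz, ex=ex, ey=ey)
ts.remove_pipeline_noise(period=60.0)  # runs the filter and records it in the log
ts['ex']                               # the current (processed) electric field channel
ts.processing_log                      # e.g. {'step_01': {'name': 'remove_pipeline_noise', 'kwargs': {'period': 60.0}}}
ts.source_data                         # the original data, still untouched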

A couple other comments

kujaku11 commented 7 years ago

Good outline of the time series data @lheagy.

Probably should also add a save_data() function so you can reuse filtered data.
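
Something minimal along those lines could work, e.g. a method on the TimeSeries class (the file format here is just a placeholder):

    def save_data(self, filename):
        # write the current (filtered) data frame to disk so it can be
        # reloaded later without re-running the time-domain processing
        self.current_data.to_csv(filename)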

The logging is a good idea. What about something like return_dict = {'step_01': processing step 1, 'step_02': processing step 2},

where each processing step is the code snippet that produced that step? Not sure how you do that, but it would be handy. You could then have a function that writes those steps out to a script file. The returned dictionary could also be its own object with methods like write_script_file, get_step, etc.
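
A rough sketch of what such a log object could look like (the names write_script_file, get_step, etc. are just suggestions, not an agreed API):

class ProcessingLog(object):
    """Ordered, serializable record of processing steps."""

    def __init__(self):
        self.steps = {}  # e.g. {'step_01': {'name': ..., 'kwargs': ...}}

    def add_step(self, name, kwargs):
        key = 'step_{:02d}'.format(len(self.steps) + 1)
        self.steps[key] = {'name': name, 'kwargs': kwargs}

    def get_step(self, key):
        return self.steps[key]

    def write_script_file(self, filename):
        # write each logged step back out as a line of python that reproduces it
        with open(filename, 'w') as f:
            for key in sorted(self.steps):
                step = self.steps[key]
                args = ', '.join('{}={!r}'.format(k, v) for k, v in step['kwargs'].items())
                f.write('ts.{}({})\n'.format(step['name'], args))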

As for iterative processing on the time series, this is usually a one-off, or trial and error. Most of the time, if you do a pretty good job of removing the obvious noise in the time domain, the frequency-domain processing will take care of the rest; that's where most of the hard work is done. Time series processing is usually minimal, unless you have noise like pipeline noise with an obvious periodic structure.

lheagy commented 7 years ago

Also, for keeping track of units: https://pint.readthedocs.io/en/0.7.2/

ahartikainen commented 7 years ago

pint and pandas do not mix well.

If you want to use pandas and pint, then maybe the 'read-in' function should convert everything to SI units.
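
That could be as simple as converting at read time and keeping only plain floats inside pandas, e.g. (the read_in helper and unit strings here are illustrative):

import pint
import pandas as pd

ureg = pint.UnitRegistry()

def read_in(values, units):
    # convert whatever units the instrument reports to SI base units and
    # return a plain float Series (no pint objects inside the data frame)
    quantity = ureg.Quantity(values, units).to_base_units()
    return pd.Series(quantity.magnitude)

ex = read_in([1.2, 0.9, 1.1], 'millivolt / kilometer')  # stored as V/m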

lheagy commented 7 years ago

Ah, thanks @ahartikainen, I haven't played around with them together before. In this case, I don't think we will be making heavy use of pint; in a lot of ways, I see it being used more as a property on a class so we can do validation (not necessarily deeply integrating by attaching it to values or embedding it in a data frame). Agreed, though, that getting the data into SI units on construction of a data class is a good approach.
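
As a rough illustration of that idea (hypothetical class, just to show the intent):

import pint

ureg = pint.UnitRegistry()

class ElectricChannel(object):
    """Sketch: store plain SI floats; use pint only to validate the declared
    units and convert to SI on construction."""

    expected_units = 'volt / meter'

    def __init__(self, values, units):
        # raises pint.DimensionalityError if `units` is not an electric field
        quantity = ureg.Quantity(values, units).to(self.expected_units)
        self.values = quantity.magnitude
        self.units = self.expected_units

ElectricChannel([1.2, 0.9], 'millivolt / kilometer')  # ok, converted to V/m
ElectricChannel([1.2, 0.9], 'nanotesla')              # raises DimensionalityError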

lheagy commented 7 years ago

Here are a couple of points that came up in the meeting today (see: https://www.youtube.com/watch?v=-X4HbcedBUY). @sgkang and @kujaku11, please correct me where I am off-base, and feel free to add!

grosenkj commented 7 years ago

Thanks, @lheagy, @kujaku11 and @ahartikainen for your comments.

I want to add a couple of things to be considered for the design of the library. I like the direction of making things object oriented and leveraging tools in Python to deal with how large the time series data can be. I also agree with the discussion from today's meeting on separating the processing steps into a SimPEG.Maps kind of structure; I think that would be a good way of sharing processing steps between different stations/locations. It also makes it straightforward to build a custom filter out of multiple predefined filters.
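
A rough sketch of what a Maps-like filter structure could look like (the class names and the use of * for composition are just illustrative, not SimPEG's actual API):

class BaseFilter(object):
    """A composable processing step applied to one time series."""

    def apply(self, data):
        raise NotImplementedError

    def __mul__(self, other):
        # compose filters, similar in spirit to chaining SimPEG Maps
        return FilterChain([self, other])

class FilterChain(BaseFilter):
    def __init__(self, filters):
        self.filters = filters

    def apply(self, data):
        for f in self.filters:
            data = f.apply(data)
        return data

class Detrend(BaseFilter):
    def apply(self, data):
        return data - data.mean()

class NotchFilter(BaseFilter):
    def __init__(self, frequency):
        self.frequency = frequency

    def apply(self, data):
        # placeholder for e.g. a scipy.signal-based notch filter
        return data

# the same chain can then be applied to every station / component
chain = Detrend() * NotchFilter(frequency=60.0)
cleaned_ex = chain.apply(ex)  # ex: a single-channel array or Series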

My suggestions on the classes in the module:

A bit of my reasoning: having a single time series (one component) as the base object makes the library much more versatile. As I understand the procedure, until the calculation of the transfer functions most of the calculations/operations are done individually on each component. So, for performance, designing the library that way should make parallelization easier, since a list of base objects can be sent out individually for the number crunching. Designing the filters to be SimPEG.Maps-like would make it relatively simple and transparent to share filter steps between time series recorded at the same location. As well, with most data I have worked with there are several (some non-continuous) time series of different sampling rates for each component, measured simultaneously (for example ~1000 Hz for 1 min every hour; 100 Hz for 10 min every hour; 10 Hz for the whole recording). Different frequency ranges are then obtained from the different series, so being able to accommodate these types of time series structures would be important.
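
To illustrate the single-component idea, the base object could carry its own sampling rate and time window so several non-continuous series per station can coexist (again, purely hypothetical):

class ChannelTimeSeries(object):
    """One component, one sampling rate, one (possibly short) recording window."""

    def __init__(self, data, sample_rate, start_time, component):
        self.data = data                # 1D array of samples
        self.sample_rate = sample_rate  # Hz
        self.start_time = start_time
        self.component = component      # e.g. 'ex', 'hy'

# a station might then hold e.g. 1000 Hz bursts, 100 Hz bursts and a continuous
# 10 Hz recording for each component, each processed independently (and in
# parallel) before the transfer function stage combines them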

As has been discussed, bookkeeping of the procedure (both what is done and where the transfer functions are calculated from with respect to the spectral and time series data) is important. Being able to identify where in the time series an outlier comes from would be an effective tool to have.

Using pandas/dask data frames sounds like a great avenue for managing the time, spectral and transfer function series, making linking between the different objects effective.
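
For the very large recordings, a quick illustration of how dask could slot in where pandas is used now (the file pattern is just an example):

import dask.dataframe as dd

# lazily read a long recording split across several files; nothing is pulled
# into memory until .compute() is called
ts = dd.read_csv('station_01_timeseries_*.csv')
mean_ex = ts['ex'].mean().compute()  # out-of-core column statistics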