tum-esm / em27-retrieval-pipeline

automated EM27 data processing
https://em27-retrieval-pipeline.netlify.app
GNU General Public License v3.0

Implement filtering and resampling function #42

Closed dostuffthatmatters closed 1 year ago

dostuffthatmatters commented 1 year ago

Our retrieval produces data like in the image below. For this data to be easily used by other models, we have to post-process it. A good example of the curves resulting from post-processing can be seen here. The raw data points are smoothed with a function similar to a rolling average and the output data is generated at a fixed rate, i.e. one output data point exactly every x seconds.

We need a new post-processing algorithm since the current one is way too complicated and error-prone. The goal is a function that takes a pandas data frame with pairs of raw time and concentration values, plus the rate at which to resample the smoothed data, and returns a data frame in the same format:

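from typing import Literal

import pandas


def apply_post_processing(
    df: pandas.DataFrame,
    resampling_rate: Literal[
        '10 min', '5 min', '2 min', '1 min', '30 sec',
        '15 sec', '10 sec', '5 sec', '2 sec', '1 sec'
    ]
) -> pandas.DataFrame:
    """Smooth the raw (time, concentration) pairs and resample them at a fixed rate."""
    pass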

You can use the sample data below (same as in the image).

An example resampling implementation can be found here. However, it does not interpolate between data points: with data every 6 seconds, resampling at 2 seconds results in data gaps even though there were no recording gaps.
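
To illustrate the interpolation that is missing there, here is a minimal sketch of upsampling onto a fixed grid with pandas. The column names "utc" and "xco2" and the helper name upsample_with_interpolation are assumptions made for this example, and the rate is assumed to be given as a pandas offset alias such as "2s" rather than one of the labels from the signature above.

```python
import pandas as pd


def upsample_with_interpolation(df: pd.DataFrame, rate: str) -> pd.DataFrame:
    """Resample onto a fixed time grid, interpolating linearly in time.

    Assumes hypothetical columns "utc" (unique, sorted datetimes) and "xco2".
    """
    series = df.set_index("utc")["xco2"]
    # Regular output grid that lies inside the time range of the input
    grid = pd.date_range(
        series.index[0].ceil(rate), series.index[-1].floor(rate), freq=rate
    )
    # Interpolate on the union of raw and grid timestamps, then keep only the grid
    combined = series.reindex(series.index.union(grid)).interpolate(method="time")
    return combined.reindex(grid).rename_axis("utc").reset_index()


# Data every 6 seconds, resampled every 2 seconds
raw = pd.DataFrame({
    "utc": pd.date_range("2022-06-04 10:00:00", periods=4, freq="6s"),
    "xco2": [0.0, 6.0, 12.0, 18.0],
})
out = upsample_with_interpolation(raw, "2s")
print(out)
```

With this approach every grid point between two raw samples receives a value, so 6-second data resampled at 2 seconds has no holes. It does not yet treat real recording gaps differently.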

As a smoothing function (applied before resampling) you can use scipy.signal.savgol_filter(x, window_length, polyorder) with window_length = 31 and polyorder = 3.
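
A minimal sketch of this smoothing step, run on synthetic data. The helper name smooth and the choice to return inputs shorter than the window unchanged are assumptions for illustration, not part of the pipeline.

```python
import numpy as np
from scipy.signal import savgol_filter


def smooth(values: np.ndarray, window_length: int = 31, polyorder: int = 3) -> np.ndarray:
    """Savitzky-Golay smoothing; inputs shorter than the window are returned as-is."""
    if len(values) < window_length:
        # With the default mode="interp", window_length must not exceed len(values)
        return values.copy()
    return savgol_filter(values, window_length, polyorder)


# Synthetic example: a smooth curve plus random noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
clean = x**3 - x
noisy = clean + rng.normal(scale=0.01, size=x.size)
smoothed = smooth(noisy)
```

Note that savgol_filter works on the sample index and treats the samples as evenly spaced, so it does not take the actual timestamps into account.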

proffast-2.2-outputs-20220604

proffast-2.2-outputs-20220604.zip

dostuffthatmatters commented 1 year ago

It is important that, even though resampling should work at higher rates than the input data rate, no output is generated inside input data gaps longer than a certain threshold (e.g. 1 minute).

Example, with i = input and o = output when resampling at twice the input data rate:

i   i   -   i   -   -   -   i   i   i   i
o o o o o o o - - - - - - - o o o o o o o
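
The diagram above can be reproduced with a small NumPy sketch. Times are plain numbers in arbitrary units, and the names resample_with_gap_limit, step and max_gap are hypothetical. The threshold of 10 units is chosen only so that the gap of 8 is filled and the gap of 16 is not.

```python
import numpy as np


def resample_with_gap_limit(t_in, y_in, step, max_gap):
    """Interpolate linearly onto a fixed grid, leaving NaN inside long input gaps.

    t_in must be sorted. Output points that fall strictly between two input
    samples which are more than max_gap apart are set to NaN.
    """
    t_in = np.asarray(t_in, dtype=float)
    y_in = np.asarray(y_in, dtype=float)
    t_out = np.arange(t_in[0], t_in[-1] + step / 2, step)
    y_out = np.interp(t_out, t_in, y_in)
    # Index of the first input sample at or after each output time
    right = np.clip(np.searchsorted(t_in, t_out, side="left"), 0, len(t_in) - 1)
    left = np.maximum(right - 1, 0)
    on_sample = t_in[right] == t_out
    in_long_gap = (t_in[right] - t_in[left] > max_gap) & ~on_sample
    y_out[in_long_gap] = np.nan
    return t_out, y_out


# Input positions from the diagram: samples every 4 units, with some missing
t_in = [0, 4, 12, 28, 32, 36, 40]
y_in = [float(t) for t in t_in]
t_out, y_out = resample_with_gap_limit(t_in, y_in, step=2, max_gap=10)
pattern = "".join("-" if np.isnan(v) else "o" for v in y_out)
print(pattern)
```

Output points that coincide with an input sample are always kept, including the samples on either edge of a long gap.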

patrickjaigner commented 1 year ago

I could see a Python data class being helpful for overall readability in the end. A lot of operations could be moved into internal methods, as is done for example in the Python list object.
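
One possible reading of this suggestion, as a rough sketch: a small data class that wraps the data frame and exposes the operations as methods. The class name ConcentrationSeries, the column names "utc" and "xco2", and the rolling mean used as a stand-in for the real smoothing step are all assumptions for illustration.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class ConcentrationSeries:
    """Hypothetical wrapper around a frame with "utc" and "xco2" columns."""

    df: pd.DataFrame

    def __len__(self) -> int:
        return len(self.df)

    def without_missing(self) -> "ConcentrationSeries":
        # Drop rows that have no concentration value
        cleaned = self.df.dropna(subset=["xco2"]).reset_index(drop=True)
        return ConcentrationSeries(cleaned)

    def smoothed(self, window: int) -> "ConcentrationSeries":
        # Centered rolling mean as a placeholder for the actual smoothing
        out = self.df.copy()
        out["xco2"] = out["xco2"].rolling(window, center=True, min_periods=1).mean()
        return ConcentrationSeries(out)


series = ConcentrationSeries(pd.DataFrame({
    "utc": pd.date_range("2022-06-04 10:00:00", periods=5, freq="6s"),
    "xco2": [1.0, float("nan"), 3.0, 5.0, 7.0],
}))
result = series.without_missing().smoothed(window=3)
```

Each method returns a new object, so the steps can be chained and read in the order they are applied.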

dostuffthatmatters commented 1 year ago

Hi @vyasgiridhar!

You can find an example of how to group measurements by minute and average them here (last line): https://github.com/tum-esm/automated-retrieval-pipeline/blob/df0d27ef843d14bd5f7fee124c41b937280fe098/extract-retrieval-data/src/procedures/read_from_database.py#L25-L90

However, this only works when the resampling rate is lower than the data rate. At higher resampling rates, there has to be some interpolation in between. You don't have to write an interface to the database or set up any big codebase. You only have to figure out a way to resample, including interpolation.
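
To make the limitation concrete, here is a minimal sketch of bin averaging with pandas. It is not the linked code. The helper name average_per_bin and the column names "utc" and "xco2" are assumptions for this example.

```python
import pandas as pd


def average_per_bin(df: pd.DataFrame, rate: str = "1min") -> pd.DataFrame:
    """Average all samples that fall into the same time bin; empty bins are dropped."""
    binned = df.set_index("utc")["xco2"].resample(rate).mean()
    return binned.dropna().reset_index()


# One sample every 30 seconds
raw_minutes = pd.DataFrame({
    "utc": pd.date_range("2022-06-04 10:00:00", periods=4, freq="30s"),
    "xco2": [1.0, 3.0, 5.0, 7.0],
})
per_minute = average_per_bin(raw_minutes, "1min")  # two samples per bin
per_10s = average_per_bin(raw_minutes, "10s")      # at most one sample per bin
print(len(per_minute), len(per_10s))
```

Averaging per minute merges samples as intended, but with 10-second bins most bins are empty and no new points appear between the raw samples, which is why interpolation is needed.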

Best, Moritz

MarlonMueller commented 1 year ago

Hi, @vyasgiridhar (and @dostuffthatmatters),

I've refactored the repo up to a stage where I now need to consult with @dostuffthatmatters about the next steps. Regarding the post-processing, as of now, you need to implement the following to resolve this issue. It's the same idea as above, just in an isolated function (so you don't need to worry about the DB request).

https://github.com/tum-esm/automated-retrieval-pipeline/blob/57213702b732df582b0cb5869092e0f826c41582/extract-retrieval-data/src/procedures/dataframe.py#L54-L78

Best, Marlon