univie-datamining-team3 / assignment2

Analysis of mobility data
MIT License

Preprocessor.downsample_time_series_category() introduces NaN values #12

Closed rmitsch closed 6 years ago

rmitsch commented 6 years ago

To reproduce:

    import os

    # Preprocessor is the project's preprocessing class.
    token = os.environ.get("KEY_RAPHAEL")
    dfs = Preprocessor.preprocess([token])

    cd = dfs[token]["trips"][14]
    # NaN count before resampling.
    print(cd["sensor"].isnull().sum().sum())
    all_sensors_resampled = Preprocessor.downsample_time_series_per_category(
        cd["sensor"], categorical_colnames=["sensor"]
    )
    # NaN count after resampling.
    print(all_sensors_resampled.isnull().sum().sum())

rmitsch commented 6 years ago

See Preprocessor._filter_nan_values(...) in d44a6a5064c7cdb2c340517e9df4df8a42a80232 for removal of NaN values/dataframes with large percentage of NaN values, if necessary.
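For illustration, such a filter could look roughly like the sketch below. This is not the code from that commit; the helper name `drop_or_reject_nans` and the 10% threshold are assumptions made up for this example.

```python
import numpy as np
import pandas as pd


def drop_or_reject_nans(df, max_nan_fraction=0.1):
    """Hypothetical helper: drop rows containing NaN, or return None
    if the share of such rows exceeds max_nan_fraction (assumed 10%)."""
    if df.empty:
        return df
    # Fraction of rows that contain at least one NaN.
    nan_fraction = df.isnull().any(axis=1).mean()
    if nan_fraction > max_nan_fraction:
        return None
    return df.dropna()


# 1 of 20 rows is NaN (5%), so the frame is kept and the row is dropped.
mostly_clean = pd.DataFrame({"x": [1.0, 2.0, np.nan] + [3.0] * 17})
# 2 of 3 rows are NaN, so the whole frame is rejected.
mostly_nan = pd.DataFrame({"x": [np.nan, np.nan, 1.0]})

kept = drop_or_reject_nans(mostly_clean)
rejected = drop_or_reject_nans(mostly_nan)
```

Returning None for rejected frames is just one option; the caller could equally filter the list of trips up front.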

Lumik7 commented 6 years ago

The NaN values are introduced by pandas' resample(). It only produces them for the magnetic and acceleration sensors in two of @rmitsch's trips. I couldn't figure out whether this is a problem with the trips (maybe some floating-point error?) or a bug in resample(). I would recommend dropping the NaN values after resampling, because only very few records are affected.

The code below reproduces the error using only pandas' resample():

    import os

    token = os.environ.get("KEY_RAPHAEL")
    dfs = Preprocessor.preprocess([token])

    dfs[token]["trips"] = Preprocessor.convert_timestamps(dfs[token]["trips"])
    for i, cd in enumerate(dfs[token]["trips"]):
        print("trip: ", i)
        before_sampling = cd["sensor"].isnull().sum().sum()
        print("Total NaNs before resampling: ", before_sampling)

        # Keep only the acceleration readings and average them per second.
        accel = cd["sensor"]
        accel = accel[accel["sensor"] == "acceleration"]
        accel_resampled = accel.set_index("time").resample("S").mean()
        after_sampling = accel_resampled.isnull().sum().sum()
        print("Total NaNs after resampling: ", after_sampling)

Lumik7 commented 6 years ago

Update:

There is a lag in the recording of the data. pandas' resample() is working as expected: it simply fills the seconds for which nothing was recorded with NaNs. See the recording below, with the resampled table on the left and the original one on the right:

[Screenshot: bug]
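The same behaviour can be reproduced without the trip data. The sketch below uses made-up timestamps and values (not taken from the recording) with nothing logged during seconds 2 and 3, so the one-second bins for those seconds come out empty:

```python
import pandas as pd

# Synthetic readings with a recording gap: nothing logged in seconds 2 and 3.
times = pd.to_datetime([
    "2018-01-01 00:00:00.2",
    "2018-01-01 00:00:00.7",
    "2018-01-01 00:00:01.1",
    "2018-01-01 00:00:04.3",
])
accel = pd.DataFrame({"time": times, "x": [0.1, 0.3, 0.5, 0.9]})

# One-second bins; an empty bin has nothing to average, so mean() yields NaN.
resampled = accel.set_index("time").resample("1s").mean()
print(resampled)

nan_count = int(resampled["x"].isnull().sum())
print("NaNs after resampling:", nan_count)
```

The input contains no NaNs at all; the two NaN rows appear only because resampling creates a row for every second between the first and the last reading.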

rmitsch commented 6 years ago

How about we replace those NaNs, either with the last valid value or by interpolating between the last valid value before the gap and the first one after it?
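A quick sketch of both options on a made-up, already resampled series (the values are illustrative only, not from the trips):

```python
import numpy as np
import pandas as pd

# Evenly spaced one-second index with a two-second gap in the values.
index = pd.date_range("2018-01-01", periods=5, freq="1s")
resampled = pd.Series([0.2, 0.5, np.nan, np.nan, 0.8], index=index, name="x")

# Option 1: carry the last valid value forward into the gap.
forward_filled = resampled.ffill()

# Option 2: interpolate linearly between the neighbours of the gap.
interpolated = resampled.interpolate(method="linear")

print(pd.DataFrame({"ffill": forward_filled, "interpolated": interpolated}))
```

Forward filling keeps the signal flat across the gap, while linear interpolation produces a ramp between the two surrounding values.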

Lumik7 commented 6 years ago

Interpolation should be fine, but we should consider treating trips with large lags (maybe > 10 seconds) as invalid. On second thought, if we really have to use only 30-second cuts for the clustering, we could just skip those rows when cutting. On a side note, this issue is also related to issue #11.
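The lag check could be sketched as follows. The function name `has_large_gap` is hypothetical, and the 10-second threshold is only the rough value suggested above, not something we have agreed on:

```python
import pandas as pd

# Assumed threshold, taken from the "> 10 seconds" suggestion.
MAX_GAP = pd.Timedelta(seconds=10)


def has_large_gap(timestamps, max_gap=MAX_GAP):
    """Return True if two consecutive readings are more than max_gap apart."""
    gaps = timestamps.sort_values().diff().dropna()
    return bool((gaps > max_gap).any())


# Made-up timestamps: the first trip has a 4 s lag, the second a 15 s lag.
ok_trip = pd.Series(pd.to_datetime([
    "2018-01-01 00:00:00", "2018-01-01 00:00:04", "2018-01-01 00:00:05",
]))
bad_trip = pd.Series(pd.to_datetime([
    "2018-01-01 00:00:00", "2018-01-01 00:00:15",
]))

print(has_large_gap(ok_trip), has_large_gap(bad_trip))
```

Trips that pass the check could then be interpolated, while the others are dropped or cut around the gap.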

rmitsch commented 6 years ago

Closed due to Preprocessor.downsample_time_series_category() being deprecated in favour of PAA.