nilmtk / nilmtk

Non-Intrusive Load Monitoring Toolkit (nilmtk)
http://nilmtk.github.io
Apache License 2.0

power_series returns a data frame with NaN when using the parameter sample_period #523

Open haderazzini opened 8 years ago

haderazzini commented 8 years ago

Hello,

I'm trying to load the submeter data using power_series, but the result contains a lot of NaNs:

In[10]:

    from nilmtk import DataSet

    ukdale = DataSet('../../data/ukdale.h5')
    ukdale.set_window(start='2014-11-03', end='2014-11-14')

    # Restrict to the top 10 meters of building 1.
    ukdale_elec = ukdale.buildings[1].elec
    ukdale_elec = ukdale_elec.select_top_k(k=10)

    print('')
    print('')

    for meter in ukdale_elec.submeters().meters:
        # power_series() returns a generator; next() fetches its first chunk.
        # Here the chunk is resampled to a 1-second period.
        df_meter = next(meter.power_series(sample_period=1))

        if df_meter.isnull().values.any():
            print(meter.label())
            print('number of NAN')
            print(df_meter.isnull().sum().sum())

            # Same meter again, this time without resampling (raw data).
            print('number of NAN without sample_period')
            df_test = next(meter.power_series())
            print(df_test.isnull().sum().sum())
            print('---')

Out[10]:

    53/53 ElecMeter(instance=54, building=1, dataset='UK-DALE', site_meter, appliances=[Appliance(type='immersion heater', instance=1), Appliance(type='water pump', instance=1), Appliance(type='security alarm', instance=1), Appliance(type='fan', instance=2), Appliance(type='drill', instance=1), Appliance(type='laptop computer', instance=2)])

    Washer dryer
    number of NAN
    27
    number of NAN without sample_period
    0
    ---
    Kettle
    number of NAN
    11957
    number of NAN without sample_period
    0
    ---
    Fridge freezer
    number of NAN
    7
    number of NAN without sample_period
    0
    ---
    Boiler
    number of NAN
    100
    number of NAN without sample_period
    0

How can I fix it?

Regards

oliparson commented 8 years ago

I think that by passing the sample_period parameter, you're asking NILMTK to resample the power series to 1 second resolution. If you don't want to resample, simply don't pass the parameter as in your example.

haderazzini commented 8 years ago

Hi Oliver,

Thank you for the quick answer. However, I do need to resample. I believe power_series should offer a way to fill the NaNs.

JackKelly commented 8 years ago

I haven't checked the dataset but I suspect the problem is not with the code, but rather in my dataset, UK-DALE :)

If you use meter.power_series() (i.e. without specifying a sample_period) then you'll get back the raw data. If there are gaps then these won't appear as NaNs. Instead gaps in the data simply won't be represented. There will be no rows of data when no data was recorded.

If you force NILMTK to resample to 1Hz by passing power_series(sample_period=1) then any missing data will be represented as NaNs.
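As a minimal sketch of the behaviour described above, the snippet below uses plain pandas on made-up readings (it does not call NILMTK, and the timestamps and values are invented for illustration). The raw series has no rows for the gap, so it has no NaNs; putting it onto a regular 1-second grid creates rows for the missing seconds, and those rows are NaN.

```python
import pandas as pd

# Synthetic readings with a gap: nothing was recorded at 00:00:02 and 00:00:03.
index = pd.to_datetime([
    "2014-11-03 00:00:00",
    "2014-11-03 00:00:01",
    "2014-11-03 00:00:04",
])
raw = pd.Series([100.0, 101.0, 98.0], index=index)

# The gap is simply absent from the raw series, so there is nothing to be NaN.
raw_nans = int(raw.isnull().sum())

# Resampling onto a regular 1-second grid creates a row for every second,
# including the two seconds with no reading; those rows come out as NaN.
resampled = raw.resample("1s").mean()
resampled_nans = int(resampled.isnull().sum())

print("NaNs in raw series:      ", raw_nans)
print("NaNs in resampled series:", resampled_nans)
```

The NaN count after resampling is therefore a measure of how many samples are missing from the recording, not a sign that resampling corrupted anything.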

There is no perfect way to fix this. If the data is missing then the data is missing ;)

It might be best to hunt around for a period of time in the dataset when there are fewer gaps in the data. See Figure 3 in my paper.

haderazzini commented 8 years ago

I also believe that the problem is in the dataset. However, I think that power_series should have the option to ffill or bfill, like in pandas. An option to fill the NaNs would make it easier to work with missing data.

JackKelly commented 8 years ago

> I also believe that the problem is in the dataset

Sure. Although, just to be clear: pretty much all datasets have missing data. It's not just UK-DALE :)

> I think that power_series should have the option to ffill or bfill, like in pandas.

It does (although I admit that it's not well documented). You can pass a resample_kwargs dict:

        resample_kwargs : dict of key word arguments (other than 'rule') to
            pass to `pd.DataFrame.resample()`.  Defaults to set 'limit' to
            `sample_period / max_sample_period` and sets 'fill_method' to ffill.

See the ElecMeter.load() docs.

NILMTK does, by default, forward fill sample_period / max_sample_period samples.

If you want to forward fill the entire gap then do something like power_series(resample_kwargs={"limit": None}). (I haven't tested this, and I would recommend against forward filling the entire gap. Instead I'd zero out large gaps in UK-DALE's appliance data.)
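The idea of forward filling short gaps and zeroing out long ones can be sketched with plain pandas calls on an already-loaded series, rather than through resample_kwargs. This is only an illustration under assumptions: the readings are synthetic, and the value of `limit` is a hypothetical choice, not NILMTK's computed default.

```python
import pandas as pd

# Synthetic appliance readings with one short gap and one long gap.
index = pd.to_datetime([
    "2014-11-03 00:00:00",
    "2014-11-03 00:00:02",  # short gap before this reading: 1 missing second
    "2014-11-03 00:00:10",  # long gap before this reading: 7 missing seconds
])
raw = pd.Series([50.0, 52.0, 49.0], index=index)

# Put the data on a regular 1-second grid; missing seconds become NaN.
regular = raw.resample("1s").mean()

# Forward fill at most `limit` consecutive missing samples in each gap.
limit = 2  # hypothetical value chosen for this example
filled = regular.ffill(limit=limit)

# Anything still NaN lies deeper than `limit` samples into a gap.
# Treat it as "appliance off" by setting it to zero.
zeroed = filled.fillna(0.0)

print(zeroed)
```

With these numbers the one-second gap is bridged completely, the first two seconds of the long gap are forward filled, and the remaining five seconds are set to zero.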

haderazzini commented 8 years ago

That is true, the majority of datasets have missing data.

I'm using ffill because I want to build a virtual main meter from the sum of the appliances. Any NaN in an appliance series produces a NaN in the virtual main meter. Do you have any good ideas on how to avoid that?

JackKelly commented 8 years ago

I'd still recommend zeroing-out the NaNs in the appliance data. Something like power_series(resample_kwargs={"fill_value": 0}) might work. (not tested)
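For the virtual main meter itself, here is a small pandas-only sketch of zeroing out NaNs before summing. It assumes the appliance series are already resampled onto the same 1-second index; the appliance names and power values are made up, and none of this is NILMTK API.

```python
import pandas as pd

# Two synthetic appliance series on a shared 1-second index, each with a gap.
index = pd.date_range("2014-11-03", periods=4, freq="1s")
kettle = pd.Series([0.0, 2000.0, float("nan"), 0.0], index=index)
fridge = pd.Series([90.0, 90.0, 90.0, float("nan")], index=index)

# Adding the series directly propagates NaN: one missing appliance sample
# makes the whole sum NaN at that timestamp.
naive_sum = kettle + fridge

# Zero out the NaNs first, then sum across appliances.
appliances = pd.concat({"kettle": kettle, "fridge": fridge}, axis=1)
virtual_mains = appliances.fillna(0.0).sum(axis=1)

print("NaNs in naive sum:    ", int(naive_sum.isnull().sum()))
print("NaNs in virtual mains:", int(virtual_mains.isnull().sum()))
print(virtual_mains)
```

One caveat with this approach is that a zero in the result cannot be told apart from a timestamp where the appliance data was missing, so it may be worth keeping a separate mask of the gaps.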