nilmtk / nilmtk-contrib

Apache License 2.0

Handle learning and prediction for multiple channels (active, reactive, etc.) #24

Open Wusir2018 opened 4 years ago

Wusir2018 commented 4 years ago

Hello everyone, I can run the algorithm with REDD. When I use AMPds data to test the algorithm, I get an error. I have attached screenshots of my code and the error. I think the timestamps in AMPds may cause this problem, but I don't know how to solve it. Could you help me? Thank you very much!

[screenshots: code (1, 2), error (3, 4)]

eddie-LA commented 3 years ago

I have the same problem...

EDIT: I have tracked the issue down. When using only one power signature (e.g. only active power), it works fine. Notice the sizes of the dataframes.

Value of concatenated_df_app in api.py when using 1 power signature:

physical_quantity            power
type                      apparent
unix_ts
2014-01-30 00:00:00-08:00     54.0
2014-01-30 00:01:00-08:00     54.0
2014-01-30 00:02:00-08:00     53.0
2014-01-30 00:03:00-08:00     54.0
2014-01-30 00:04:00-08:00     53.0
...                            ...
2014-03-31 23:55:00-07:00     54.0
2014-03-31 23:56:00-07:00     54.0
2014-03-31 23:57:00-07:00     55.0
2014-03-31 23:58:00-07:00     54.0
2014-03-31 23:59:00-07:00     54.0

**[87780 rows x 1 columns]**

And this is the value of the index:

DatetimeIndex(['2014-01-30 00:00:00-08:00', '2014-01-30 00:01:00-08:00',
               '2014-01-30 00:02:00-08:00', '2014-01-30 00:03:00-08:00',
               '2014-01-30 00:04:00-08:00', '2014-01-30 00:05:00-08:00',
               '2014-01-30 00:06:00-08:00', '2014-01-30 00:07:00-08:00',
               '2014-01-30 00:08:00-08:00', '2014-01-30 00:09:00-08:00',
               ...
               '2014-03-31 23:50:00-07:00', '2014-03-31 23:51:00-07:00',
               '2014-03-31 23:52:00-07:00', '2014-03-31 23:53:00-07:00',
               '2014-03-31 23:54:00-07:00', '2014-03-31 23:55:00-07:00',
               '2014-03-31 23:56:00-07:00', '2014-03-31 23:57:00-07:00',
               '2014-03-31 23:58:00-07:00', '2014-03-31 23:59:00-07:00'],
              dtype='datetime64[ns, America/Vancouver]', name='unix_ts', length=87780, freq='60S')

However, there is a problem when using multiple signatures, in my case three: active, apparent, and reactive power. This is the error I get:

ValueError: Length of passed values is 263340, index implies 87780

If you pay attention, you will see that 263340 divided by 87780 is exactly 3!

Value of concatenated_df_app in api.py when using 3 power signatures - active, apparent and reactive power.

physical_quantity          power
type                      active apparent reactive
unix_ts
2014-01-30 00:00:00-08:00   37.0     54.0     18.0
2014-01-30 00:01:00-08:00   39.0     54.0     18.0
2014-01-30 00:02:00-08:00   38.0     53.0     17.0
2014-01-30 00:03:00-08:00   38.0     54.0     17.0
2014-01-30 00:04:00-08:00   38.0     53.0     17.0
...                          ...      ...      ...
2014-03-31 23:55:00-07:00   38.0     54.0     19.0
2014-03-31 23:56:00-07:00   38.0     54.0     17.0
2014-03-31 23:57:00-07:00   39.0     55.0     18.0
2014-03-31 23:58:00-07:00   39.0     54.0     19.0
2014-03-31 23:59:00-07:00   38.0     54.0     18.0

**[87780 rows x 3 columns]**

The index is the same. Hence, the error comes from this line: gt[meter] = pd.Series(concatenated_df_app.values.flatten(), index=index)

Using flatten shoves everything into one column, which is X times larger than the index, where X is the number of power signatures you test on.
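The mismatch is easy to reproduce in miniature. Below is a hypothetical stand-in for concatenated_df_app with 4 rows and 3 channels (the real frame has 87780 rows); flattening it yields 3 * N values against an N-entry index, which is exactly the reported ValueError:

```python
import numpy as np
import pandas as pd

# Miniature stand-in for concatenated_df_app (names hypothetical): 4 rows, 3 channels.
index = pd.date_range("2014-01-30", periods=4, freq="min", tz="America/Vancouver")
df = pd.DataFrame(
    np.arange(12.0).reshape(4, 3),
    index=index,
    columns=pd.MultiIndex.from_tuples(
        [("power", "active"), ("power", "apparent"), ("power", "reactive")]
    ),
)

# flatten() yields 3 * N values, but the index only has N entries.
flat = df.values.flatten()
print(flat.shape)  # (12,)

try:
    pd.Series(flat, index=index)  # same construction as in api.py
except ValueError as err:
    print(err)  # the length-mismatch error reported above
```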

A solution (ugly, prone to breakage) is to repeat the index X times for the X power signatures. That is most likely not going to work well, especially for graphs, but at least it gets things moving. This highlights a wider issue with using more than one power signature... @levaphenyl What do you think?
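The repeated-index workaround can be sketched as follows, again on a hypothetical 4-row, 3-channel frame. Index.repeat matches the row-major order of values.flatten(), i.e. (t0, active), (t0, apparent), (t0, reactive), (t1, active), and so on:

```python
import numpy as np
import pandas as pd

# Same miniature stand-in for concatenated_df_app as discussed above.
index = pd.date_range("2014-01-30", periods=4, freq="min", tz="America/Vancouver")
df = pd.DataFrame(
    np.arange(12.0).reshape(4, 3),
    index=index,
    columns=pd.MultiIndex.from_tuples(
        [("power", "active"), ("power", "apparent"), ("power", "reactive")]
    ),
)

# Repeat each timestamp once per channel so the flattened values line up.
repeated_index = df.index.repeat(df.shape[1])
series = pd.Series(df.values.flatten(), index=repeated_index)
print(len(series))  # 12 == 3 * len(index)
```

Note the resulting index contains each timestamp three times, which is exactly why plotting and metrics downstream are likely to misbehave.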

levaphenyl commented 3 years ago

@eddie-LA Thank you for the analysis! This indeed looks like a bug in the API class. The design question is the expected output of the API: what is the metric value when predicting 3 different time series?

The current approach is to concatenate all time series, also in the regressor implementations of nilmtk-contrib. However, this mixes values with different units (W, VA, etc.), which has little physical meaning. Keeping the channels separated could be an interesting approach, i.e. feeding matrices of shape (N, 3) instead of column vectors of shape (3 * N, 1) into the models. But then, how do you define the metrics? Do you expect 3 different metrics at the end of the experiment?
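The two layouts under discussion can be contrasted in a few lines of NumPy (synthetic data, purely illustrative):

```python
import numpy as np

# N observations of three channels (synthetic stand-ins for the real signals).
rng = np.random.default_rng(0)
active, apparent, reactive = rng.random((3, 5))

# Current approach: concatenate into one long column vector, mixing W and VA.
stacked = np.concatenate([active, apparent, reactive]).reshape(-1, 1)

# Alternative: keep channels separate as an (N, 3) matrix.
matrix = np.stack([active, apparent, reactive], axis=1)

print(stacked.shape, matrix.shape)  # (15, 1) (5, 3)
```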

I see 2 cases for using the API with multiple channels:

  1. you are really interested in data fusion of active and reactive power and the effects on algorithmic performance, or
  2. you are using datasets where the mains were measured with a different unit than the appliances but you are only interested in the active power.

The API should handle both, but fixing case 2 is the most urgent in my opinion.
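For case 2, one possible shape of the fix (a sketch, not the actual api.py code) is to select a single channel from the MultiIndex mains frame before it reaches the models, so downstream code keeps seeing one column per meter:

```python
import numpy as np
import pandas as pd

# Hypothetical mains frame carrying several physical quantities, mirroring
# the MultiIndex layout printed earlier in this thread.
columns = pd.MultiIndex.from_tuples(
    [("power", "active"), ("power", "apparent"), ("power", "reactive")],
    names=["physical_quantity", "type"],
)
mains = pd.DataFrame(np.ones((4, 3)), columns=columns)

# Keep only the channel of interest; the index is untouched, so the
# pd.Series construction in api.py would line up again.
active_only = mains[("power", "active")]
print(active_only.shape)  # (4,)
```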

eddie-LA commented 3 years ago

I was mainly interested in case 1. I wanted to experiment with the algorithms in NILMTK and see how they fare against Makonin's temporal convolutional architecture here. He shows that reactive power is more useful as a disaggregation feature, but since customers are billed for active power, it seems counterproductive to disregard it. It has its uses, and the best results were recorded when the algorithm had access to several signatures anyway.

levaphenyl commented 3 years ago

Interesting research! If I understand correctly, you need NILMTK and nilmtk-contrib to accept active and reactive power as input, while the algorithms in your research infer only the active power as output. Is that right? We thus have multi-channel in, single-channel out.

eddie-LA commented 3 years ago

Correct! Active power is the only usable output metric, but not the only usable input metric. Depending on the occasion, there can be other useful signatures to include.

I have a code snippet somewhere for flattening a matrix of matrices (my neural network at the time used this architecture; it is easy to visualize and work with) in order to feed the data through the optimizer; all matrices were then restored to their original shape inside the network. Was it the most efficient code? No, but it works and is not horribly slow, since it is NumPy. Something of the sort could be useful here. I'll do some brainstorming. EDIT: That won't work here. The processing pipeline would need to be rewritten for that to be a solution, since we are dealing with a 1-dimensional vector from that point onward. Have a look at my code here to see how it's done in that paper. It's a fork with some fixes, since the original code is slightly... buggy.
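A toy version of that flatten-and-restore trick (illustrative only, not the snippet referenced above): a batch of 2-D windows is flattened to feed the optimizer, then reshaped back to the original layout inside the network.

```python
import numpy as np

batch = np.arange(24).reshape(4, 3, 2)    # 4 windows of shape (3, 2)
flat = batch.reshape(batch.shape[0], -1)  # (4, 6), one row per window
restored = flat.reshape(batch.shape)      # back to (4, 3, 2), lossless

print(np.array_equal(batch, restored))  # True
```

As noted in the EDIT, this only helps if the pipeline still knows the original shape at restore time, which is not the case once everything has been collapsed to a 1-D vector.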

Slightly off-topic: if rewriting the dataset loading and processing pipeline is on the schedule, it might be prudent to do that first, even though it is a bigger task. Newer TensorFlow has faster NumPy and dataset ops built in here, AFAIK with automatic chunk handling. Here is a small one-page tutorial of its basic capabilities. Obviously we'd lose absolutely granular control, but we'd gain a lot of performance, a smaller codebase, and easier maintenance, since it stays closer to the TensorFlow base, etc.