How to represent a Building in memory?

JackKelly commented 10 years ago

We were thinking of using just a pandas.DataFrame (a 2D matrix indexed by timestamp) for storing a whole building worth of data. Columns would be labelled things like 'mains', 'kettle', 'fridge' etc.

But how would we answer questions like:

Does the kettle column store active or reactive power?

We could just use column names like 'kettle_active'.

How do we know which columns store mains power?

Complications:

multiple phases
split phases
datasets record different combinations of apparent | active | reactive | voltage

e.g. in REDD, we have two columns labelled 'mains'.

We need a standard way to programmatically figure out which columns hold aggregate data, and to know exactly what those columns record.

At the very least, perhaps we should use more descriptive and standard column names like 'aggregate1_active' etc.

Or should we use two DataFrames per Building: one for mains; one for appliances? Some datasets record mains data and appliance data at different sample rates, so using two DataFrames would make some sense. This would also allow us to easily add an 'appliances_estimated' DataFrame to each building to hold the NILM output for that building, and then we can really easily compare the ground truth to the estimates.

How to store metadata about each Building?

Metadata we might want to store:

geo location
number of occupants
nominal mains voltage
which room is each appliance in?
etc...

Some options:

Add new attributes to the DataFrame

df = pd.DataFrame([])
df.location = {'country': 'UK', 'postcode': 'SE15'}

There's some discussion of this on StackOverflow.

This seems rather fragile. The main problem is that many DataFrame methods (e.g. resample) return a new DataFrame without our newly added attributes.
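A minimal sketch of the fragility described above, assuming a reasonably recent pandas. The column name, readings and `location` value are made up for illustration.

```python
import warnings

import pandas as pd

# Hypothetical one-column frame of power readings at one-minute intervals.
index = pd.date_range("2013-12-01", periods=4, freq="1min")
df = pd.DataFrame({"kettle_active": [0.0, 2000.0, 2100.0, 0.0]}, index=index)

# Attach metadata as a plain Python attribute. pandas may warn here because
# it suspects we meant to create a column, so silence that for the demo.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.location = {"country": "UK", "postcode": "SE15"}

# resample(...).mean() builds a brand-new DataFrame ...
resampled = df.resample("2min").mean()

# ... and the ad-hoc attribute is not carried over to it.
has_location_before = hasattr(df, "location")
has_location_after = hasattr(resampled, "location")
print(has_location_before, has_location_after)
```

Any code that resamples, slices or copies the frame would have to re-attach the metadata by hand, which is easy to forget.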

Use a dict

building = {'aggregate': DataFrame,
            'appliances_ground_truth': DataFrame,
            'appliances_estimated': DataFrame,
            'location': {'country': 'UK', 'postcode': 'SE15'},
            'nominal mains voltage': 230}

Use a Building class

Use the same attributes as for the dict plus a bunch of useful methods on the Building class like:

get_vampire_power()
get_diff_between_aggregate_and_appliances()
crop(start='1/1/2010', end='1/1/2011'): reduce all timeseries to just these dates
plot_appliance_activity(source='ground_truth'): plot a compact representation of all appliance activity

None of these methods would be especially complex, and all could be implemented as stand-alone functions.

Any thoughts?

There's some fascinating discussion of storing metadata with DataFrames on the Pandas issue queue; this post by hugadams (and the following answers) is particularly relevant to us. They recommend using a class.

Compared to a class, a dict is probably simpler but probably more fragile, less versatile and possibly less 'semantically appropriate' (by which I mean that, in the real world that we're modelling, the concept of a "Building" is prominent(!) and so it would make sense to use a Building class).

I have no strong feelings either way although I'm probably leaning slightly towards using a class.

nipunbatra commented 10 years ago

Quick question: What if appliance dataset has more than 1 field per appliance and both are relevant for some algo?

JackKelly commented 10 years ago

What if appliance dataset has more than 1 field per appliance and both are relevant for some algo?

You mean, for example, a dataset which might record both active and reactive power for an appliance? Then I think that can be handled by using column names like 'kettle_active', 'kettle_apparent' etc.

If there are multiple instances of an appliance then I guess we just do 'kettle1_active', 'kettle2_active' etc.

We don't have to use these complex column names. We could store some of this information separately in, say, a separate attribute of the Building class. I think I'd lean slightly towards using these highly descriptive column names like 'kettle2_active' because then the information is tied directly to the data and because it's clear what each column holds. But I have no strong feelings ;)
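To make the trade-off concrete, here is a rough sketch of what reading such descriptive names back could involve. The naming scheme `<appliance><optional instance>_<measurement>`, the `ParsedColumn` type and the `parse_column` helper are all assumptions made for this example.

```python
import re
from typing import NamedTuple, Optional


class ParsedColumn(NamedTuple):
    name: str
    instance: Optional[int]
    measurement: str


# Assumed naming scheme: <appliance><optional instance>_<measurement>
COLUMN_PATTERN = re.compile(
    r"(?P<name>[a-z_]+?)(?P<instance>\d*)_(?P<measurement>active|reactive|apparent)"
)


def parse_column(column: str) -> ParsedColumn:
    """Split a descriptive column name into its parts."""
    match = COLUMN_PATTERN.fullmatch(column)
    if match is None:
        raise ValueError(f"unrecognised column name: {column!r}")
    instance = int(match["instance"]) if match["instance"] else None
    return ParsedColumn(match["name"], instance, match["measurement"])


columns = ["kettle_active", "kettle2_active", "kettle2_apparent", "fridge1_reactive"]

# Selecting by measurement means parsing every name first.
active_columns = [c for c in columns if parse_column(c).measurement == "active"]
print(active_columns)
```

The names stay readable, but every query has to go through a parser like this one.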

nipunbatra commented 10 years ago

Again a quick question: Say we load the dataset and it gets stored this way as you suggested above.

Now, if I want to do my analysis only on real power, do I write some regexes to get the field names to extract, put them in a new structure and discard the remaining stuff?

JackKelly commented 10 years ago

Very good point. I'd rather our users didn't have to do regexes ;)

Perhaps your question is another argument for creating a Building class where we could have methods like get_appliance(name, measurement).

e.g. you could do series = building.get_appliance(name='boiler', measurement='active').

You could also do things like number_of_tvs = building.count_appliances('tv').
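A minimal sketch of how those two methods might look. The storage layout (a dict keyed by name, instance and measurement) and the `instance` parameter are assumptions for this example, and plain lists stand in for the pandas objects a real Building would hold.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Key: (appliance name, instance number, measurement). This layout is an
# assumption made for the sketch, not something settled in the thread.
ChannelKey = Tuple[str, int, str]


@dataclass
class Building:
    # Plain lists stand in for the pandas Series a real class would store.
    appliances: Dict[ChannelKey, List[float]] = field(default_factory=dict)

    def get_appliance(self, name: str, measurement: str, instance: int = 1) -> List[float]:
        """Return one channel, raising KeyError if it was not recorded."""
        return self.appliances[(name, instance, measurement)]

    def count_appliances(self, name: str) -> int:
        """Count distinct instances of an appliance, whatever they measure."""
        return len({inst for (n, inst, _m) in self.appliances if n == name})


building = Building()
building.appliances[("boiler", 1, "active")] = [0.0, 70.0, 72.5]
building.appliances[("tv", 1, "active")] = [0.0, 110.0, 112.0]
building.appliances[("tv", 1, "apparent")] = [0.0, 120.0, 121.0]
building.appliances[("tv", 2, "active")] = [45.0, 0.0, 0.0]

series = building.get_appliance(name="boiler", measurement="active")
number_of_tvs = building.count_appliances("tv")
print(series, number_of_tvs)
```

Callers never see how the channels are keyed, so the storage can change later without breaking user code.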

What do you think? Are we leaning towards using a Building class? I'm starting to lean in that direction ;)

nipunbatra commented 10 years ago

I think this is an interesting point which we tried to address in our SensorAct paper. I think the hierarchies and associated semantics raise interesting questions.

Overall, we have the following hierarchy:

+Dataset
++Building 1
+++Mains
++++Phase 1
+++++Appliance 1

Available information varies a lot across datasets. I think this should be discussed over G+

JackKelly commented 10 years ago

I think this should be discussed over G+

ok; I've added it to our agenda.

jreback commented 10 years ago

FYI, these guys have sub-class of DataFrame that seems to work for them (and carries meta data) https://github.com/kjordahl/geopandas/issues/36

JackKelly commented 10 years ago

@jreback wow, thanks loads for the info! That's great that Pandas now allows us to sub-class DataFrame (and that the metadata hangs around); thank you. That certainly gives us more options ;) And thank you for popping over from Pandas; I love the community here on github ;)

@nipunreddevil and @oliparson: we should discuss whether or not we want to use multiple DataFrames per building (e.g. to separate aggregate, appliance_ground_truth and appliance_estimated). If we want to use a single DataFrame per building then we could go ahead and sub-class DataFrame, as per @jreback's link.

At the time of writing, I think I'm leaning towards using a Building class with separate DataFrames for aggregate, appliance_ground_truth and appliance_estimated. Happy to be persuaded otherwise though ;)

JackKelly commented 10 years ago

Nipun and I just discussed this and we're starting to think that we should use separate DataFrames for each appliance.

And also users may want to add extra information into the Building like movement data.

We decided that between now and Fri morning I should try to throw together some code which loads REDD into our in-mem data structure; which will give us something to rip apart when we chat tomorrow morning about the data struct.

For ref, here's Nipun's code for loading REDD: https://github.com/nipunreddevil/indic

JackKelly commented 10 years ago

a40064a51928b72cf296ed93b2168f303ae268a9

nipunbatra commented 10 years ago

Another thought after #25

What about the following hierarchy

Dataset
+Building1
++Utility
+++Electricity
+++Water
+++Gas
++Ambient
+++Motion
+++Light
+++Temperature
++Soft-Sensor
+++WiFi

Soft sensor: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6151374&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6151374

Basically, Electricity goes a level below Utility. Everything otherwise remains the same.

JackKelly commented 10 years ago

That does sound logical.

My only worry is that users will have to write quite long expressions to get at the electricity data... e.g.:

dataset.buildings['house_1'].utility.electric.mains

But I suppose we should go the more logical structure. What do you think?

(BTW, I suggested using the attribute name electric rather than electricity to minimise the amount of typing required whilst also making it clear what the attribute means... but I have no strong feelings about electric vs electricity as an attribute name!)
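One possible way to keep the logical hierarchy while softening the long access path, sketched with dataclasses. The class layout and the `mains` shortcut property are hypothetical; lists stand in for DataFrames.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Electric:
    mains: List[float] = field(default_factory=list)


@dataclass
class Utility:
    electric: Electric = field(default_factory=Electric)
    # water and gas would sit alongside electric in the proposed hierarchy.


@dataclass
class Building:
    utility: Utility = field(default_factory=Utility)
    # ambient (motion, light, temperature) would be a sibling of utility.

    @property
    def mains(self) -> List[float]:
        """Hypothetical shortcut for building.utility.electric.mains."""
        return self.utility.electric.mains


@dataclass
class Dataset:
    buildings: Dict[str, Building] = field(default_factory=dict)


dataset = Dataset()
dataset.buildings["house_1"] = Building()
dataset.buildings["house_1"].utility.electric.mains.extend([230.0, 410.5])

long_form = dataset.buildings["house_1"].utility.electric.mains
short_form = dataset.buildings["house_1"].mains
print(long_form is short_form)
```

Both expressions reach the same object, so the shortcut adds convenience without duplicating data.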

nipunbatra commented 10 years ago

I have kept electric for now. Electricity does sound big and people can make spelling mistakes :).

For now, the latest commit adds utility and ambient as additional levels. The access call is a bit longer, but clearer.

Commit URL: https://github.com/nilmtk/nilmtk/commit/b014ad1d1eb7599bc6766696f604fa05d0cd61c8

JackKelly commented 10 years ago

Sounds good ;)

nipunbatra commented 10 years ago

BTW, this should break your REDD example code, but should be real quick to fix.

nipunbatra commented 10 years ago

From docstring in Building class,

Each value is a DataFrame shape (n_samples, n_features) where each
column name is one of `apparent` | `active` | `reactive` | `voltage`
and the index is a timezone-aware pd.DatetimeIndex

However, AMPds dataset has energy terms too. This needs to be taken into consideration and must be revised!

Possibly, we need to do something like

<quantity>_<subtype>
eg. power_active, energy_active

JackKelly commented 10 years ago

I'd advocate that our format converters should convert energy to power and that we should only store power in our dataset. What do you think? If AMPds stores both power and energy then I would imagine that the energy column is redundant?? (I haven't looked at the AMPds dataset!)

nipunbatra commented 10 years ago

Although we can do the conversion ourselves, I would push for including data in as close to RAW form as possible in the first step. This goes back to the discussion regarding different appliances having different times of usage. Does this additional information cause issues with querying?

JackKelly commented 10 years ago

Yeah, it is a tricky question...

I think that if we can confidently convert from energy to power then we should do that. But, as you say, if there's any chance of messing up the conversion then we should store the raw data. (Obviously, the maths for converting from energy to power is trivial because power is just energy usage over time but things can go wrong when there are missing / corrupt samples).

With the AMPds dataset, it might be interesting to try converting from energy to power and then seeing if your calculated values for power agree with the power values in the dataset. If they agree then we can safely ignore the energy values in the dataset.

When I was tinkering with the HES dataset in my PDA code, the first thing I did was to convert it from tenths-of-a-kWh to Watts; but there were occasions where I questioned if the conversion had gone wrong.

Ultimately, from the perspective of the user, I think the ideal situation is if our data format (both in-memory and on-disk) uses a single unit for each physical quantity. But, of course, not at the cost of our converters possibly corrupting the data!
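A rough sketch of the energy-to-power conversion and of one way to avoid corrupting data around gaps. It assumes cumulative energy readings in kWh; the `energy_to_power` helper, the `max_gap` threshold and the sample values are all made up for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Reading = Tuple[datetime, float]  # (timestamp, cumulative energy in kWh)


def energy_to_power(
    readings: List[Reading], max_gap: timedelta
) -> List[Tuple[datetime, Optional[float]]]:
    """Mean power in watts over each interval, stamped at the interval's end.

    Intervals longer than max_gap, or where the counter goes backwards, are
    reported as None rather than guessed at.
    """
    result = []
    for (t0, e0), (t1, e1) in zip(readings, readings[1:]):
        hours = (t1 - t0).total_seconds() / 3600.0
        if hours <= 0 or (t1 - t0) > max_gap or e1 < e0:
            result.append((t1, None))
        else:
            # kWh -> Wh, then divide by elapsed hours to get watts.
            result.append((t1, (e1 - e0) * 1000.0 / hours))
    return result


start = datetime(2013, 12, 7, 12, 0)
readings = [
    (start, 10.00),
    (start + timedelta(minutes=30), 10.50),  # 0.5 kWh in 0.5 h
    (start + timedelta(minutes=60), 10.75),  # 0.25 kWh in 0.5 h
    (start + timedelta(hours=6), 12.00),     # long gap: samples missing
]
power = energy_to_power(readings, max_gap=timedelta(hours=1))
print([p for _, p in power])
```

Marking suspicious intervals as missing, instead of averaging across them, keeps the converter from inventing a plausible-looking but wrong power value.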

JackKelly commented 10 years ago

If we store the electrical wiring configuration in a graph data structure (which allows us to naturally represent arbitrary levels of meter hierarchies) and given that there isn't always a perfectly clean distinction between what's a "circuit" and what's an "appliance" then I wonder if we can simplify our Electricity class by removing the "circuits" DataFrame and renaming the "appliances" DataFrame to something like "submetered"? Then both circuits and appliance data would go into the "submetered" DataFrame and our "wiring" graph can represent arbitrary levels of meter hierarchies whilst reducing the code complexity. What do you think? It would be lovely if our code could easily work with any metering setup, from simple domestic to large commercial systems, without having to change any code ;)
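A minimal sketch of such a wiring graph, using a plain dict as an adjacency list. The meter names and the `downstream` and `leaves` helpers are hypothetical.

```python
from typing import Dict, List

# Directed graph: each meter maps to the meters wired directly downstream.
# All meter names here are made up.
wiring: Dict[str, List[str]] = {
    "mains": ["panel_1", "panel_2"],
    "panel_1": ["kitchen_circuit", "boiler"],
    "kitchen_circuit": ["kettle", "fridge"],
    "panel_2": ["lighting_circuit"],
}


def downstream(graph: Dict[str, List[str]], meter: str) -> List[str]:
    """Every meter fed by `meter`, at any depth, in depth-first order."""
    found: List[str] = []
    stack = list(reversed(graph.get(meter, [])))
    while stack:
        node = stack.pop()
        if node not in found:
            found.append(node)
            stack.extend(reversed(graph.get(node, [])))
    return found


def leaves(graph: Dict[str, List[str]], meter: str) -> List[str]:
    """Downstream meters with nothing wired below them."""
    return [m for m in downstream(graph, meter) if not graph.get(m)]


print(downstream(wiring, "panel_1"))
print(leaves(wiring, "mains"))
```

The same two functions work for a two-level domestic setup or a deep commercial hierarchy, because nothing in them depends on the number of levels.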

nipunbatra commented 10 years ago

The graph data structure looks like a very good idea.

Regarding circuits vs appliances, I am not very sure. Maybe this can be discussed on our next hangout!

would be lovely if our code could easily work with any metering setup, from simple domestic to large commercial systems

Ya. The Building class should allow for all types of buildings!

nipunbatra commented 10 years ago

Some questions which may help us in the design of this class:

Give me all the sensors (utility, ambient) in the dining room?
Give me all the appliances connected to panel 1?

Broadly, I am thinking of it as Facebook Graph search.

nipunbatra commented 10 years ago

Few more optional fields to add

Age of building
Building automation (if used)
Area/No. of floors
Type: Commercial / Residential

JackKelly commented 10 years ago

These are all good suggestions.

It feels like we could list quite a few attributes; and that users might want to add their own. I wonder if we should have a Building.metadata attribute which is a dict? e.g. we'd have building.metadata['type'] = 'commercial'

I'm honestly not sure which I prefer: using a building.metadata dict or using separate attributes. Using separate attributes makes the code look prettier; and will help enforce the use of standard attribute names; and users are free to add new attributes (although we need to be careful that those new attributes are copied in any Building methods which return a copy of the object.) On the other hand, using a building.metadata dict allows us to tidy away all the metadata into a single object within Building.

What do you think?

JackKelly commented 10 years ago

Another question that arises from thinking about how to import tracebase (see issue #39)...

Lots of buildings have multiple instances of the same class of appliance. We have something like 4 computers in our home. My parents have two fridges. And an industrial / commercial building might have thousands of appliances of the same type.

In the current design, we're planning to handle this situation like this:

building.appliances = {'fridge1': DataFrame, 'fridge2': DataFrame, 
                       'computer1': DataFrame}

But maybe it would be more elegant to instead use a dict of lists of DataFrames... something like this:

building.appliances = {'fridge': [DataFrame, DataFrame], 
                       'computer': [DataFrame]}

(where we'd use standard names for each appliance; and use a separate dict to map from specific appliances to rooms and sensor classes)

This would make it much more simple to find, for example, all 'fridge' appliances.

What do you think?

JackKelly commented 10 years ago

Some updates:

Should we have separate `appliances` and `circuits` dataframes?

Above, I argued that:

If we store the electrical wiring configuration in a graph data structure (which allows us to naturally represent arbitrary levels of meter hierarchies) and given that there isn't always a perfectly clean distinction between what's a "circuit" and what's an "appliance" then I wonder if we can simplify our Electricity class by removing the "circuits" DataFrame and renaming the "appliances" DataFrame to something like "submetered"?

I am now leaning back towards thinking that we should have separate dataframes for mains, circuits and appliances (as per the code at the moment).

If we have a single appliance channel to which multiple appliances are connected (e.g. a single meter which records both the tv and dvd player) then how about this: Pandas lets us use a tuple as a column name so we can just use a tuple of ApplianceNames.

Store building metadata in a `metadata` dict

I think I'm leaning towards storing data like the date the building was built, number of floors etc in a metadata dict; but we should define all the valid keys and values somewhere. Maybe in a docs/standard_names/buildings.json file?
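A small sketch of how a set of standard keys could be enforced on a metadata dict. The keys, allowed values and the `validate_metadata` helper are illustrative only; a real version would presumably load the rules from the standard-names file rather than hard-code them.

```python
from typing import Any, Dict

# Stand-in for the contents of a standard-names file; these keys and allowed
# values are illustrative only.
STANDARD_BUILDING_METADATA: Dict[str, Any] = {
    "type": {"commercial", "residential"},
    "year_built": int,
    "n_floors": int,
    "n_occupants": int,
}


def validate_metadata(metadata: Dict[str, Any]) -> None:
    """Raise ValueError on unknown keys or values of the wrong kind."""
    for key, value in metadata.items():
        if key not in STANDARD_BUILDING_METADATA:
            raise ValueError(f"unknown metadata key: {key!r}")
        rule = STANDARD_BUILDING_METADATA[key]
        if isinstance(rule, set):
            ok = value in rule          # value must be one of the listed names
        else:
            ok = isinstance(value, rule)  # value must have the expected type
        if not ok:
            raise ValueError(f"invalid value for {key!r}: {value!r}")


metadata = {"type": "commercial", "n_floors": 3}
validate_metadata(metadata)  # passes silently
```

This keeps the flexibility of a dict while still catching misspelt keys such as 'Type' or 'floors' early.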

Use a dict of DataFrames or a dict of lists for appliances?

In the comment immediately above, I had argued for using a dict of lists for appliances. I've gone off this idea and think we should stick to a simple dict of dataframes.

Using NamedTuples as column names

We had previously discussed using strings like tv_1_active to mean this channel records data from the building's first TV and it records active power data. Using strings like this makes the column names human-readable but would mean that we'd have to use regexes (or similar) to process these column names. Instead, I wonder if we should use namedtuples (I have checked and Pandas allows us to do this):

In [31]: ApplianceName = namedtuple('ApplianceName',
                                    ['name', 'instance', 'measurement'])

In [33]: df = pd.DataFrame(columns=[ApplianceName(name='kettle', 
                                                  instance=1, 
                                                  measurement='active')])

In [34]: df
Out[34]: 
Empty DataFrame
Columns: [(kettle, 1, active)]
Index: []

In [35]: df.columns[0]
Out[35]: ApplianceName(name='kettle', instance=1, measurement='active')

What do you think? I've gone ahead and specified this proposal in more detail in appliance.py.

nipunbatra commented 10 years ago

Store building metadata in a metadata dict

Both ways fine with me, either class properties or dict.

If we have a single appliance channel to which multiple appliances are connected (e.g. a single meter which records both the tv and dvd player) then how about this: Pandas lets us use a tuple as a column name so we can just use a tuple of ApplianceNames.

I think this can complicate stuff. Or am I missing something? I guess a single appliance meter sensing multiple appliances like TV+Dish is always an issue, even when we disaggregate! I think we need to see, if the entire pipeline till disaggregation and metrics would be fine with such tuples of ApplianceNames

Using NamedTuples as column names

Brilliant! I think column names should be immutable, so this is fine.

Another thing to ensure would be to see if we can support queries which we were earlier thinking of: Find list of columns:

All columns measuring active power
All attributes of kettle etc
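Both queries can be sketched without any regexes once the column names are namedtuples. The `select` helper and the sample columns are hypothetical; with a real DataFrame the list would come from its columns.

```python
from collections import namedtuple

ApplianceName = namedtuple("ApplianceName", ["name", "instance", "measurement"])

# Made-up column names for one building.
columns = [
    ApplianceName("kettle", 1, "active"),
    ApplianceName("kettle", 1, "apparent"),
    ApplianceName("fridge", 1, "active"),
    ApplianceName("fridge", 2, "active"),
]


def select(columns, **criteria):
    """Columns whose fields equal every keyword given, e.g. name='kettle'."""
    return [
        col
        for col in columns
        if all(getattr(col, field) == wanted for field, wanted in criteria.items())
    ]


active_columns = select(columns, measurement="active")
kettle_columns = select(columns, name="kettle")
second_fridge = select(columns, name="fridge", instance=2)
print(len(active_columns), len(kettle_columns), second_fridge)
```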

Some brilliant designing!

JackKelly commented 10 years ago

Store building metadata in a metadata dict or as class attributes

I think that perhaps I'm leaning back towards using class attributes, and just using the bare minimum for now. I don't have strong feelings either way, though ;)

Using tuples of ApplianceNames as column names

I think this can complicate stuff.

Yeah, it definitely does ;) But some NILM ground-truth datasets do use the same individual appliance meter for multiple appliances (for example, we use a single meter for both our toaster and our toasted sandwich maker) and I can't think of a better way of representing this situation in our design. This element of the design is complicated but that's because the world we're modelling is complicated ;)

I think we need to see, if the entire pipeline till disaggregation and metrics would be fine with such tuples of ApplianceNames

I agree. So, shall we go ahead and use tuples of ApplianceNames, and think again if it breaks stuff downstream?

Some brilliant designing!

Yay to collaborative designing! I think we're getting close to having the main design in place ;)

JackKelly commented 10 years ago

While I remember:

So, we are planning to represent three different physical quantities in our buildings: energy, voltage and power.

We also want to have a single file which defines all the valid column names.

But it would also be nice to have a clean way to select all the "power" or "energy" columns.

So maybe our measurement.json file could look something like this... this would both define the valid column names and also capture the relevant hierarchy:

{
"power": [
          {"name": "active",   "units": "watt", "abbreviation": "W"},
          {"name": "reactive", "units": "volt-ampere reactive", "abbreviation": "var"},
          {"name": "apparent", "units": "volt-ampere", "abbreviation": "VA"}
          ],
"energy": [
          {"name": "active",   "units": "kilo watt hour", "abbreviation": "kWh"},
          {"name": "reactive", "units": "kilo volt-ampere reactive hour", "abbreviation": "kvarh"},
          {"name": "apparent", "units": "kilo volt-ampere hour", "abbreviation": "kVAh"}
          ],
"voltage":[
          {"name": "voltage",  "units": "volts", "abbreviation": "V"}
          ]
}

(it's very likely that I've gotten some of the electrical engineering specifics wrong; this is just an illustration)

This would allow us to label our columns as we had previously planned (e.g. 'active' or 'apparent' or 'voltage' etc) and then easily ask questions like "get all power columns". And would also allow us to easily output human-readable data like "90 kWh". I wonder if there's also a way to encode conversions in there... maybe specify conversions from SI units (Joules, seconds, coulombs) so that we can convert from kWh to joules easily?
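A sketch of how such a file might be used, including one way to encode conversions. The `to_si` field and the helper functions are additions made for this example, and the embedded JSON is a cut-down stand-in for the real file.

```python
import json

# A cut-down, syntactically valid version of the proposed file. The "to_si"
# field (multiplier to watts or joules) is an addition made for this sketch.
MEASUREMENT_JSON = """
{
  "power": [
    {"name": "active", "abbreviation": "W", "to_si": 1.0},
    {"name": "apparent", "abbreviation": "VA", "to_si": 1.0}
  ],
  "energy": [
    {"name": "active", "abbreviation": "kWh", "to_si": 3600000.0}
  ]
}
"""

spec = json.loads(MEASUREMENT_JSON)


def names_for(quantity):
    """All valid column names for a physical quantity, e.g. 'power'."""
    return [entry["name"] for entry in spec[quantity]]


def lookup(quantity, name):
    for entry in spec[quantity]:
        if entry["name"] == name:
            return entry
    raise KeyError((quantity, name))


def format_reading(value, quantity, name):
    """Human-readable value with its unit abbreviation."""
    return f"{value:g} {lookup(quantity, name)['abbreviation']}"


def to_si(value, quantity, name):
    """Convert to the SI unit using the multiplier stored in the file."""
    return value * lookup(quantity, name)["to_si"]


power_names = names_for("power")
label = format_reading(90, "energy", "active")
joules = to_si(90, "energy", "active")
print(power_names, label, joules)
```

Note that 'active' appears under both power and energy, so a column label alone would not say which quantity it holds; the quantity has to be passed alongside the name, as done here.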

JackKelly commented 10 years ago

one other quick thought... I'd suggest that we specify numpy.float32 for storing power measurements (that's what i've always used). I can't think of why we'd need the huge range and precision of float64 (the default, IIRC).

nipunbatra commented 10 years ago

Makes sense. Does this also reduce computation time? I would think it does?

JackKelly commented 10 years ago

I think it will have a small beneficial effect when using normal CPU FPU and it will have an even larger effect if we ever use SIMD or GPU ;) single-precision maths is often about 20x faster than double-precision on a consumer-grade GPU ;)
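The speed difference is hard to check in a few lines, but the storage and precision side of the float32 proposal can be sketched quickly. The sample power values are made up.

```python
import numpy as np

# Made-up power readings in watts, stored at double precision first.
power_w = np.array([0.0, 2300.5, 59.987654321], dtype=np.float64)

# Down-cast to single precision, as proposed for stored power measurements.
power_w32 = power_w.astype(np.float32)

bytes_64 = power_w.nbytes    # 8 bytes per sample
bytes_32 = power_w32.nbytes  # 4 bytes per sample

# Largest absolute rounding error introduced by the down-cast, in watts.
max_abs_error = float(np.max(np.abs(power_w - power_w32.astype(np.float64))))
print(bytes_64, bytes_32, max_abs_error)
```

For readings in the range of household power, the rounding error is far below the resolution of typical meters, while memory use is halved.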
