Closed JackKelly closed 10 years ago
After spending the day playing with NILM Metadata, I have started to believe several things:
As such, I've made progress on a new design for NILM Metadata. Here's the database layout (the diagram is a bit broken but most of it is correct... the tables in the 'Dataset' blue box would be shipped with each dataset; the other tables would be stored in NILM Metadata / NILMTK (and, ultimately, a semantic wiki, maybe)):
The very beginnings of the new design are in the CSV Simplification branch
This new design should require virtually no code to implement (unlike v0.1, which required quite a lot of code for all the 'concatenation' and 'inheritance' stuff). So, once the first complete draft schema has been designed (which is almost done), and I've tested the schema by converting my UK-DALE metadata to the new design, I can move back to full-time work on NILMTK refactoring. (Which reminds me: with this new, minimalistic NILM Metadata design, it might make more sense to fold NILM Metadata into NILMTK...)
More ideas:
Use YAML, not CSV
Maybe use directory structure like this:
metadata/
    dataset.yaml
    building1/
        building1.yaml
        appliances.yaml
        emeter.yaml
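A loader for a layout like this might begin by discovering which files exist. Here is a minimal sketch using only the Python standard library. The function name `find_metadata_files` is hypothetical, and it assumes every sub-directory of the metadata root is a building. Parsing each file would be left to a YAML library.

```python
from pathlib import Path


def find_metadata_files(root):
    """Return (dataset_file, buildings) for a metadata directory.

    `buildings` maps each building directory name to a sorted list of
    the YAML file names found inside it.  Hypothetical helper: it
    assumes every sub-directory of `root` describes one building.
    """
    root = Path(root)
    dataset_file = root / "dataset.yaml"
    buildings = {}
    for subdir in sorted(p for p in root.iterdir() if p.is_dir()):
        buildings[subdir.name] = sorted(f.name for f in subdir.glob("*.yaml"))
    return dataset_file, buildings
```

Calling `find_metadata_files("metadata")` on the layout above should give the path to dataset.yaml plus a dict with one entry, building1, listing its three YAML files.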
Stick with inheritance for common appliance data. Use simplification ideas from CSV schema in YAML.
For now, I think I should port the inheritance mechanism to nilmtk to pull in the common data for each appliance, but maybe keep the dataset metadata in HDF5 and just convert it in the conversion script.
Things to add to the schema:
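To make the inheritance idea concrete, here is a rough sketch of how common appliance data could be merged down a chain of parent types. It is not the NILM Metadata implementation. The `parent` key, the `resolve` function and the property names in the example catalogue are all assumptions made for illustration. The sketch only does simple overriding (child values replace parent values), not the 'concatenation' behaviour mentioned for v0.1.

```python
def resolve(appliance_type, catalogue):
    """Merge an appliance type with all of its ancestors.

    `catalogue` maps type name -> dict of properties.  The optional
    'parent' key (an assumed convention) names the type to inherit from.
    Child values override parent values.
    """
    chain = []
    seen = set()
    name = appliance_type
    while name is not None:
        if name in seen:
            raise ValueError(f"inheritance cycle at {name!r}")
        seen.add(name)
        entry = catalogue[name]
        chain.append(entry)
        name = entry.get("parent")

    merged = {}
    for entry in reversed(chain):  # apply the root ancestor first
        merged.update(entry)
    merged.pop("parent", None)  # the link itself is not appliance data
    return merged


# Illustrative catalogue; property names and values are made up.
catalogue = {
    "appliance": {"categories": ["misc"], "on_power_threshold": 5},
    "cold appliance": {"parent": "appliance", "categories": ["cold"]},
    "fridge": {"parent": "cold appliance", "on_power_threshold": 50},
}
```

Here `resolve("fridge", catalogue)` takes `categories` from "cold appliance" and `on_power_threshold` from "fridge" itself.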
Just to confirm: I decided against using CSV. YAML seems a much better fit.
At present, NILM Metadata uses YAML to store metadata. I've been doing some research on alternative formats. (It's quite likely that we'll stick with YAML, though).
CSV
A lot of NILM Metadata is tabular. For example, for each appliance, we need to know the dataset_id and building_id the appliance belongs in, we need to know the appliance_type (fridge, toaster etc) and appliance_instance. This type of tabular data can be stored in YAML but CSV is considerably more efficient at storing tabular data.
CSV is rather unfashionable at the moment but there is some really interesting work on making CSV a better format. For example, Jeni Tennison at the ODI wrote a blog post on "2014: The Year of CSV". To quote her blog:
HDF5
HDF5 can easily store metadata. We use HDF5 for storing meter data in NILMTK but we've avoided storing metadata in the HDF5.
Advantages of using HDF5 for metadata:
Disadvantages of using HDF5 for metadata:
XML
XML doesn't map well to in-memory data structures. It's also extremely verbose. It does have a mature schema definition language though. I'm not a big fan of XML.
Conclusions
I think the issue of "CSV vs YAML" boils down to one main question: should we prioritise human-readability over machine-readability? I think we definitely should.
CSV, even though it can be loaded by any spreadsheet program, might not be especially human-readable for NILM Metadata. The reason is that CSV doesn't support lists in fields, so we would need to use multiple tables, which makes the data harder for humans to parse. Or maybe it doesn't. I'll do some experiments...
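As one possible starting point for such an experiment, the sketch below compares two ways of representing a list-valued field in CSV: a separate link table, and a single table with the list packed into one field using ';'. The packed variant is my own illustrative assumption rather than something proposed above. The `meter_ids` / `meter_id` columns and the sample rows are made up.

```python
import csv
import io

# Variant 1: one table, with the list packed into a single field.
PACKED = """appliance_type,appliance_instance,meter_ids
fridge,1,3
washer dryer,1,4;5
"""

# Variant 2: two tables, one row per (appliance, meter) link.
APPLIANCES = """appliance_type,appliance_instance
fridge,1
washer dryer,1
"""
APPLIANCE_METERS = """appliance_type,appliance_instance,meter_id
fridge,1,3
washer dryer,1,4
washer dryer,1,5
"""


def _key(row):
    return (row["appliance_type"], int(row["appliance_instance"]))


def load_packed(text):
    """Read variant 1: split the ';'-separated field into a list."""
    result = {}
    for row in csv.DictReader(io.StringIO(text)):
        result[_key(row)] = [int(m) for m in row["meter_ids"].split(";")]
    return result


def load_two_tables(appliances_text, links_text):
    """Read variant 2: join the link table onto the appliance table."""
    result = {}
    for row in csv.DictReader(io.StringIO(appliances_text)):
        result[_key(row)] = []
    for row in csv.DictReader(io.StringIO(links_text)):
        result[_key(row)].append(int(row["meter_id"]))
    return result
```

Both loaders produce the same in-memory mapping, so the question is which file form is easier for a person to read and edit by hand.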