nilmtk / nilm_metadata

A schema for modelling meters, measurements, appliances, buildings etc
http://nilm-metadata.readthedocs.org
Apache License 2.0

Metadata file formats #4

Closed JackKelly closed 10 years ago

JackKelly commented 10 years ago

At present, NILM Metadata uses YAML to store metadata. I've been doing some research on alternative formats. (It's quite likely that we'll stick with YAML, though).

CSV

A lot of NILM Metadata is tabular. For example, for each appliance, we need to know the dataset_id and building_id the appliance belongs in, we need to know the appliance_type (fridge, toaster etc) and appliance_instance. This type of tabular data can be stored in YAML but CSV is considerably more efficient at storing tabular data.
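To make the difference concrete, here are the same two appliance records in each format (the field names are illustrative, not the final schema):

```yaml
# Illustrative YAML: every key is repeated for every record
- dataset_id: UK-DALE
  building_id: 1
  appliance_type: fridge
  appliance_instance: 1
- dataset_id: UK-DALE
  building_id: 1
  appliance_type: toaster
  appliance_instance: 1
```

```
dataset_id,building_id,appliance_type,appliance_instance
UK-DALE,1,fridge,1
UK-DALE,1,toaster,1
```

CSV states the keys once in the header row, so the per-record cost is just the values.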

CSV is rather unfashionable at the moment but there is some really interesting work on making CSV a better format. For example, Jeni Tennison at the ODI wrote a blog post on "2014: The Year of CSV". To quote her blog:

There is a sweet spot between developer-friendly, unexpressive CSV and developer-hostile, expressive Excel.

Formats such as the Simple Data Format (SDF) developed by OKF and the DataSet Publishing Language (DSPL) developed by Google sit in that sweet spot. They both define how to package CSV files with a metadata description (in JSON or in XML) that describes the content of the CSV files, and how they relate to each other.

Formalising and standardising the sweet spot is the role of the CSV on the Web Working Group which I am co-chairing with Dan Brickley.
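For a flavour of what that packaging looks like, here's a hypothetical descriptor, loosely in the SDF / Data Package style (file name and field types are illustrative, not copied from either spec):

```json
{
  "name": "nilm-metadata-example",
  "resources": [
    {
      "path": "appliances.csv",
      "schema": {
        "fields": [
          {"name": "dataset_id", "type": "string"},
          {"name": "building_id", "type": "integer"},
          {"name": "appliance_type", "type": "string"},
          {"name": "appliance_instance", "type": "integer"}
        ]
      }
    }
  ]
}
```

The CSV files stay dumb and tabular; the JSON descriptor carries the types and the relationships between tables.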

Advantages of CSV for NILM Metadata:

  • Simple, well supported format. Lots of software can load CSV (Python, R, MATLAB, Excel, DBMSes etc)
  • More space-efficient than JSON/YAML for tabular data
  • Makes validation very easy (just load it into Pandas with a dtype specified for each column and the loader will complain about wrong types).
  • We're thinking of storing metadata in tables within NILMTK anyway, so we could minimise the amount of code we need to write to import/export metadata if the on-disk format mimics the in-memory format.
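The validation point can be sketched in a few lines (column names are illustrative, not the final schema): with dtypes specified, `read_csv` raises if a column contains values of the wrong type.

```python
# Sketch: validating tabular appliance metadata on load with pandas.
import io
import pandas as pd

csv_text = """dataset_id,building_id,appliance_type,appliance_instance
UK-DALE,1,fridge,1
UK-DALE,1,toaster,1
"""

# Declaring dtypes means pandas itself acts as a (basic) schema validator:
# a non-integer in building_id would raise a ValueError here.
appliances = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"building_id": "int64", "appliance_instance": "int64"},
)
print(appliances.dtypes["building_id"])  # int64
```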

Disadvantages of CSV for NILM Metadata:

  • No especially elegant way to add extra fields for appliances. Could have an 'extra fields' string column and specify data using YAML/JSON. Or could have additional tables, one for each extra field.
  • No elegant way to represent lists (e.g. list of meters for appliance). Either use YAML/JSON in a field or have a separate table mapping from appliances to meters.
  • CSV is not as human-readable as YAML when using a simple text editor.
  • Lots of CSV 'dialects'. Although we could just standardise on RFC4180
  • Some of our metadata definitely isn't a good fit for CSV. For example, the top-level dataset information would be pretty ugly to represent in CSV. So if we did use CSV for some of our metadata, we'd still probably want YAML/JSON for other bits, which perhaps adds complexity (although there is something to be said for using the most appropriate format for each type of metadata).
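The "JSON in a field" workaround for list-valued fields is straightforward, for what it's worth. A minimal sketch (column names and meter numbers are illustrative):

```python
# Sketch: a list-valued field (meters per appliance) embedded as JSON
# inside a CSV column, decoded on load with only the stdlib.
import csv
import io
import json

csv_text = 'appliance_type,meters\nfridge,"[8]"\nwasher_dryer,"[10, 11]"\n'

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    row["meters"] = json.loads(row["meters"])  # decode the embedded list
    rows.append(row)

print(rows[1]["meters"])  # [10, 11]
```

It works, but the quoting-inside-quoting is exactly the kind of thing that makes the file less pleasant to edit by hand.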

HDF5

HDF5 can easily store metadata. We use HDF5 for storing meter data in NILMTK but we've avoided storing metadata in the HDF5.

Advantages of using HDF5 for metadata:

XML

XML doesn't map well to in-memory data structures, and it's extremely verbose. It does have a mature schema-definition language, though. I'm not a big fan of XML.

Conclusions

I think the issue of "CSV vs YAML" boils down to one main question: should we prioritise human-readability over machine-readability? I think we definitely should.

CSV, even though it can be loaded by any spreadsheet program, might not be especially human-readable for NILM Metadata. The reason is that CSV doesn't support lists in fields, so we'd need multiple tables, which makes it harder for humans to parse. Or maybe it doesn't. I'll do some experiments...

JackKelly commented 10 years ago

After spending the day playing with NILM Metadata, I have started to believe several things:

  1. My existing proposal for NILM Metadata is too complex (it does lots of stuff that, frankly, I doubt will ever be required; thus it breaks YAGNI). I think we can do without the inheritance mechanism and the concept of appliances containing components.
  2. When people use NILM Metadata as a whole, they are likely to load it into some form of database (which, in the case of NILMTK, may just be a set of Pandas DataFrames). Hence it does make sense to store the data in tabular form on disk.
  3. I'm starting to appreciate the simplicity of CSV, and the ease with which it can be edited in a spreadsheet.

As such, I've made progress on a new design for NILM Metadata. Here's the database layout (the diagram is a bit broken but most of it is correct). The tables in the 'Dataset' blue box would be shipped with each dataset; the other tables would be stored in NILM Metadata / NILMTK (and, ultimately, maybe a semantic wiki):

[diagram: nilm_metadata_0.2.0-1_diagram]

The very beginnings of the new design are in the CSV Simplification branch.

This new design should require virtually no code to implement (unlike v0.1, which required quite a lot of code for all the 'concatenation' and 'inheritance' stuff). So, once the first complete draft schema has been designed (which is almost done) and I've tested it by converting my UK-DALE metadata to the new design, I can move back to full-time work on NILMTK refactoring. (Which reminds me: with this new, minimalistic NILM Metadata design, it might make more sense to fold NILM Metadata into NILMTK...)

JackKelly commented 10 years ago

More ideas:

Use YAML, not CSV

Maybe use directory structure like this:

metadata/
    dataset.yaml 
    building1/
        building1.yaml 
        appliances.yaml 
        emeter.yaml 

Stick with inheritance for common appliance data. Use the simplification ideas from the CSV schema in the YAML.
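For instance, building1/appliances.yaml might look something like this (the field names are purely illustrative; the actual schema is still being designed):

```yaml
# Hypothetical appliances.yaml for building1 (field names illustrative)
- type: fridge
  instance: 1
  meters: [8]        # meter instances this appliance is metered by
- type: washer dryer
  instance: 1
  meters: [10, 11]
```

YAML handles the list-of-meters case natively, which was the main sticking point with CSV above.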

JackKelly commented 10 years ago

For now, I think I should port the inheritance mechanism to NILMTK to pull in the common data for each appliance, but maybe keep the dataset metadata in HDF5 and just convert it in the conversion script.

JackKelly commented 10 years ago

Things to add to the schema:

JackKelly commented 10 years ago

Just to confirm: I decided against using CSV. YAML seems a much better fit.