popolo-project / popolo-spec

International legislative data specifications
http://www.popoloproject.com/

Document a CSV serialization #107

Closed. jpmckinney closed this issue 4 years ago.

jpmckinney commented 8 years ago

Options:

- HXL-style hashtag headers: embed machine-readable column headers as an extra row in the CSV itself.
- W3C tabular metadata (CSV on the Web): describe the columns in a separate metadata file.

Including the machine-readable column headers in the same file allows authors to use whatever human-readable column headers they like, without having to create or maintain a separate metadata file. The advantage of a metadata file is that a third party can create it for content that it does not control.

For self-publishing, HXL is likely the better choice, but for making third-party CSVs machine-readable, W3C's tabular metadata is necessary.

Note that if HXL headers are used, it is straightforward to use that data to create a metadata file, so most of the toolchain could focus on metadata rather than always implementing both options.
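For concreteness, here's a rough sketch of that derivation. The file name, hashtag tags, and column mapping below are made up for illustration; they are not an official HXL dictionary or an agreed Popolo profile:

```python
import csv
import json

# Assume "people.csv" whose second row carries HXL-style hashtag tags under
# the human-readable headers (tags here are illustrative). From those tags we
# derive a minimal W3C "Metadata Vocabulary for Tabular Data" file, so
# downstream tools only need to understand the metadata file.
with open("people.csv", encoding="utf-8") as f:
    reader = csv.reader(f)
    human_headers = next(reader)   # e.g. "Full name", "Email"
    hxl_tags = next(reader)        # e.g. "#person+name", "#person+email"

metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "people.csv",
    "dialect": {"headerRowCount": 2},
    "tableSchema": {
        "columns": [
            {"titles": title, "name": tag.lstrip("#").replace("+", "_")}
            for title, tag in zip(human_headers, hxl_tags)
        ]
    },
}

with open("people.csv-metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```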

Tasks:

pudo commented 8 years ago

I'm really keen to work on an implementation of this; see https://github.com/pudo/jsonmapping (the README example is Popolo). Overall, I think an explicit mapping might be more appropriate for Popolo's complex data structure: deep nesting might be awkward to express in HXL, and HXL requires modifying the source files in a way that makes them harder to parse for non-HXL CSV readers.
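To illustrate what an explicit mapping buys you, here is a hand-rolled sketch. The column names and file name are made up, and this is not the jsonmapping API, just the general idea of mapping flat rows to nested Popolo-style JSON:

```python
import csv

# Hypothetical flat columns -> nested Popolo-style structure.
def row_to_person(row):
    return {
        "name": row["person_name"],
        "email": row["person_email"] or None,
        "memberships": [
            {
                "role": row["membership_role"],
                "organization": {"name": row["org_name"]},
            }
        ],
    }

with open("memberships.csv", encoding="utf-8") as f:
    people = [row_to_person(row) for row in csv.DictReader(f)]
```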

jpmckinney commented 8 years ago

See also http://www.w3.org/TR/csv2json/ for performing a mapping. (Of course, if you can map to RDF, you can map to anything, right? ;-)

pudo commented 8 years ago

I looked at that a bit and it seemed to me like the resulting JSON is basically just a CSV table in tacky clothing -- it doesn't seem to support nested structures. Did I miss something there?

jpmckinney commented 8 years ago

If you look for the *-minimal.json examples, I think you'll find some JSON that real people are more likely to consume. The http://example.org/events-listing-minimal.json example has some nesting.

jmatsushita commented 8 years ago

I think there's a case for having both an HXL-like serialisation/validation and JSON Table Schema. I guess that in the simple case, each Popolo node (non-relational entity) would have its own CSV table. You could also provide an alternative "CSV of doom" serialisation with a column that references the entity type plus all the columns of all entities, which would be a fairly sparse table (and incredibly inconvenient to navigate). Also, just read the Open Contracting link; it says it all.

But isn't there a problem when it comes to dealing with relationships? You could have a sidekick to the CSV of doom, an "edges of doom" CSV with all the relationship-style data (Membership, but also anything more complex than opengov:area, e.g. the Area of the headquarters, or multiple Areas of activity or areas of responsibility...). If you had other relationship classes like Ownership or Contract, you could lump them all into an edges-of-doom CSV too, or break them down into multiple tables.
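For concreteness, a rough sketch of what such a combined edges table might look like, and why it ends up sparse. The column names are made up for illustration, not part of the spec:

```python
import csv

# One CSV holding every relationship type, distinguished by a class column,
# with the union of all possible columns (most empty on any given row).
EDGE_FIELDS = [
    "class",        # e.g. Membership, Ownership, Contract
    "source_id",    # id of the first node (e.g. a Person)
    "target_id",    # id of the second node (e.g. an Organization or Area)
    "role",         # only used by Membership rows
    "share",        # only used by Ownership rows
    "value",        # only used by Contract rows
]

with open("edges.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=EDGE_FIELDS)
    writer.writeheader()
    writer.writerow({"class": "Membership", "source_id": "p1",
                     "target_id": "o1", "role": "member"})
```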

That doesn't solve the problem that those CSVs (especially the doom/doom pair) are really hard to open and work on. But possibly the multi-table node / multi-table edge files could be opened and (painfully) navigated and edited... Then again, maybe serialisation isn't meant to be for humans (though I'd like it if it were).

I know that if you look at it long enough, any node looks like a relationship and vice versa, but if human-readable serialisation is a concern, then choosing which are your nodes and which are your edges matters for making sense of the world (and for making it easy to output to all those node/edge graph-viz importers out there).

Maybe it's about canonical serialisations and potential variants. Maybe wanting round-trip import/export to validating JSON is too ambitious.

I don't know. I'm ranting.

jpmckinney commented 8 years ago

I don't see any necessity for a CSV of doom with every class and property (except possibly as an import/export format that only machines will ever read/write), and I don't think anything above suggests its necessity. Do you? There's no problem with splitting the nodes/entities across multiple sheets that use a subset of the universe of possible column headings to keep things reasonable and human-usable. It's not a challenge to later merge the multiple sheets into a master document if that's needed for import/export.

For the sheets containing the relations (again, these can be split into multiple sheets based on the specific relation being collected), I've seen three approaches to identify the nodes being paired.
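As one sketch of the general idea (an assumed approach, not necessarily one of those three): stable id columns in the node sheets, referenced from a narrow relation sheet, which also makes the later merge into a master document mechanical. File and column names below are illustrative:

```python
import csv
import json

def read(path):
    with open(path, encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Node sheets, each with a subset of the possible columns plus an "id".
people = {p["id"]: p for p in read("people.csv")}
organizations = {o["id"]: o for o in read("organizations.csv")}

# Relation sheet stays narrow: person_id, organization_id, role, ...
memberships = [
    {
        "person_id": m["person_id"],
        "organization_id": m["organization_id"],
        "role": m.get("role", ""),
    }
    for m in read("memberships.csv")
]

# Merging the sheets into one document for import/export is then mechanical.
print(json.dumps({"persons": list(people.values()),
                  "organizations": list(organizations.values()),
                  "memberships": memberships}, indent=2))
```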

I don't think there is a more user-friendly way of using plain old spreadsheets for nodes and edges. There's a reason CMSs are popular :-)

I also don't see how any of the above blocks round-trip import/export to validating JSON. It's hard to tell whether you are thinking out loud or genuinely raising issues that you believe to exist.

jpmckinney commented 7 years ago

@michalskop is interested in CSV serialization for data packages.

michalskop commented 7 years ago

my example: https://github.com/michalskop/datapackages/blob/master/vaa-sk-european-parliament-2014-parties-answers/datapackage.json

I am not sure it is really a good approach, but I also need to keep it simple enough to be usable by non-technical people (e.g., political scientists).
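For reference, a minimal sketch of a Data Package descriptor along those lines, written as a Python dict. The resource path, field names, and types are illustrative rather than copied from that repository:

```python
import json

# Minimal Data Package descriptor with one CSV resource and a Table Schema.
datapackage = {
    "name": "popolo-people",
    "resources": [
        {
            "name": "people",
            "path": "people.csv",
            "schema": {
                "fields": [
                    {"name": "id", "type": "string"},
                    {"name": "name", "type": "string"},
                    {"name": "email", "type": "string"},
                ],
                "primaryKey": "id",
            },
        }
    ],
}

with open("datapackage.json", "w", encoding="utf-8") as f:
    json.dump(datapackage, f, indent=2)
```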