popolo-project / popolo-spec

International legislative data specifications
http://www.popoloproject.com/

Document a CSV serialization #107

Closed. jpmckinney closed this issue 4 years ago.

jpmckinney commented 8 years ago

Options:

- HXL-style hashtag headers: embed machine-readable column headers as an extra row in the CSV itself.
- W3C tabular metadata (CSV on the Web): describe the columns in a separate metadata file.

Including the machine-readable column headers in the same file allows authors to use whatever human-readable column headers they like, without having to create or maintain a separate metadata file. The advantage of a metadata file is that a third party can create it for content that it does not control.

For self-publishing, HXL is likely the better choice, but for making third-party CSVs machine-readable, W3C's tabular metadata is necessary.

Note that if HXL headers are used, it is straightforward to use that data to create a metadata file, so most of the toolchain could focus on metadata rather than always implementing both options.
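For concreteness, here's a rough sketch of that derivation. The file name, hashtag tags, and column mapping below are made up for illustration; they are not an official HXL dictionary or an agreed Popolo profile:

```python
import csv
import json

# Assume "people.csv" whose second row carries HXL-style hashtag tags under
# the human-readable headers (tags here are illustrative). From those tags we
# derive a minimal W3C "Metadata Vocabulary for Tabular Data" file, so
# downstream tools only need to understand the metadata file.
with open("people.csv", encoding="utf-8") as f:
    reader = csv.reader(f)
    human_headers = next(reader)   # e.g. "Full name", "Email"
    hxl_tags = next(reader)        # e.g. "#person+name", "#person+email"

metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "people.csv",
    "dialect": {"headerRowCount": 2},
    "tableSchema": {
        "columns": [
            {"titles": title, "name": tag.lstrip("#").replace("+", "_")}
            for title, tag in zip(human_headers, hxl_tags)
        ]
    },
}

with open("people.csv-metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```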

Tasks:

pudo commented 8 years ago

I'm really keen to work on an implementation of this; see https://github.com/pudo/jsonmapping (the README example is Popolo). Overall, I think an explicit mapping might be more appropriate for Popolo's complex data structure: deep nesting might be awkward to express in HXL, and HXL requires modifying the source files in a way that makes them harder to parse for non-HXL CSV readers.
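To illustrate what an explicit mapping buys you, here is a hand-rolled sketch. The column names and file name are made up, and this is not the jsonmapping API, just the general idea of mapping flat rows to nested Popolo-style JSON:

```python
import csv

# Hypothetical flat columns -> nested Popolo-style structure.
def row_to_person(row):
    return {
        "name": row["person_name"],
        "email": row["person_email"] or None,
        "memberships": [
            {
                "role": row["membership_role"],
                "organization": {"name": row["org_name"]},
            }
        ],
    }

with open("memberships.csv", encoding="utf-8") as f:
    people = [row_to_person(row) for row in csv.DictReader(f)]
```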

jpmckinney commented 8 years ago

See also http://www.w3.org/TR/csv2json/ for performing a mapping. (Of course, if you can map to RDF, you can map to anything, right? ;-)

pudo commented 8 years ago

I looked at that a bit and it seemed to me like the resulting JSON is basically just a CSV table in tacky clothing -- it doesn't seem to support nested structures. Did I miss something there?

jpmckinney commented 8 years ago

If you look for the *-minimal.json examples, I think you'll find some JSON that real people are more likely to consume. The http://example.org/events-listing-minimal.json example has some nesting.

jmatsushita commented 8 years ago

I think there's a case for having both an HXL-like serialisation/validation and JSON Table Schema. I guess that in the simple case, each Popolo node (non-relational entity) would have its own CSV table. You could also provide an alternative "CSV of doom" serialisation with a column that references the entity type plus all the columns of all entities, which would be a fairly sparse table (and incredibly inconvenient to navigate). Also, just read the Open Contracting link; it says it all.

But isn't there a problem when it comes to dealing with relationships? You could have a sidekick to the CSV of doom, an "edges of doom" CSV with all the relationship-style data (Membership, but also anything more complex than opengov:area, e.g. the Area of the headquarters, or multiple Areas of activity or areas of responsibility...). If you had other relationship classes like Ownership or Contract, you could lump them all into an edges-of-doom CSV too, or break them down into multiple tables.
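For concreteness, a rough sketch of what such a combined edges table might look like, and why it ends up sparse. The column names are made up for illustration, not part of the spec:

```python
import csv

# One CSV holding every relationship type, distinguished by a class column,
# with the union of all possible columns (most empty on any given row).
EDGE_FIELDS = [
    "class",        # e.g. Membership, Ownership, Contract
    "source_id",    # id of the first node (e.g. a Person)
    "target_id",    # id of the second node (e.g. an Organization or Area)
    "role",         # only used by Membership rows
    "share",        # only used by Ownership rows
    "value",        # only used by Contract rows
]

with open("edges.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=EDGE_FIELDS)
    writer.writeheader()
    writer.writerow({"class": "Membership", "source_id": "p1",
                     "target_id": "o1", "role": "member"})
```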

That doesn't solve the problem that those CSVs (especially the doom/doom pair) are really hard to open and work on. But possibly the multi-table node / multi-table edge files could be opened and (painfully) navigated and edited... Then again, maybe serialisation isn't meant to be for humans (though I'd like it if it were).

I know that if you look at it long enough, any node looks like a relationship and vice versa, but if human-readable serialisation is a concern, then choosing which are your nodes and which are your edges matters for making sense of the world (and for making it easy to output to all those node/edge graph-viz importers out there).

Maybe it's about canonical serialisations and potential variants. Maybe wanting round-trip import/export to validating JSON is too ambitious.

I don't know. I'm ranting.

jpmckinney commented 8 years ago

I don't see any necessity for a CSV of doom with every class and property (except possibly as an import/export format that only machines will ever read/write), and I don't think anything above suggests its necessity. Do you? There's no problem with splitting the nodes/entities across multiple sheets that use a subset of the universe of possible column headings to keep things reasonable and human-usable. It's not a challenge to later merge the multiple sheets into a master document if that's needed for import/export.

For the sheets containing the relations (again, these can be split into multiple sheets based on the specific relation being collected), I've seen three approaches to identify the nodes being paired.
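As one sketch of the general idea (an assumed approach, not necessarily one of those three): stable id columns in the node sheets, referenced from a narrow relation sheet, which also makes the later merge into a master document mechanical. File and column names below are illustrative:

```python
import csv
import json

def read(path):
    with open(path, encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Node sheets, each with a subset of the possible columns plus an "id".
people = {p["id"]: p for p in read("people.csv")}
organizations = {o["id"]: o for o in read("organizations.csv")}

# Relation sheet stays narrow: person_id, organization_id, role, ...
memberships = [
    {
        "person_id": m["person_id"],
        "organization_id": m["organization_id"],
        "role": m.get("role", ""),
    }
    for m in read("memberships.csv")
]

# Merging the sheets into one document for import/export is then mechanical.
print(json.dumps({"persons": list(people.values()),
                  "organizations": list(organizations.values()),
                  "memberships": memberships}, indent=2))
```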

I don't think there is a more user-friendly way of using plain old spreadsheets for nodes and edges. There's a reason CMSs are popular :-)

I also don't see how any of the above blocks round-trip import/export to validating JSON. It's hard to tell whether you are thinking out loud or genuinely raising issues that you believe to exist.

jpmckinney commented 7 years ago

@michalskop is interested in CSV serialization for data packages.

michalskop commented 7 years ago

my example: https://github.com/michalskop/datapackages/blob/master/vaa-sk-european-parliament-2014-parties-answers/datapackage.json

I am not sure it is really a good approach, but I also need to keep it simple enough to be usable by non-technical people (e.g., political scientists).
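For reference, a minimal sketch of a Data Package descriptor along those lines, written as a Python dict. The resource path, field names, and types are illustrative rather than copied from that repository:

```python
import json

# Minimal Data Package descriptor with one CSV resource and a Table Schema.
datapackage = {
    "name": "popolo-people",
    "resources": [
        {
            "name": "people",
            "path": "people.csv",
            "schema": {
                "fields": [
                    {"name": "id", "type": "string"},
                    {"name": "name", "type": "string"},
                    {"name": "email", "type": "string"},
                ],
                "primaryKey": "id",
            },
        }
    ],
}

with open("datapackage.json", "w", encoding="utf-8") as f:
    json.dump(datapackage, f, indent=2)
```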