Data format / markup language

andru commented 8 years ago

There's been some discussion in #4 and #5 about the data format/markup language to use to represent the data.

Currently the options presented are:

CSV
JSON
YAML
JSONLD

If anyone has a case to make for/against these or another to propose, here's the place!

andru commented 8 years ago

I think, given talk in #5 about linked data and using appropriate ontologies where available, JSONLD is the most appropriate of those listed. Aesthetically I prefer it to RDF/XML, and easy clientside crunching of JSONLD is a major plus. It's not my field so I can't make too strong of a case without a bunch more research... @pmackay?

mstenta commented 8 years ago

I think my preference would be either YAML or JSON/JSONLD, with maybe a slight lean toward YAML purely because it is both machine-friendly and human-friendly.

FWIW, I think I mentioned this before, but the Drupal 8 CMI (configuration management initiative) put a lot of time and thought into comparing data types, and settled on YAML. There was a LOT of debate and discussion (as is always the case when a huge project/community makes a big decision like that). I found this from 2011: https://groups.drupal.org/node/159044

mstenta commented 8 years ago

The human-readability of YAML is the biggest win, IMO. I think it will lend itself to faster adoption among non-techies.

I actually know a good number of farmers who know how to program. But I know a lot more who don't.

So the approachability of something like YAML could mean a big difference in whether or not we can get traction in these datasets, and get more people to start contributing/maintaining.

pmackay commented 8 years ago

My concern with this is: what problems are created by using text files with a structured format? It's one thing to use JSON/YAML for defining config, metadata or schema files; it's another to define 1000s of data files for plants. What happens if there is a need to change the format of the files? Is this easier or harder than updating a database structure? Does it limit the users who might interact with it?

What is gained by writing separate files?

If a database was used, it becomes easy to output data in any of the formats above.

If a file format is needed, I wonder if CSV with multiple sheets/files could be simpler? CSV is easier to edit even for non-techies, since spreadsheets can be used.

Drupal 8 CMI generates the YAML files it uses. It's helpful cos it is machine readable, but typically most D8 devs won't write it. It does help with diffing though.

andru commented 8 years ago

It's difficult to talk about these things while there is discussion with significant consequences for implementation happening in other issues, but I'll answer some questions from my personal bias.

What is gained by writing separate files?

We can bootstrap a database now using Git for versioning, and GitHub for distributed authoring while we figure out the long term tech stack for the project.

Is this easier or harder than updating a database structure?

I would favour a document store like CouchDB for a crop database. Being able to deep-nest data natively is super useful for the kind of data we're dealing with, and document stores let you query on nested data easily. On top of that, nailing a schema which applies to every crop is hard... There is a good chance of having cases where different groups need slightly varying schemas.

If a database was used, it becomes easy to output data in any of the formats above.

It also means we need to build the tools to handle authoring and conflict resolution etc. In the long run, I think CouchDB would be a great fit for this project, but I was hoping to get things moving with flatfile and develop tools as we go.

I actually know a good number of farmers who know how to program. But I know a lot more who don't.

I don't think any markup is good in the long run. The long term goal for me would be an editing interface. Flatfile markup is a bootstrap to get us going with a database based on imports from existing data sources.

mstenta commented 8 years ago

Good questions @pmackay ... I'll add to @andru's responses:

What is gained by writing separate files? If a database was used, it becomes easy to output data in any of the formats above.

Files are the least-common-denominator, in a sense. By offering individual files, you set the barrier to entry very low. People with zero database or programming knowledge generally know what a file is. And storing canonical data in files doesn't stop you from also building a "true" database on top of them. For example: a MySQL database that serves the data from memory via a REST app - but pulls the canonical data from YAML files on the hard drive. And the app can serve the data in whatever format it chooses, because it is essentially designing its MySQL table schema to match the schema of the YAML files.
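A minimal sketch of the "files as the base, database on top" idea. To stay within the Python standard library it uses JSON files and an in-memory SQLite database as stand-ins for the YAML files and MySQL mentioned above. The table name, field names, and crop values are hypothetical placeholders, not an agreed schema or real agronomic data.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def import_crop_files(directory: Path, conn: sqlite3.Connection) -> int:
    """Load every *.json crop file in `directory` into a `crops` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS crops "
        "(name TEXT PRIMARY KEY, days_to_maturity INTEGER)"
    )
    count = 0
    for path in sorted(directory.glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        # The files stay canonical; the table is a rebuildable copy of them.
        conn.execute(
            "INSERT OR REPLACE INTO crops (name, days_to_maturity) VALUES (?, ?)",
            (record["name"], record["days_to_maturity"]),
        )
        count += 1
    conn.commit()
    return count


def demo() -> list:
    """Write two placeholder crop files, import them, and query the table."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for crop in (
            {"name": "Tomato", "days_to_maturity": 80},
            {"name": "Radish", "days_to_maturity": 25},
        ):
            file_path = root / f"{crop['name'].lower()}.json"
            file_path.write_text(json.dumps(crop), encoding="utf-8")
        conn = sqlite3.connect(":memory:")
        import_crop_files(root, conn)
        rows = conn.execute(
            "SELECT name, days_to_maturity FROM crops ORDER BY name"
        ).fetchall()
        conn.close()
        return rows
```

Because the table is rebuilt from the files, each app could run an import like this against its own database of choice.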

@andru you could import into a CouchDB and your app could use that.

I would import into my farmOS database (MySQL).

The point being: we can have ALL of the above! With files as the base. :-)

AND: nothing is stopping OpenFarm.cc (or another site perhaps) from creating a web-based UI for creating/editing the crop files - and then spitting them out as YAML for inclusion in whatever dataset the user wants (either a Git repo they maintain, or they can post a pull-request to another one).

What happens if there is a need to change the format of the files? Is this easier or harder than updating a database structure?

This is certainly valid. And it's one of the reasons defining a standard - and SIMPLE - schema from the beginning is important. As for changing schema in the future, if that becomes necessary, I think it can be handled with schema versioning. See my comment outlining a potential process here: https://github.com/openfarmcc/Crops/issues/7#issuecomment-172679203
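One way schema versioning could work in practice, as a rough sketch only. It is not the process described in the linked comment. The `schema_version` field, the version numbers, and the renamed field are all hypothetical.

```python
from typing import Callable, Dict

Record = Dict[str, object]


def _v1_to_v2(record: Record) -> Record:
    """Hypothetical change: rename `maturity` to `days_to_maturity`."""
    upgraded = dict(record)  # copy so the caller's record is not mutated
    upgraded["days_to_maturity"] = upgraded.pop("maturity")
    upgraded["schema_version"] = 2
    return upgraded


# Maps "version the record is at" to "function that lifts it one version".
MIGRATIONS: Dict[int, Callable[[Record], Record]] = {1: _v1_to_v2}
CURRENT_VERSION = 2


def upgrade(record: Record) -> Record:
    """Apply migrations one step at a time until the record is current."""
    version = record.get("schema_version", 1)
    while version < CURRENT_VERSION:
        record = MIGRATIONS[version](record)
        version = record["schema_version"]
    return record
```

With something like this, old files can stay in the repository untouched and be upgraded on load, or be rewritten in bulk by a one-off script.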

CSV is easier to edit even for non-techies, spreadsheets can be used.

This is true - but the big disadvantages of CSVs, in my mind, are:

No concept of data types. With YAML we can define that a property is an integer, object, array, etc. This means we can validate that data in the app that uses it. Or provide tools/libraries for validation that apps can use. JSON is a little better than CSV in this regard, but JavaScript is an untyped language in general - so it's not as good as YAML. (see next two comments) ;-)
CSVs are essentially the same thing as a relational database. In order to have many-to-one relationships, you need multiple CSV files that reference one another. Why do that when it can all be represented in a single file?
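To illustrate the many-to-one point, here is a small sketch that joins two CSV "files" (inline strings here) into the single nested record that a YAML or JSON file could hold directly. The column names and sample rows are placeholders chosen for the example.

```python
import csv
import io
from collections import defaultdict

# Two CSV tables that reference one another through crop_id.
CROPS_CSV = "crop_id,name\n1,Tomato\n2,Radish\n"
VARIETIES_CSV = "crop_id,variety\n1,Roma\n1,San Marzano\n2,Cherry Belle\n"


def nest(crops_csv: str, varieties_csv: str) -> list:
    """Join the two tables into one nested record per crop."""
    varieties = defaultdict(list)
    for row in csv.DictReader(io.StringIO(varieties_csv)):
        # Every CSV value arrives as a string, including the ids.
        varieties[row["crop_id"]].append(row["variety"])
    return [
        {"name": row["name"], "varieties": varieties[row["crop_id"]]}
        for row in csv.DictReader(io.StringIO(crops_csv))
    ]
```

The nested result is what a single structured file would contain from the start, without the join step.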

Drupal 8 CMI generates the YAML files it uses. It's helpful cos it is machine readable, but typically most D8 devs won't write it. It does help with diffing though.

Yea, D8 CMI isn't a perfect comparison, because much of it is machine-generated. But one of the reasons they went with YAML was so that the config could easily be read and edited by hand, when necessary.

mstenta commented 8 years ago

JSON is a little better than CSV in this regard, but JavaScript is an untyped language in general - so it's not as good as YAML.

I'm going to take back this statement - if that's alright with everyone. It's not quite as simple as that and I don't think it added much to my argument. :-)

Ultimately, both JSON and YAML can be equally used to define intended data types, to give validators information about how to process the data. So in this case it's not really a big difference.
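A rough sketch of what "giving validators information" might look like. Once a JSON or YAML file is parsed, the app sees the same native values either way, so the check below does not care which format the file used. The schema, field names, and error wording are hypothetical, and JSON is used only because its parser ships with Python.

```python
import json

# Hypothetical schema: field name -> expected Python type after parsing.
SCHEMA = {"name": str, "days_to_maturity": int, "companions": list}


def validate(record: dict, schema: dict = SCHEMA) -> list:
    """Return a list of human-readable problems (empty means valid)."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"{field}: missing")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it for int fields.
        wrong_type = not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        )
        if wrong_type:
            errors.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


good = json.loads('{"name": "Tomato", "days_to_maturity": 80, "companions": ["Basil"]}')
bad = json.loads('{"name": "Tomato", "days_to_maturity": "80"}')
```

Here `validate(good)` returns an empty list, while `validate(bad)` reports the string-typed number and the missing field.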

pmackay commented 8 years ago

No concept of data types

What about using JSON Table Schema and also Data packages if needed?
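As a rough illustration of the idea: a Table Schema style descriptor lists each column with a type, and a reader can use it to turn CSV strings into typed values. The sketch below is a hand-rolled reader, not the official tooling, and its casting rules (especially for booleans) are simplified assumptions. The column names and sample row are placeholders.

```python
import csv
import io

# Descriptor in the general shape of a Table Schema: a list of typed fields.
DESCRIPTOR = {
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "days_to_maturity", "type": "integer"},
        {"name": "frost_tolerant", "type": "boolean"},
    ]
}

# Simplified casters for this sketch; a real implementation has more rules.
CASTERS = {
    "string": str,
    "integer": int,
    "number": float,
    "boolean": lambda text: text.strip().lower() == "true",
}


def read_typed(csv_text: str, descriptor: dict) -> list:
    """Parse CSV text and cast each column using the descriptor's types."""
    types = {field["name"]: field["type"] for field in descriptor["fields"]}
    return [
        {column: CASTERS[types[column]](value) for column, value in row.items()}
        for row in csv.DictReader(io.StringIO(csv_text))
    ]
```

A value that cannot be cast (for example `abc` in an integer column) raises `ValueError`, which is one simple way for an app to reject a bad row.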

Why do that when it can all be represented in a single file?

There are simpler editors (any spreadsheet)

If the schema changes, it's more complex to update many separate files than changing a few column headers in a table.

In order to have many-to-one relationships, you need multiple CSV files that reference one another.

This is a valid issue, and if there is a need for highly structured data, there could be a good case for structured files.

I appreciate your points too, just discussing :)

mstenta commented 8 years ago

No concept of data types

If only I could go back in time... ;-)

andru commented 8 years ago

What about using JSON Table Schema and also Data packages if needed?

Data Packages looks good.

JSON-LD is becoming very widely used for APIs. It's compatible with RDF vocabularies and can express anything RDF/XML can, but in a package that's easier to work with on the clientside. What do you think of it?
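For anyone who hasn't seen it, a JSON-LD document is ordinary JSON plus an `@context` that maps short keys to vocabulary IRIs. The sketch below is only a toy: the `example.org` IRIs and the `daysToMaturity` term are made up for illustration, and `expand_terms` is a drastic simplification of the real JSON-LD expansion algorithm, which produces a different structure.

```python
import json

crop = {
    "@context": {
        "name": "http://schema.org/name",
        # Hypothetical vocabulary term, not from any published ontology.
        "daysToMaturity": "http://example.org/crop#daysToMaturity",
    },
    "@id": "http://example.org/crops/tomato",
    "name": "Tomato",
    "daysToMaturity": 80,
}


def expand_terms(doc: dict) -> dict:
    """Swap each short key for the IRI its context maps it to (toy version)."""
    context = doc["@context"]
    return {
        context.get(key, key): value
        for key, value in doc.items()
        if key != "@context"
    }


# Any JSON tool can read or write the document; no RDF stack is required.
serialized = json.dumps(crop, indent=2)
```

The point of the context is that two datasets using different short keys can still agree on meaning if both map to the same IRIs.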

mstenta commented 8 years ago

I'm curious about JSON-LD - never used it.

Data Packages looks good and follows a similar schema format to the one I proposed in http://github.com/farmOS/CropDB-Spec. The one concern I have is that it is limited to CSV data.

Mageistral commented 8 years ago

my_story Hello, I'm discovering your ideas all at once, across this whole repo, so I may be missing some information that has already been written. This week I was designing a DB at home for a project similar to yours, after browsing the web and finding that existing projects were not as useful as I would like. So, yes, the starting point is a very useful, open, clear, editable, centralized database. Even on French forums I can see many people who have been trying to do this kind of thing in Excel for some time. Otherwise, the websites have a commercial background. I was thinking about a very functional tool, more like hortomatic, where gardeners don't have to spend a lot of time but which is useful and practical, with very structured data and the ability to design the garden with associations. end_of_my_story

I agree with the point about a very fixed structure, one that is extensible later but not in the first round, while keeping extensibility in mind. Maybe I have an old-school point of view, but in my mind the master DB must be an SQL relational database. From it, there should be several ways of extracting the data (CSV, YAML, JSON, etc.) and a web API, mostly JSON for example. To design the DB, instead of creating hundreds of repositories and files, what do you think about a shared database designer document made for that purpose? Do you have a favourite online tool? I don't. It could also be done with an online Excel-type web app. I can manage its creation if you're OK with that (as I'm new), with no guarantee of finding a website that works without registration.
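A minimal sketch of the "relational master DB, several export formats" approach, using an in-memory SQLite database as a stand-in for whichever SQL server would actually be chosen. The table layout and crop values are hypothetical placeholders. YAML export is left out because it would need a third-party library.

```python
import csv
import io
import json
import sqlite3


def export(conn: sqlite3.Connection) -> tuple:
    """Return the crops table as (csv_text, json_text)."""
    cursor = conn.execute("SELECT name, days_to_maturity FROM crops ORDER BY name")
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)

    as_json = json.dumps([dict(zip(columns, row)) for row in rows])
    return buffer.getvalue(), as_json


def demo() -> tuple:
    """Build a tiny placeholder table and export it in both formats."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE crops (name TEXT, days_to_maturity INTEGER)")
    conn.executemany(
        "INSERT INTO crops VALUES (?, ?)", [("Tomato", 80), ("Radish", 25)]
    )
    result = export(conn)
    conn.close()
    return result
```

The same query result feeds every format, so adding another export is mostly a matter of adding another serializer.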

I think having fairly standard, simple technology behind it would make it easier to find help on the project and would improve maintainability.

I would also like to say that, because only a few people are talking here and this kind of project is unstable, the focus should remain on feasibility at every moment. The second important focus is being useful to the "final gardener", but that depends more on the final websites. For the DB project, contributing should be very easy. Last point: I think the less the final websites resemble social media, the more gardeners will contribute, because they will not feel lost under information and discussions.
