Starting rough: a versioned JSON dump

andru commented 8 years ago

@simonv3 What are your thoughts on getting this repo rolling as a JSON file of crops?

I think you already did a bunch of data scraping from openly licensed sets for OpenFarm, and I've done the same for Hortomatic. I think we should get something rough rolling with this for now...

Proposal:

We dump our sources in this repo: open data sets, website scrapings, etc
Work on some simple command line scripts that combine those data sources and spit out a single JSON file per crop[1]
Work on some simple command line build tools to take the source files and spit out a big JSON crop dump
For now, edits to the database can be made by editing the individual source files and rebuilding
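A minimal sketch of what the build step in the proposal could look like, assuming one JSON file per crop in a source directory. The function name `build_dump` and the file layout are hypothetical; nothing like this exists in the repo yet.

```python
import json
from pathlib import Path


def build_dump(source_dir: Path, output_file: Path) -> dict:
    """Combine every per-crop JSON file in source_dir into one big dump.

    Assumption: each file is named after the crop's identifier, so the
    file stem is used as the key in the combined dump.
    """
    crops = {}
    # Sort the files so the output is stable and diffs cleanly under version control.
    for path in sorted(source_dir.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            crops[path.stem] = json.load(handle)

    # Write the dump outside source_dir, otherwise the next build would read it back in.
    with output_file.open("w", encoding="utf-8") as handle:
        json.dump(crops, handle, indent=2, sort_keys=True)
    return crops
```

Editing a crop would then mean changing its source file and running the build again, which keeps every change visible in the repo history.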

What do you think? Could this model work for OpenFarm for the time being?

@mstenta you mentioned you've already got some crop data going for FarmOS, could this model work for you?

1: By which I mean a taxonomic ID of some kind, not common names... species, variety, cultivar... there will be duplicates because horticultural naming is a mess, but it should get us close to something unique

simonv3 commented 8 years ago

The main issue I can think of is that we'd be splitting our dataset at this moment, and it would evolve separately on OF until this one becomes more usable as an endpoint. However, I don't think that we get enough edits at the moment for that to be a real concern.

We could also build some scripts that check a variety of sources and aggregate that data, then let humans check merge conflicts?
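As a rough illustration of that aggregation idea, here is a sketch that merges one crop's records from several sources and sets aside any field where the sources disagree, so a human can resolve it. The function name, the source names used as keys and the field values are all assumptions made for the example.

```python
from __future__ import annotations


def merge_sources(sources: dict[str, dict]) -> tuple[dict, dict]:
    """Merge the records that several sources hold for one crop.

    Returns (merged, conflicts). A field lands in `merged` only when every
    source that provides it agrees on the value. Otherwise it is listed in
    `conflicts` with each source's value, for a human to review.
    """
    # Collect, per field, the value reported by each source.
    seen: dict[str, dict[str, object]] = {}
    for source_name, record in sources.items():
        for field_name, value in record.items():
            seen.setdefault(field_name, {})[source_name] = value

    merged, conflicts = {}, {}
    for field_name, by_source in seen.items():
        values = list(by_source.values())
        if all(value == values[0] for value in values):
            merged[field_name] = values[0]
        else:
            conflicts[field_name] = by_source
    return merged, conflicts
```

For example, if two sources agree on a label but report different days to maturity, the label is merged and the days to maturity show up as a conflict.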

mstenta commented 8 years ago

@andru I like it! Getting started with at least a sketch is the best first step. And it will help to identify where the commonalities are.

I agree that each crop/variety should be a separate file. See my comment here: https://github.com/openfarmcc/Crops/issues/2#issuecomment-171755229

I would also suggest that we consider YAML instead of JSON. The Drupal community recently chose YAML over JSON for all of its configuration management. I'm not familiar with all of the reasons (I'm sure there are lots of comparisons out there) - but one that stood out to me is that YAML can have comments embedded in it. I do love comments. :-)

mstenta commented 8 years ago

It looks like there are options for converting YAML to JSON in Javascript, as well. So I don't think it would be an impediment to JS-only apps. What do you think?

https://nodeca.github.io/js-yaml/

andru commented 8 years ago

I think YAML could be a good fit, since we're talking about hand-editing files for now. Fewer braces, but whitespace-sensitive markup can be confusing for some people too. I don't have a clear preference. Whether we go with YAML or JSON for the source files, we should look into options for validation - maybe there's a GitHub pull request integration that can handle it?

@simonv3 is there a way you could keep a log of changes made to the data at OpenFarm which, depending on the quantity, could be manually applied to the repo or scripted?

From the perspective of Hortomatic, in the short term I'll be using the data as read-only.

simonv3 commented 8 years ago

Making crops in OF read-only for now is an option, but I'd have to discuss that with the other people working on the project. Thinking about it - the main editing that's been happening on crops is actually linking to Wikipedia and uploading images, none of which is really "crop" information.

I'm personally cool with YAML.

pmackay commented 8 years ago

I'm curious, what's the goal? What will the list of crops be used for?

simonv3 commented 8 years ago

@pmackay The goal is to give a bunch of services that use crop data a consistent crop knowledge base. For example, FarmOS, OpenFarm and Hortomatic would all be able to draw from the same "crop" data set.

There's this issue, which attempts to answer those questions: https://github.com/openfarmcc/Crops/issues/2

mstenta commented 8 years ago

Hey everyone! I sketched up two quick proof-of-concept repositories, to demonstrate sort of what I'm thinking. It's not meant to be a "final solution" - I just find it easier to get my ideas out in code sometimes. And maybe it can provide a starting point for further conversation.

The two repositories are:

https://github.com/farmOS/CropDB-Spec
https://github.com/farmOS/CropDB-Base

CropDB-Spec serves as a place to define the data specification. It basically just has two files: cropdb.schema.yml, which defines the basic schema of a crop YAML file; and db/example.yml which is an example crop YAML file that contains comments about each field/value.

CropDB-Base serves as an example of an actual crop collection that implements the spec. I just added a single crop file called "tomato.yml" as an example, but we could start building out more if you like this approach.

The way I see "crop collections" is: perhaps we can provide a "base" collection that contains very general information about a set of very common crops. But other people could create their own sets for more specific ones - ie: seed producers could create sets that have files for each of their available varieties/cultivars. And they could use the "base" set as a starting point - utilizing the "inherits" field I proposed.

So for example, Johnny's Seeds sells a Tomato variety called "Big Beef" (http://www.johnnyseeds.com/p-7958-big-beef.aspx). In their data set, they could create a file called tomato.big_beef.yml (or something like that) and specify in there that it "inherits" from the base tomato.yml file. But they could also include a line in that file that overrides the "days_to_maturity" and set it to 70, because it's different from the default 60 defined in tomato.yml.
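To make the override idea concrete, here is a small sketch of how a consumer might resolve "inherits" when loading a collection. It is not code from either repository: the `resolve` function is hypothetical, and for simplicity it assumes "inherits" points at the parent's id rather than its uuid.

```python
from __future__ import annotations

import copy


def resolve(crop_id: str, collection: dict[str, dict]) -> dict:
    """Return a crop's data with values filled in from the crops it inherits from."""
    # Walk up the inheritance chain, from the crop itself to its root ancestor.
    chain = []
    seen = set()
    current = crop_id
    while current is not None:
        if current in seen:
            raise ValueError(f"inheritance cycle at {current!r}")
        seen.add(current)
        record = collection[current]
        chain.append(record)
        current = record.get("inherits")

    # Apply the root first so that more specific crops override its values.
    data = {}
    for record in reversed(chain):
        data.update(copy.deepcopy(record.get("data", {})))
    return data


# Hypothetical collection mirroring the tomato / Big Beef example above.
COLLECTION = {
    "tomato": {
        "label": "Tomato",
        "data": {"days_to_maturity": 60, "frost_tolerance": "not tolerant"},
    },
    "tomato.big_beef": {
        "label": "Big Beef",
        "inherits": "tomato",
        "data": {"days_to_maturity": 70},
    },
}

print(resolve("tomato.big_beef", COLLECTION))
```

Here Big Beef ends up with its own days_to_maturity of 70 while still picking up frost_tolerance from the base tomato.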

Again, this is all just a sketch - meant to convey some possible ideas and get your feedback. I haven't implemented any actual code to use these files, nor do I have much experience with YAML - so there may be things wrong - but hopefully it at least makes sense from a conceptual point of view.

What do you think?

mstenta commented 8 years ago

If anyone wants commit access to those repos, let me know! Feel free to bang on it, propose changes, etc.

Or, if it's completely different from what you're thinking - we can throw them out completely - but this is roughly what I am going to need in farmOS. :-)

simonv3 commented 8 years ago

I want to put a link to the datapackages set of tools here: https://www.npmjs.com/search?q=datapackage

http://dataprotocols.org/data-packages/

Your spec and implementation files reminded me of it @mstenta, and there's a group of well defined tools for this already - it's probably worth just reading up on them and seeing what they do.

pmackay commented 8 years ago

@mstenta would it be possible to start by capturing the models and properties you need? Separately from the data format? (wrote a bit more on here https://github.com/openfarmcc/Crops/issues/2#issuecomment-172157952).

mstenta commented 8 years ago

Thanks @simonv3 ! That looks like a good guide to follow and learn from! I'll spend some time familiarizing myself with it.

The format and structure I used for the YAML was loosely based on the format Drupal 8 is using for configuration storage. I'm sure there's some overlap in the concepts so it would be helpful to identify those.

@pmackay - Definitely! I agree starting a wiki to sketch out the properties is a good next step. So far, in the YAML sketch I made, the "crop" model looks something like this:

id: tomato
uuid: [uuid]
inherits: [uuid]
label: 'Tomato'
data:
    days_to_maturity: 60
    frost_tolerance: not tolerant

Just a start... I'm starting to compile a list of other data properties that I plan to use. Should we start a wiki to compile them?
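For apps that load these files, the same model could be captured as a typed record. This is only a sketch: the `Crop` class is hypothetical, and treating id, uuid and label as required fields is an assumption, not something the spec states.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Crop:
    """In-memory form of one crop file, following the sketch above."""

    id: str
    uuid: str
    label: str
    inherits: Optional[str] = None  # uuid of the parent crop, if any
    data: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "Crop":
        """Build a Crop from a parsed YAML/JSON mapping, checking required keys."""
        # Assumption: these three keys are mandatory in every crop file.
        missing = [key for key in ("id", "uuid", "label") if key not in raw]
        if missing:
            raise ValueError(f"missing required fields: {', '.join(missing)}")
        return cls(
            id=raw["id"],
            uuid=raw["uuid"],
            label=raw["label"],
            inherits=raw.get("inherits"),
            data=dict(raw.get("data", {})),
        )
```

Keeping the free-form properties under "data" means new fields can be added without changing the class.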

pmackay commented 8 years ago

Want to fill out more info here https://github.com/openfarmcc/Crops/wiki/Crop-data-needs?

mstenta commented 8 years ago

And just to be clear: my current use-case is specifically to build a set of files that can be imported into farmOS. Within farmOS, users will be able to plan out their plantings via a "Planting Wizard", which will use the data in these files to auto-generate tasks with specific dates. The "frost tolerance" and "days to maturity" that I included in the schema are both useful for that specific purpose.
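A rough sketch of how those two fields could drive task dates. This is not how farmOS implements the Planting Wizard; the function name `plan_planting`, the task names and the rule of delaying frost-sensitive crops until the last frost date are all assumptions for illustration.

```python
from __future__ import annotations

from datetime import date, timedelta


def plan_planting(
    crop_data: dict, planting_date: date, last_frost_date: date
) -> list[tuple[date, str]]:
    """Generate dated tasks for one planting from the crop's data fields."""
    # Assumed rule: a crop that does not tolerate frost is not planted
    # before the last expected frost.
    if (
        crop_data.get("frost_tolerance") == "not tolerant"
        and planting_date < last_frost_date
    ):
        planting_date = last_frost_date

    # Harvest is expected days_to_maturity days after planting.
    harvest_date = planting_date + timedelta(days=crop_data["days_to_maturity"])
    return [(planting_date, "plant"), (harvest_date, "harvest")]
```

With the tomato values from the sketch (60 days to maturity, not frost tolerant), a planting requested before the last frost date is pushed back to that date and the harvest task lands 60 days later.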

@andru would be able to use these files for Hortomatic, as well. And OpenFarm.cc could use them as a basis upon which guides could be built. It would also help to accomplish your goal in #1 I think.

mstenta commented 8 years ago

Great! Thanks @pmackay - I will start adding more to that...

mstenta commented 8 years ago

@simonv3 - I really like how the datapackages format is put together. That would mean that the crop sets would be CSV files, too - which is good - lots of things can read CSV. :-)

Do you know if it can handle other formats too? Is YAML out? I don't really have strong opinions on the format at this point - just curious what the options are.

Question: is it limited to flat single-row data? In other words: if we discovered that we needed to represent nested objects somehow, or many-to-one relationships, do you know if that's possible with datapackages?
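One possible workaround, if nested data did have to fit into flat CSV rows, is to flatten nested keys into dotted column names. The sketch below shows the idea only; it is not a feature of the datapackages tools, and the `flatten` and `to_csv` helpers are hypothetical.

```python
import csv
import io


def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def to_csv(records: list) -> str:
    """Write flattened records as CSV text, one row per crop."""
    rows = [flatten(record) for record in records]
    # Use the union of all keys as columns; missing values become empty cells.
    columns = sorted({column for row in rows for column in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

This handles nested objects, but many-to-one relationships would still need a separate file or table.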

I don't know if that will be necessary - I suppose we'll see what comes together in https://github.com/openfarmcc/Crops/wiki/Crop-data-needs

roryaronson commented 8 years ago

Great conversation all!

"And just to be clear: my current use-case is specifically to build a set of files that can be imported into farmOS. Within farmOS, users will be able to plan out their plantings via a "Planting Wizard", which will use the data in these files to auto-generate tasks with specific dates."

^ This is pretty much what I need for FarmBot :+1: Though we were hoping to use OpenFarm Guides as the main source of data.

roryaronson commented 8 years ago

Resources I've put together:

andru commented 8 years ago

Great to see the ball rolling!

"That would mean that the crop sets would be CSV files, too - which is good - lots of things can read CSV. :-)"

I worry that CSV would tie us to a schema. The way I see things we need to define a core schema while not imposing a limit on additional data fields.

@roryaronson That scientific crop traits spreadsheet is great - what's the source ontology?

@mstenta I think the discussion over a common schema could use its own issue, so I've started it off with my thoughts over at #5

roryaronson commented 8 years ago

@andru I don't remember anymore cause I made that list like a year ago. It's from a lot of sources cobbled together. I think I just googled "plant traits list" and copy-pasted from like 100 places haha

mundotazo commented 8 years ago

I second using YAML. It's readable.

The USDA has CSV files for plants. http://plants.usda.gov/java/

If the seed varieties could be cross referenced with seed vendors it would be really helpful. http://www.organicseedfinder.org/

openfarmcc / Crops

Starting rough: a versioned JSON dump #4