Define core schema - Githubissues

andru commented 8 years ago

A space to discuss a core schema. Entries could contain additional data, but it would be useful to define a schema that all entries share.

@pmackay and @mstenta already got started at this over at https://github.com/openfarmcc/Crops/wiki/Crop-data-needs, but I'm opening this for some parallel discussion which we can port over to the wiki as we reach consensus.

andru commented 8 years ago

Here's my take on a minimal schema, divided by category.

Entity Identification

A unique identifier

Probably something arbitrary or an established scientific name. In the latter case, we'd have to choose which nomenclature.

Taxonomy and nomenclature

This section defines the crop being described for interoperability with other data sets

Taxonomy

Taxonomic hierarchy

An array of the taxonomic hierarchy, starting at the family level. e.g.

[Solanaceae, Lycopersicon, esculentum]
[Brassicaceae, Brassica, oleracea, italica]

Names

Scientific nomenclature

The names assigned to this plant by different nomenclatures. The ones relevant to a crop database are ICN and ICNCP. There will often be a name for common plants in each nomenclature, and there will be synonyms. Keeping a list of all synonyms is out of scope, but we should list the common ones which some people continue to use. e.g. Broccoli

{
  accepted: [
    ['Brassica oleracea var italica', 'ICN'],
    ['Brassica oleracea Italica Group', 'ICNCP']
  ],
  synonyms: [ ]
}

Common names

By language or locale

{
  en: ['Broccoli', 'Heading Cauliflower', 'Calabrese', 'Purple Cauliflower'],
  en_GB: ['Spear Cauliflower'],
  es: ['Brecol'],
  pt: ['Brócolis']
}
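A minimal sketch of how an application might read the locale-keyed names above. The helper `names_for_locale` and its fallback from a locale such as `en_GB` to the bare language `en` are assumptions for illustration, not something the schema defines.

```python
from typing import Dict, List

# Common names keyed by language or locale, copied from the example above.
COMMON_NAMES: Dict[str, List[str]] = {
    "en": ["Broccoli", "Heading Cauliflower", "Calabrese", "Purple Cauliflower"],
    "en_GB": ["Spear Cauliflower"],
    "es": ["Brecol"],
    "pt": ["Brócolis"],
}


def names_for_locale(names: Dict[str, List[str]], locale: str) -> List[str]:
    """Return locale-specific names first, then names for the bare language.

    Hypothetical helper: the fallback order is an assumption.
    """
    language = locale.split("_")[0]
    result: List[str] = []
    for key in (locale, language):
        for name in names.get(key, []):
            if name not in result:  # avoid duplicates when locale == language
                result.append(name)
    return result


british_names = names_for_locale(COMMON_NAMES, "en_GB")
spanish_names = names_for_locale(COMMON_NAMES, "es")
```

With this fallback, a British user would see 'Spear Cauliflower' first, followed by the general English names.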

Phenology

This section is going to be tricky to define a core schema for, because crops all have different triggers for various life cycle stages.

I'm more or less ok with the idea of a simple 'days to maturity' or 'days to germination' field as a shortcut to this data, as long as we understand that such figures are totally relative to the environment, and that the common numbers you'll find in English growing literature are usually for temperate climates and are totally inaccurate elsewhere.
At the least we should have the means to define those figures relative to different climates somehow. Ideally, though, I think it would be better to have a schema which represents the underlying phenology data which influences the number of days to a certain life stage.

Germination

Low/High temperature boundaries
Cumulative temperature (cold or heat)
Moisture

Fruiting and Flowering triggers

number of days
day length boundary (short day/long day preference)
cumulative temperature (GDD)
temperature boundary
soil moisture
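To illustrate why cumulative temperature can be more portable than a fixed number of days, here is a rough sketch using the simple averaging method for growing degree days (daily mean temperature minus a base temperature, floored at zero). The base temperature, the GDD requirement and the temperature series are made-up illustrative values, and the function names are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def daily_gdd(t_min: float, t_max: float, t_base: float) -> float:
    """Growing degree days for one day: mean temperature above the base, never negative."""
    return max(0.0, (t_min + t_max) / 2.0 - t_base)


def days_until_gdd(
    daily_temps: Iterable[Tuple[float, float]],
    t_base: float,
    required_gdd: float,
) -> Optional[int]:
    """Count days until the accumulated GDD reaches the requirement.

    Returns None if the requirement is never met in the given series.
    """
    total = 0.0
    for day, (t_min, t_max) in enumerate(daily_temps, start=1):
        total += daily_gdd(t_min, t_max, t_base)
        if total >= required_gdd:
            return day
    return None


# Illustrative (min, max) daily temperatures in ºC for two invented climates.
temperate = [(10.0, 20.0)] * 30
warm = [(15.0, 25.0)] * 30

days_temperate = days_until_gdd(temperate, t_base=10.0, required_gdd=45.0)
days_warm = days_until_gdd(warm, t_base=10.0, required_gdd=45.0)
```

The same GDD requirement yields a different number of days in each climate, which is the reason a single 'days to' figure only holds for the climate it was measured in.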

Other phenological properties we might want to consider

leaf stages (industrial farming uses this as a predictive model for harvest timing)
dormancy
seed ripening
...

Environmental tolerance

genetic drought tolerance
minimum survival temperature
pH tolerance to thrive

Technique / unscientific data

I think crop spacing, yield and method of planting should be out of scope for this database. All three are strongly relative to personal preference and environmental conditions.

Ok, that's what I've got for now...

andru commented 8 years ago

Here's a rough example of a crop using something like the above schema

id: aTHneohsT£@23th
inherits: id-for-brassica-oleracea-entry
taxonomicHierarchy: [Brassicaceae, Brassica, oleracea, italica]
names: 
   - [Brassica oleracea var italica, ICN]
   - [Brassica oleracea Italica Group, ICNCP]
synonyms: [ ]
commonNames: 
  en: [Broccoli, Heading Cauliflower, Calabrese, Purple Cauliflower]
  en_GB: [Spear Cauliflower]
  es: [Brecol]
  pt: [Brócolis]
averageDaysTo: 
  harvest: [25, 90]
  flowering: [100, 110]
phenology:
  germination:
    - [45, Growing Degree Days Until Germination]
    - [20, Days Cold Stratification Required] 
  flowering:
    - [16, Ideal Maximum Temperature ºC]
    - [13, Minimum Hours Day Length]
environmentalTolerances:
   - [-12, Minimum Temperature]

Here's an example of why days-to type data is tricky... The data I was using for some of this entry is this:

May be harvested after 40-60 or 50-90 days, depending on variety. (Eswaran) 25-45 days for leaves and 100-110 days for seeds

So, harvest time for the whole plant is 40-60, or 50-90, or for just an outer leaf harvest it's 25-45. That means the full range for a single property is 25-90. Ouch! We could have that overridden to a more specific value in the cultivar entries, assuming we could get that data, and we could add an additional property like

averageDaysTo: 
  totalHarvest: [40, 60]
  partialHarvest: [25, 45]
  flowering: [100, 110]

Not sure how best to model it, and in large part it depends on what data we can get our hands on.
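The arithmetic behind that 25-90 figure can be sketched as below. The `envelope` helper is hypothetical; the ranges are the ones quoted from the source data above.

```python
from typing import Iterable, Tuple

Range = Tuple[int, int]


def envelope(ranges: Iterable[Range]) -> Range:
    """Smallest single range that covers every given (low, high) range."""
    items = list(ranges)
    return (min(low for low, _ in items), max(high for _, high in items))


whole_plant = [(40, 60), (50, 90)]  # varies by variety
leaf_harvest = [(25, 45)]           # outer leaves only

single_property = envelope(whole_plant + leaf_harvest)
whole_plant_only = envelope(whole_plant)
```

Collapsing everything into one property gives the full 25-90 span, while keeping whole-plant and partial harvests as separate properties keeps each range narrower.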

roryaronson commented 8 years ago

@andru this all looks really great to me. I agree that anything that is opinion such as crop spacing should be left out of this - that is what OpenFarm Guides or other apps bring to the table as their core offering.

Do you think the schema should include linking to the same entity on other sites?

Should we use the wikidata identifier when available?

pmackay commented 8 years ago

@andru @roryaronson how familiar are you with linked data? It would be beneficial to use existing properties where possible. For stuff like name, commonName, synonym, etc that should be possible. Would be useful to check what other schemas already cover some of the specific stuff.

Should we use the wikidata identifier when available?

Yes, please read this for a much better description. Although it does depend on what the user needs are, which it would be great to tease out into user stories.

Are any of these crop specific properties regional?

andru commented 8 years ago

how familiar are you with linked data? It would be beneficial to use existing properties where possible.

Only passing familiarity. I get totally confused when it comes time to choose which ontologies to use to express which relationships. In principle, though, I totally agree. Anywhere we can draw from established ontologies we should.

Furthermore, any dataset we can link to instead of maintaining our own data is totally desirable from my perspective, but this has to be balanced with ease of use for the end user. As long as it's open data we could pull it into the flat file with a compile step, so that the source entities link to, e.g., agrovoc for name data, but the final compiled output bundles and transforms that data for ease of use.

What property of which ontology would you use to model these kinds of relationships..?

entity > taxonomy defined by > http://oek1.fao.org/skosmos/agrovoc/en/page/c_1068
entity > nomenclature defined by > http://oek1.fao.org/skosmos/agrovoc/en/page/c_1068

Should we use the wikidata identifier when available?

As an additional ID, for sure. As a primary ID this sounds great in theory, but wikipedia/base doesn't really cover anything beyond the most well-known cultivars, so either we take on the project of making sure it does, or we don't base our ID system upon it.

mstenta commented 8 years ago

@andru this is great! I agree with all your points. I will also add some thoughts:

First, I really think that we should be conceiving of this as a distributed model right from the start. This will require a discussion of its own, so I started a new issue for it: #7

If we take a distributed approach, we can drastically simplify the schema requirements that we start with - only the most widely agreed-upon properties - and then begin a formalized process of suggesting and adopting additional properties moving forward.

An example: if we start by providing a very bare-bones "base" dataset, with only a small set of common plants, then others can start to build their own derivative datasets on top of it (for specific cultivars, varieties, etc). And they can choose to include schema properties of their own creation (within a minimal framework that we provide, of course). Over time, as consensus is built, the standard schema definition can choose to adopt additional properties, when it makes sense to.

Let's discuss these ideas more in the other issue, but I thought it was worth mentioning here, as it might take some of the pressure off of the initial schema definition task.

In regards to your specific points:

I especially love the phenology section - and I agree that it is the trickiest. You sort of touched on one idea that might work: we could "peg" the average numbers to a specific climate, so it is known that they are numbers relative to that climate. With that as a reference, it might be possible to write conversion code that can translate those numbers to other climates. That would be in the domain of the app, though, not the schema. And again, maybe we can leave some of the phenology properties to the creators of derivative datasets in the beginning - but provide a container for them, at least to get it started.

A unique identifier

Probably something arbitrary or an established scientific name. In the latter case, we'd have to choose which nomenclature.

A Universally-Unique Identifier will ensure that there are no potential collisions with crops or cultivars that someone else's database provides, and will make it more certain which parent crop is being referenced in the "inherits" property. There will also need to be a "dataset reference" property, as well, that points to the dataset containing the parent - if it's not the same dataset that the current crop is in.
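A rough sketch of how that could look, assuming entries are plain dictionaries and datasets are dictionaries keyed by ID. The property name `inheritsFromDataset` and the helpers `new_entry` and `find_parent` are hypothetical; only `id` and `inherits` come from the example entry earlier in this thread.

```python
import uuid
from typing import Dict, Optional

Dataset = Dict[str, dict]  # entry ID -> entry


def new_entry(name: str, inherits: Optional[str] = None,
              parent_dataset: Optional[str] = None) -> dict:
    """Create an entry with a random UUID so IDs never collide across datasets."""
    entry = {"id": str(uuid.uuid4()), "name": name}
    if inherits is not None:
        entry["inherits"] = inherits
    if parent_dataset is not None:
        # Hypothetical "dataset reference" property pointing at the parent's dataset.
        entry["inheritsFromDataset"] = parent_dataset
    return entry


def find_parent(entry: dict, own_dataset: str,
                datasets: Dict[str, Dataset]) -> Optional[dict]:
    """Look up the parent entry, defaulting to the entry's own dataset."""
    parent_id = entry.get("inherits")
    if parent_id is None:
        return None
    dataset_name = entry.get("inheritsFromDataset", own_dataset)
    return datasets.get(dataset_name, {}).get(parent_id)


# A bare-bones core dataset and a derivative dataset built on top of it.
brassica = new_entry("Brassica oleracea")
broccoli = new_entry("Broccoli", inherits=brassica["id"], parent_dataset="core")
datasets = {
    "core": {brassica["id"]: brassica},
    "my-cultivars": {broccoli["id"]: broccoli},
}
parent = find_parent(broccoli, "my-cultivars", datasets)
```

Because the ID is random rather than derived from a name, the derivative dataset can reference the parent without both sides first agreeing on a nomenclature.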

I think crop spacing, yield and method of planting should be out of scope for this database. All three are strongly relative to personal preference and environmental conditions.

That's fine with me. But I think there should be some properties to define overall "size", like "height" and "spread" at maturity. Something that could be used to infer what's possible in terms of spacing... A mature basil plant is much different than a mature pumpkin plant.

And I'm still on the fence a little about "yield", but we can leave it out for now. I fully understand that yield is extremely variable depending on climate and growing conditions. But it is less variable when you consider it in relation to other crops. Again the basil vs pumpkin example: I can never expect to get the same amount of yield from a single basil plant as I can from a single pumpkin.

What I am hoping to achieve is to give the programs that use this data a real sense of the plants - even if it is just "average" information - it provides necessary knowledge that can be used to generate plans. User-submitted "Guides" are great, but I'm looking toward enabling automatically-generated guides... as much as is conceivable.

Great conversation everyone! I'm really excited about this!

andru commented 8 years ago

That's fine with me. But I think there should be some properties to define overall "size", like "height" and "spread" at maturity. Something that could be used to infer what's possible in terms of spacing... A mature basil plant is much different than a mature pumpkin plant.

Agreed. I'd also find this super useful. As I'm sure would FarmBot.

But it is less variable when you consider it in relation to other crops.

It's worth thinking about. Such a property can only hope to describe potential yield per area in ideal growing conditions, which, as you say, can be a ballpark figure to help someone calculate a total yield. It seems like it would need to be a property with ranges and contexts to be useful (to me). In line with the ideas of a distributed database, this property could form part of a DB which deals with more relative estimate figures used to advise growers, as opposed to factual data.

mstenta commented 8 years ago

In line with the ideas of a distributed database, this property could form part of a DB which deals with more relative estimate figures used to advise growers, as opposed to factual data.

Yes, and maybe we would even see "regional" databases start popping up - which derive from the core dataset, but override the phenological properties for their specific region.
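As a sketch of what such an override could mean in practice, assume entries are nested dictionaries and a regional dataset lists only the properties it wants to change. The `apply_overrides` helper and the regional numbers are invented for illustration; the core values are taken from the example entry earlier in this thread.

```python
from copy import deepcopy


def apply_overrides(base: dict, overrides: dict) -> dict:
    """Return a copy of base with overrides applied, merging nested dicts key by key."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


core_broccoli = {
    "commonNames": {"en": ["Broccoli"]},
    "averageDaysTo": {"harvest": [25, 90], "flowering": [100, 110]},
}

# Hypothetical regional dataset: narrows the harvest range for its climate only.
regional_override = {"averageDaysTo": {"harvest": [50, 70]}}

regional_broccoli = apply_overrides(core_broccoli, regional_override)
```

Properties the regional dataset does not mention, such as the names and the flowering range, fall through unchanged from the core entry.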

Mageistral commented 8 years ago

I read a bit about the problems in merging the data. If we want to minimize them, it'll be hard to avoid links to climate types, at least to a few big families, to narrow the ranges for things like germination or other durations. For example, in France alone there are 3 or 4 big climate types that can skew the ranges a lot. Depending only on whether you are in the north or the south, seeding can be delayed by 1 month ... Or we could handle this with a mean value plus a % (people know whether they live in a cold or hot place compared to the mean), I think, or maybe not ...

mstenta commented 8 years ago

Yea, for the most part I think the data should be as climate-agnostic as we can - and leave those kinds of decisions (ie: when to seed) up to the application that's using the data. It's tricky to draw the line, but I am in favor of starting with a very small set of standard data properties and then allowing new ones to be discussed/debated and maybe added to the schema one-by-one moving forward.

Mageistral commented 8 years ago

That's the right way: start with everything we agree on as a basis, without a doubt. Can we use a data modeling tool? The most interesting one I found is dbdesigner.net (2 database models, 30 tables per model). Or we could build and fork an SQL repository and import it into MySQL Workbench, for example, when we want a visual view of it? Because after a while it will be hard to picture the schema with YAML. I don't know how you generally work. On my side I'm working alone most of the time, so I don't have to think about sharing.

mstenta commented 8 years ago

We started sketching out the general objects and properties in a wiki: https://github.com/openfarmcc/Crops/wiki/Crop-data-needs

But that was just a first sketch/brainstorm, and hasn't received much attention or modification.

The idea behind using YAML or JSON was to make something that was database agnostic, so could be shared between many different software platforms.

But if mocking something up in MySQL is helpful to you, by all means! Ultimately it doesn't matter a whole lot what format the data is represented in, because it can be stored in a simple format and then imported into just about anything. I think that's the hope anyway.

Mageistral commented 8 years ago

Ok, I'll work on it this evening, if "evening" makes any sense in this international discussion! At least we can easily agree, without any thought to implementation, on what we want/need, you're right.

I'm thinking SQL because I'm used to it and, for example, when I see properties like "days to germination" and "days to maturity" my first thought is to create tables "timings" and "timing types", to be able to add another "days to XXX" without touching the DB structure. Maybe it's too much.
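A minimal sketch of that idea, using Python's built-in sqlite3 module with an in-memory database rather than MySQL. The table and column names (`crops`, `timing_types`, `timings`, `min_days`, `max_days`) are assumptions for illustration, and the flowering range is taken from the example entry earlier in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE crops (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE timing_types (
    id    INTEGER PRIMARY KEY,
    label TEXT NOT NULL UNIQUE
);
CREATE TABLE timings (
    crop_id        INTEGER NOT NULL REFERENCES crops(id),
    timing_type_id INTEGER NOT NULL REFERENCES timing_types(id),
    min_days       INTEGER NOT NULL,
    max_days       INTEGER NOT NULL,
    PRIMARY KEY (crop_id, timing_type_id)
);
""")

conn.execute("INSERT INTO crops (id, name) VALUES (1, 'Broccoli')")
conn.executemany(
    "INSERT INTO timing_types (id, label) VALUES (?, ?)",
    [(1, "germination"), (2, "maturity")],
)

# Adding another "days to XXX" is a new row, not a change to the table structure.
conn.execute("INSERT INTO timing_types (id, label) VALUES (3, 'flowering')")
conn.execute(
    "INSERT INTO timings (crop_id, timing_type_id, min_days, max_days) "
    "VALUES (1, 3, 100, 110)"
)

rows = conn.execute("""
    SELECT tt.label, t.min_days, t.max_days
    FROM timings AS t
    JOIN timing_types AS tt ON tt.id = t.timing_type_id
    WHERE t.crop_id = 1
""").fetchall()
```

The same shape could be exported to YAML or JSON as a list of labelled ranges, so it does not conflict with keeping the shared data database-agnostic.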

0xmichalis commented 7 years ago

Great discussion. Any news on this front?

You may want to consider following the OpenAPI spec for defining the API.

simonv3 commented 7 years ago

There hasn't been much progress on this as a separate entity, though OpenFarm has just been forging ahead with its crop endpoints.

mstenta commented 7 years ago

Not much progress yet - I've been focused on more general farmOS development recently - but it is still my plan to pick this back up when I get farther along, and hopefully we'll all still be open to sharing schemas/datasets at that point.

openfarmcc / Crops

Define core schema #5
