openfarmcc / Crops

Discussion on how to best manage a Crops data provider so that it's most widely usable.

Centralized vs Distributed #7

Open mstenta opened 8 years ago

mstenta commented 8 years ago

Let's discuss whether or not the crops database should be centralized or distributed.

Centralized

mstenta commented 8 years ago

As I expressed in #5, I personally feel that we should think about this as a distributed system from the start.

We can define a core schema, and separately we can provide an initial implementation of this schema (a base set of crops).

Other groups can then build off of that base dataset if they want, or they can start their own.

Here are some of the benefits to this approach:

Here are some of the troubles we may run into:

andru commented 8 years ago

I totally agree on the division of a large database into multiple small datasets. I think there is a lot to be said for starting with just naming data and implementing everything else in discrete linked datasets.

A division and distribution of data along subject lines is exactly the kind of thing I've been wanting for Practical Plants for the longest time, so I can transition all naming and environmental data to external open sources, and have Practical Plants just steward a dataset for its niche: plant uses.

I'm not sure how with you I am on the idea of fragmenting the data within subject niches, though. If I understand you right, you're proposing that I could have a crop database where I just list a few plants, you could do the same, and it's up to the end user or some intermediary service to combine those sets (if desired). That sounds like a technical headache and a huge opportunity for fragmentation to me. What happens when we conflict? I say Tomato is X, you say Y, and the user who wants to combine our datasets is left resolving a conflict when all they wanted was an easy-to-use dataset.

Dividing the database into discrete units along subject lines (naming, phenology, uses, environmental, care, etc.) with a linked schema could avoid the pitfall of conflicts while allowing communities to manage only the data relevant to them.

mstenta commented 8 years ago

What happens when we conflict? I say Tomato is X, you say Y

Haha! Finally a reason to post this on Github! "you say tomato, I say tomato" :-)

mstenta commented 8 years ago

Fragmentation is definitely a concern - but it may not be OUR concern. Over time I assume that some datasets will become more widely-accepted than others, and that will inform which pieces become "standardized". And the initial datasets that we provide will all follow the official schema.

Fragmentation would only potentially develop among third-party data sets. And even then, it can be kept under control with versioning.

Ultimately, it will be up to the apps using the data to decide which properties they need - and thus which datasets to use.

Dividing the database into discrete units along subject lines (naming, phenology, uses, environmental, care, etc.) with a linked schema could avoid the pitfall of conflicts while allowing communities to manage only the data relevant to them.

I'd like to understand this better. Can you give an example?

andru commented 8 years ago

I'd like to understand this better. Can you give an example?

The core schema defined by this repo would provide just the data involved with identification:

No idea what that will look like in practical terms, but here's a rough example to communicate the idea...

{
    "id": "uuid",
    "sameAs": ["http://wikipedia.org/entity", "http://usda.org/entity", "http://etc.com/entity"],
    "inheritsTaxonomyFrom": "http://eol.org/32432974293",
    "inheritsTermsFrom": "http://fao.org/foobar"
}

Then, a plant uses database like Practical Plants could attach additional data to an entity by referencing the UUID in the core, using its own namespaced ontology to prevent conflicts, and steward its own dataset, which benefits from interoperability with other datasets sharing the core schema.

{
    "additionalDataFor": "uuid-of-entity-in-core",
    "practicalplants:hasPartsWithUses": {
        "structureOntology:stem": { "term": "edibleCooked", "description": "..." },
        "structureOntology:flower": { "term": "edibleRaw", "description": "..." }
    }
}

Or a crop phenology database would add phenological data:

{
    "additionalDataFor": "uuid-of-entity-to-expand",
    "phenologyData:floweringTriggers": "..."
}

Each discrete database defining its own ontology would prevent property-name collisions.

Since each database is tasked with its own niche, the only collisions that should exist will be between two datasets defining similar data. In that case there is no property-value conflict if they are merged; rather, the resulting crop database will contain duplicate data, and it will be up to the end user or an intermediary to decide what to do about that. Whatever the case, there is no inherent 'winning' and 'losing' data as there is when there is a property-name collision. The opportunity exists for both sets of data to co-exist.

I hope I've explained that well.

For example: a seed company can provide their own dataset that describes all the varieties they sell

So here's my fundamental problem with the model of fragmenting the database within a discrete domain. Say there are a lot of seed companies selling the same genetic material, Cucurbita pepo 'Yellow Crookneck', but each defines varying information on naming, spacing, or whatever other data they want to apply.

As a data consumer, there is value in multiple databases but also a lot of overlap... Where there is overlap, there is potential conflict. How do I decide whose data to use? It's impossible to merge without conflict resolution, because each seed company is using the same schema but defining differing data.

Rather than creating consensus during data creation and supplying a collectively edited database, we have created fragmented data and put the responsibility on the data consumer to figure out which of the conflicting values to take.

I do see the benefit of this model in decentralising stewardship of data, but I'm not sure the payoff is enough to justify the cost. I think there are other avenues we could explore which allow for some parts of the schema to be stewarded without centralisation.

Fragmentation would only potentially develop among third-party data sets. And even then, it can be kept under control with versioning.

Could you explain a bit more your ideas on solving data collision with versioning? Isn't this going to make a bunch of forks?

mstenta commented 8 years ago

Yea - so I think we may be meaning different things when we say "fragmentation" - and perhaps it was my mistake to use that term originally, because it may have caused more confusion. I don't think it will be an issue in practice, but let me describe how I envision the "practice" first...

This example is going to be specifically with farmOS as a use-case, because that's where I'm approaching it from, but hopefully it will serve to illustrate what I'm thinking...

As a farmOS user, I will want to plan all the plantings for my season. One of the first steps in this process is to select the seeds I am going to purchase and plant. These seeds will be from a specific vendor, and they will have a specific variety name - which is often specific to that vendor (ie: "Big Beef (F1)" tomatoes from Johnny's Seeds).

I would import the crop data file for that specific variety into farmOS, and then use the data that comes along with it to plan the expected life cycle of that planting. From that point forward, that specific planting in farmOS will always be an instance of "Big Beef (F1)" tomatoes.

The actual data file for that variety would be provided and maintained by Johnny's, as part of a repository that they maintain - and they would update it every year - if they add new varieties, for example.

The data file for "Big Beef (F1)" tomatoes would extend a base definition of "generic tomato", and it would inherit a lot of the properties from that, but it would provide overrides of specific data, such as "days to maturity". Notice that on Johnny's website, they define the "Days to Maturity or Bloom" of Big Beef as 70 days: http://www.johnnyseeds.com/p-7958-big-beef.aspx

So far... there is no fragmentation occurring. Johnny's maintains their own repository of varieties, and they use a base set as a starting point, using the inherits property to reference the UUID of the generic tomato file.
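To make the inherit-and-override idea concrete, here is a rough sketch of how a resolver could merge a vendor's variety file over a generic base crop. The field names (inherits, days_to_maturity, spacing_cm) are illustrative only, not part of any agreed schema:

```python
# Sketch: a vendor crop file inherits from a base definition and overrides
# specific properties. All field names here are hypothetical.

generic_tomato = {
    "id": "uuid-generic-tomato",
    "name": "Tomato",
    "days_to_maturity": 80,
    "spacing_cm": 60,
}

big_beef = {
    "id": "uuid-big-beef-f1",
    "inherits": "uuid-generic-tomato",
    "name": "Big Beef (F1)",
    "days_to_maturity": 70,  # vendor-specific override, per Johnny's catalog
}

def resolve(crop, index):
    """Merge a crop over its parent chain: child values override inherited ones."""
    parent_id = crop.get("inherits")
    if parent_id is None:
        return dict(crop)
    merged = resolve(index[parent_id], index)
    merged.update({k: v for k, v in crop.items() if k != "inherits"})
    return merged

index = {c["id"]: c for c in (generic_tomato, big_beef)}
resolved = resolve(big_beef, index)
```

Here the resolved variety keeps the generic spacing but carries the vendor's 70-day maturity override.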

Where "fragmentation" might occur (as I'm defining it) - is that Johnny's may also decide to add a property called "disease_resistance_codes", even though that property is not defined in the official schema. That's fine, and maybe Johnny's would use that property internally (ie: for auto-populating their website) and maybe other people would start to use it as well. But they would use it at their own risk - because it's not in the official schema.

Now, say another seed company (let's say High Mowing, for the sake of this example) does something similar, and they add a property called "disease_resistance". Not named the exact same thing... but with a similar purpose. This is sort of what I meant by fragmentation - different providers adding conflicting properties (but notably properties that are not in the official schema).

So... the way I see resolving those kind of inconsistencies over time would be something like this:

Essentially, it's a similar process to W3C standards adoption. Browsers implement the W3C standards (hopefully), and they add things on top. Then, as they come to agreement on what those things are, the W3C can decide to include it in the next version of the spec.

Phew! Sort of a verbose example... but hopefully that helps to clarify. What do you think?

andru commented 8 years ago

Thanks for that explanation, I've got a much better understanding of your proposal now.

I don't see a huge problem with what you're proposing related to the specific example you gave.

Your seed company example is insulated from direct conflict because their data is proprietary, so there is no overlap. The only possibility of collision is property-name collision, which can be handled by collaboration between the organisations and schema versioning, as you pointed out.

I think there are a lot more cases where different databases will make conflicting statements about the same entity. How do you propose that is handled?

E.g. MomnPop seeds co and PermaculturePlants both have an entry for Broccoli 'Purple Sprouting' but each list a different value for some property.

E.g. PlantPhenology.org and CropPhenology.org both have their own databases, and use a shared schema. The databases each have a large amount of unique entity data, but also a large entity overlap. Where entities overlap, the data is sometimes different.

I think your example of the big-ag seed companies with F1 hybrids is a valid use case. By defining an open crop schema, we make it possible for other organisations to steward datasets with non-overlapping entities, and the data consumer could integrate such data with other sources with minimal concern for conflict.

I would hope that the model of each company/group/etc. having their own data silo is discouraged, and that cooperation to build collective datasets would be the preferred and encouraged model.

What are your thoughts?

roryaronson commented 8 years ago

"As a data consumer, there is value in multiple databases but also a lot of overlap... Where there is overlap, there is potential conflict. How do I decide whose data to use? It's impossible to merge without conflict resolution, because each seed company is using the same schema but defining differing data.

Rather than creating consensus during data creation and supplying a collectively edited database, we have created fragmented data and put the responsibility on the data consumer to figure out which of the conflicting values to take."

Multiple truths seem important in many cases. Even things generally regarded as "fact" such as spread may be unreliable due to different growing practices or soil. Multiple sources of data may be correct. And while I agree that cooperation should be encouraged to find "better" truths, this might not always be the best compared to multi-truth data. Allowing the data consumer to pick and choose from different sources, or view all of the data points seems valuable.

I see two user stories here: 1) The data consumer wants one dataset to "win" because they like that data the most, and they're just using other sets to fill in gaps but never to override. 2) The data consumer wants both datasets because they see value in having multiple data points for one property. They choose which data point to follow (or somewhere in between) at the time of consumption.

Would it be possible to build a merging tool where one can specify the type of merge desired?

Another idea I had is that a data consumer could keep their data sources separate and then rank them. This might be a cleaner way to "merge" (ie: use multiple) data sources without actually merging them. Eg: I want values for the property Tomato Aroma. I search through my three data sets: A, B, and C and find values from just A and C. My ranking file says B > C > A. Because there was no value in the B source, I move onto C. The value from the C source is the one I use and A is disregarded. In this way, a data consumer can let their sources improve and manage their sets independently and avoid creating another set just for their app/need. Periodically, they can update each set from its origin.

A ranking file wouldn't have to speak broadly. It could look like this:
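A rough sketch of what such a per-property ranking file and lookup might look like (the property names, rankings, and dataset contents are all hypothetical):

```python
# Sketch: consult sources in ranked order per property and take the first
# value found, without ever merging the datasets themselves.
# All names and values below are hypothetical.

datasets = {
    "A": {"tomato": {"aroma": "mild", "color": "red"}},
    "B": {"tomato": {"color": "deep red"}},  # B has no aroma value
    "C": {"tomato": {"aroma": "strong"}},
}

# The ranking file need not speak broadly: it can rank sources per property.
ranking = {
    "aroma": ["B", "C", "A"],
    "color": ["B", "A", "C"],
}

def lookup(entity, prop):
    """Return (value, source) from the highest-ranked source that has a value."""
    for source in ranking.get(prop, list(datasets)):
        value = datasets[source].get(entity, {}).get(prop)
        if value is not None:
            return value, source
    return None, None
```

With the B > C > A ranking for aroma, B has no value, so the lookup falls through to C and A is disregarded, exactly as in the Tomato Aroma walkthrough above.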

mstenta commented 8 years ago

I think there are a lot more cases where different databases will make conflicting statements about the same entity. How do you propose that is handled?

Well, the way I see it: that's perfectly fine. Ultimately, all crops will probably "inherit" from a base set of crops. So I guess my assumption is that some consensus would quickly build around who has the best/cleanest/easiest-to-extend "base" dataset - and that would become the one that most other datasets sit on top of (via the inherits: [UUID] property).

That's also why I proposed we (the ones who seem to be starting this thing) also provide our own version of a base dataset, alongside the specification of the schema. One that is simple and clean, but covers a good range of common plants. But, importantly, does not claim to be the "canonical" set.

That was my thinking in the two sketch repos I shared earlier:

This would define the spec: https://github.com/farmOS/CropDB-Spec

And this would be the first "base set" of crops that implements the spec: https://github.com/farmOS/CropDB-Base

If we keep the base set general enough, then we give everyone a reason to use it as a base, rather than create their own. So MomnPop Seeds, PermaculturePlants, PlantPhenology.org, and CropPhenology.org could all create their own datasets of crops that "inherit" from crops defined in that base.

And if someone else decides to start their own "base" set - that's their prerogative. But it is the choice of the rest of the community of users which base set (or sets) to inherit from. We can start an initial one, but maybe this will evolve and self-organize in a way that we can't predict. True, this means there can be some messiness, but maybe that's not a bad thing, and will ultimately lead to more involvement, more data, and more possibilities.

I really have no idea where it would go, but that's part of the excitement too. :-)

mstenta commented 8 years ago

Would it be possible to build a merging tool where one can specify the type of merge desired?

So yea - I guess I didn't see an immediate need to merge crops together, but that's just because I'm approaching it from farmOS - which will only be importing individual crops, and using them to build a catalog of "crops that I'm growing". So in my world, there's really no need to merge ever.

But if you are trying to build a wiki-like website that presents all crops ever, and you're trying to pull in every dataset that is ever created, then you might need to figure out merging... or maybe you don't.

But it seems to me that all of those things are tangential to the task of defining a schema and providing a base data set.

What you do with the data, and whether or not you need to merge things, is really in the domain of the app that's using the data.

And yea, if it makes sense to provide some tools alongside the schema and base data set to demonstrate how that can be done, that's great! Not necessarily something I need... but others might.

All great ideas, and I'm glad we're teasing them apart. :-)

roryaronson commented 8 years ago

It seems imperative that for the datasets to be distributed and still be "linked", there need to be ways for the data consumer to either merge sets or pull from multiple sets at once. Otherwise "distributed" would just mean siloed. And as Andru was saying, the task of merging or pulling from multiple places would be up to the data consumer or app developer. So to go the distributed route, it seems we would have to build the tools one would need to fully utilize the distributed data. Otherwise we might as well just go with one canonical, centralized source of data.

Does that make sense? Lol

mstenta commented 8 years ago

Yea that does make sense. At the very least, everyone using the data would need to be able to look up which crop another crop inherits from. And we would need to have some kind of property that describes where to find the parent crop, via URI.
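As a sketch of what such a lookup tool could do, suppose each crop file names its parent by UUID plus a URI where the parent's file lives (the URIs, field layout, and the inherits-as-object shape are all assumptions for illustration; a real tool would fetch over HTTP rather than read from a dict):

```python
# Sketch: walk a crop's "inherits" chain across repositories via URIs.
# REMOTE simulates fetching files from different hosts; all URIs and
# field names are hypothetical.

REMOTE = {
    "https://example.com/base/tomato.json": {
        "id": "uuid-generic-tomato",
        "name": "Tomato",
    },
    "https://example.com/vendor/big-beef.json": {
        "id": "uuid-big-beef-f1",
        "name": "Big Beef (F1)",
        "inherits": {
            "id": "uuid-generic-tomato",
            "uri": "https://example.com/base/tomato.json",
        },
    },
}

def ancestry(uri):
    """Return the chain of crop names from a crop up to its root parent."""
    chain = []
    while uri is not None:
        crop = REMOTE[uri]  # a real tool would HTTP GET and parse JSON here
        chain.append(crop["name"])
        uri = crop.get("inherits", {}).get("uri")
    return chain
```

The same walk is what any consuming app would need in order to resolve inherited properties, regardless of which repository hosts the parent.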

So yea! I agree! Some basic tools are a must! I'd be happy to write the PHP implementations. :-)

pmackay commented 8 years ago

Distributed datasets are exactly what linked data, a.k.a. the semantic web, solves. If the data is in linked data stores, it's possible to easily do queries across multiple stores, combining data together.

andru commented 8 years ago

Distributed datasets are exactly what linked data, a.k.a. the semantic web, solves. If the data is in linked data stores, it's possible to easily do queries across multiple stores, combining data together.

Agreed wholeheartedly. Your experience with the semantic web trumps mine: I know what I know through staggering through data to build Practical Plants and working on other Semantic MediaWiki projects, which sit at the corners of the semantic web. I don't have a great overview, so please jump in and correct me if I say something dumb.

To me this is a question of how the data is distributed.

It's not a question of centralised vs distributed. It's a question of whether we distribute by slicing data horizontally or vertically. For this post I'm going to use those terms.

Horizontally: dividing up a table of entities by grouping rows (entities). Each unit has its author(s) as its 'central' authority. Vertically: dividing up a table of entities by grouping columns (properties). Each unit has its project group as its 'central' authority.


Horizontal slicing

The proposal to slice the data horizontally is that different databases under different ownerships define entities of the same type with equal authority. This is analogous to how taxonomy was performed for a long time before international efforts to unify the data; each national authority providing their own identification of plants within their territory.

There are some great things that come out of this concept:

There are, to me, a lot more negatives:

The proposal to slice the data horizontally adds an additional level to the concept: that each of those databases can be sharded throughout a network, with no central authority to manage entity duplication. Only a shared schema keeps the data compatible.

This is not so problematic if each entity is being used individually. It is potentially problematic when one wants to consider the data as a whole.


Vertical slicing

My preference is to slice data vertically, along the lines of knowledge domains.

There would be discrete database projects for phenology, nutrition, naming, etc. These databases would use linked data heavily so as to defer to existing authorities wherever possible. The emphasis is on making existing data easily consumable, not making authoring new data easier.

A build script would package data into an easy-to-consume flatfile database for data consumers. The data consumer could pick from the selection of databases to build from without fear of conflict because database namespacing ensures no property name conflicts and the nature of vertically slicing the data ensures there is no overlap in values.
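A minimal sketch of such a build step, assuming (hypothetically) that each domain dataset references core entities via an additionalDataFor UUID and namespaces its own properties:

```python
# Sketch: flatten vertically sliced, namespaced datasets into one record
# per entity. Because each dataset namespaces its properties, a plain
# merge cannot produce property-name collisions. Dataset contents and
# namespace prefixes are hypothetical.

uses_data = [
    {"additionalDataFor": "uuid-1",
     "practicalplants:hasPartsWithUses": {"structureOntology:stem": "edibleCooked"}},
]
phenology_data = [
    {"additionalDataFor": "uuid-1",
     "phenologyData:floweringTriggers": "day length"},
]

def build_flatfile(*datasets):
    """Group every record under its core-entity UUID."""
    merged = {}
    for dataset in datasets:
        for record in dataset:
            entity = merged.setdefault(record["additionalDataFor"], {})
            entity.update({k: v for k, v in record.items()
                           if k != "additionalDataFor"})
    return merged

flat = build_flatfile(uses_data, phenology_data)
```

The consumer picks which domain datasets go into the build; since the slices carry disjoint (namespaced) properties, the merge step is mechanical.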

This model allows for data forks, as with GitHub forks, with the express aim that they are for work to be re-integrated into the main project, or that they exist only to amplify the scope of the original database (e.g. a crop edible-uses database has a fork which adds information on edible wild fungi).

Multiple truths are still possible in this model; moreover, those multiple truths are not unplanned conflicts but designed acknowledgments that a property value is contextual.


Our role: data providers or schema creators?

A good chunk of the data we are talking about already exists in databases across the web; some of them form part of the semantic web, others do not but are open, and others still are under copyright or lack an API. As a data consumer I can already draw on these sources to get the data I need; it just leaves me, as the consumer, with the work of finding the sources, figuring out what data I want from each, and handling conflicts.

That, I think, is my biggest concern with the vertical slicing model: what is our aim with this database?

Name data already exists in open form. There already exist open vocabularies for plant physiology; there are queryable resources for plant phenology data, genetics, crop diseases, etc etc.

If we are looking to create a horizontally sliced database to give flexibility to authors to work on their own datasets and still share a common schema, aren't we better off just defining a schema and using those existing sources of data?

The only argument I can see for actual data in this model is the centralisation of an ID authority, and I don't think we should aspire to be such an authority when the UN already has a very comprehensive list of cultivated crops at AGROVOC. Instead, we could follow the lead of Growstuff and encourage authors to point to one or more existing authorities. If we want to design a shared crop schema we could, but again there are existing vocabularies for this which we could re-purpose or suggest.

That's an interesting project, and if that's the consensus of what's wanted then I'm in, but it doesn't solve my needs from the perspective of Hortomatic or Practical Plants. For me it would be a long-term goal, and it would require working with some existing bodies who are working on these kinds of standards to get it right.


On merging

So yea - I guess I didn't see an immediate need to merge crops together, but that's just because I'm approaching it from farmOS - which will only be importing individual crops, and using them to build a catalog of "crops that I'm growing". So in my world, there's really no need to merge ever.

I can't speak to your use case. For Hortomatic, which deals with common garden vegetables and therefore a lot of heirloom genetics, I can easily imagine a situation arising from a sharded network of data where I, as the data consumer, see that I can get a great list of crops from the intersection of 4 different databases, but where those databases also have some overlap. When Tomato 'Fiorentino' is available in 4 different databases, each of which has value to me, it's on me to resolve that: to figure out whose data 'wins' and potentially lose data from another, or to merge the 4 entries. That is not a job I want to do once, let alone multiple times.

Eg: I want to merge A with B, and B should win during conflict.

This makes the assumption B is always better than A, and doesn't allow for A being better than B at XYZ and B being better than A at GHI.

Eg: I want to merge D with E, and allow multiple values for a property.

This doesn't give me as the data consumer any context for those values. Without context, multiple truths are just noise. As Hortomatic, when a user plants Tomato 'Fiorentino' I have to tell them how long until germination, but the only thing I have is two contradictory statements...

"averageDaysTo": {
    "germination": {
        "A": 10,
        "B": [5, 30]
    }
}

when what I need to make use of multiple truths is useful context:

"averageDaysTo": {
    "germination": [
        {
            "value": [5, 30],
            "context": {
                "dataSource": "http://crowd-sourced.com"
            }
        },
        {
            "value": 10,
            "context": {
                "environment": "protected",
                "conditions": {
                    "minimumTemperature": 8,
                    "averageConstantTemperature": 25
                },
                "dataSource": "http://scientfic-observation.com"
            }
        }
    ]
}

That lets me choose which data I show the user. If their growing environment matches the second then I can show them that. If not, I can show them a crowd-sourced range or average.
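A sketch of that selection logic, assuming the contextualised data-point structure just described (the environment field and matching rule are illustrative assumptions):

```python
# Sketch: prefer the data point whose stated context matches the user's
# growing environment; otherwise fall back to a point carrying no
# environment context (e.g. a crowd-sourced range). Field names follow
# the hypothetical structure discussed above.

germination = [
    {"value": [5, 30],
     "context": {"dataSource": "http://crowd-sourced.com"}},
    {"value": 10,
     "context": {"environment": "protected",
                 "dataSource": "http://scientfic-observation.com"}},
]

def pick(data_points, user_environment):
    # first pass: a point whose context matches the user's environment
    for point in data_points:
        if point["context"].get("environment") == user_environment:
            return point["value"]
    # second pass: a point that claims no specific environment
    for point in data_points:
        if "environment" not in point["context"]:
            return point["value"]
    return None
```

A protected-environment grower gets the observed 10 days; everyone else gets the crowd-sourced 5 to 30 day range.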

This is not an intractable problem with the horizontally sliced model: it could be addressed by imposing a schema which requires that all data is sourced or provided with some kind of context. It's solvable, but it puts a lot of overhead on each data author to provide context, and I'm skeptical how much people value contextualising their own data.


Ok that was a lot of text. Hope you all stuck with me, concise communication is a skill I lack :)

In summary...

A horizontally sliced database is ok with me in some limited circumstances (proprietary data about unique entities) but if it were the culture of this project to encourage horizontal slicing, it would not meet my use case for Hortomatic. It would leave me to worry about combining datasets. I am already combining datasets and resolving conflicts and it's no fun!

I think it comes down to this: My motivation is to simplify access to existing crop data, not simplify authorship of new or proprietary data.

Let me know what you all think...

pmackay commented 8 years ago

but it doesn't solve my needs from the perspective of Hortomatic or Practical Plants

Can you express your needs as a set of user stories, on a wiki page or new separate issues? Might help :)

Also, it's cool to know you worked on Practical Plants! I wanted to ask: is there a data dump or API for that site?

Sounds like a common schema that meets people's needs would be a beneficial thing generally.

mstenta commented 8 years ago

Can you express your needs as a set of user stories

Here are a few that are important to me:

In the United States, "Certified Organic" growers are required by law to provide specific records of the exact crops that were grown, and where they were purchased from. There are similar record-keeping requirements in other countries.

The farmOS user story is essentially (sorry to repeat what I said above): a grower wants to import individual specific crops (ie: from a specific seed provider) into a catalog of "crops that they are growing". There is no need to merge data, or consider it as a whole, other than being able to "inherit" data from parent crops.

In the same vein, I can see this data format being used directly by seed producers/sellers as well. By storing their cultivar information in flat files, they can both A) share that machine-readable data with apps that want to use it, and B) use it themselves to auto-populate their websites, catalogs, etc. That's extremely useful, IMO, and as far as I know there is no existing standard shared between producers and consumers of this data.

pmackay commented 8 years ago

In the same vein, I can see this data format being used directly by seed producers/sellers as well

This is quite a similar use case to what Open Referral is trying to enable with local health and human services data.

andru commented 8 years ago

I've added my user stories to #9. @mstenta could you copy yours over there too?

mstenta commented 8 years ago

Sure thing @andru - I'll write it up over there.

Mageistral commented 8 years ago

Sorry to ask this very basic question here. Are you talking about providing the data through GitHub repositories?

mstenta commented 8 years ago

@Mageistral Thanks for your input on #6 !

I am suggesting that we "maintain" the data in a Git repository, yes. But that doesn't necessarily mean that the data will be "served" from there. It could be stored and served from a relational database, or by any other means, really. But I do think that the canonical datasets should ultimately be managed in source control. That hasn't been decided absolutely yet, though.

Mageistral commented 8 years ago

Ok, so as to be able to have datasets in stable versions? Hmm, I'm wondering if that is compatible with a nice webapp for administering the data (because that was more or less my own idea). I would think that the data is supposed to get better and better, so what's the point of versioning? Except for structural changes.

mstenta commented 8 years ago

Actually, I should clarify: I'm suggesting that we maintain the "schema definition" in a Git repository. And I am suggesting that we provide a very simple base dataset, also in a Git repo, that other datasets can build off of. See my comment https://github.com/openfarmcc/Crops/issues/7#issuecomment-172679203 above, which links to two example repos that I set up as a proof of concept. I don't really care how other people store their datasets, in Git or not. That's up to them. But Git provides more than just versioning: it also provides a rich history of changes, and GitHub provides the pull-request workflow for introducing new ideas/discussions.

The key point I am making is that I don't think there should be a single centralized repository of crop data. I think we should encourage lots of distributed repositories, that build off of one another. That's the purpose of this issue ("Centralized vs Distributed").

Hope that clarifies my thoughts!

If you have a moment, would you mind adding your use case to: #9 ?

Mageistral commented 8 years ago

Ok, I'll catch up on the long posts. And I'm laughing, because I wrote more or less the same idea around inheritance/overriding things in my user story!

mstenta commented 8 years ago

That's great! It helps to find where our needs overlap. :-) Thanks for the input!

Mageistral commented 8 years ago

About merging: I'm dealing with this for people data at work, and I can confirm that a good approach is to set priorities on each data source/property, like when you explained that B > C > A. Here, collision detection should not be as crazy as it is for people. A variety name is more or less the same in every catalog, or it's another variety... Another solution could be manual validation: your data merger can say "I have entry ID23 from A and ID29 from B that are the same variety but differ on property xxx." Then you choose which one is correct and store the choice you made for that specific collision.
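Those two mechanisms (default source priorities plus recorded manual decisions) could be combined in something like this sketch, where all identifiers and values are hypothetical:

```python
# Sketch: resolve a conflicting property by first checking a record of
# manually validated collisions, then falling back to source priority.
# Entity names, sources, and values are hypothetical.

priority = ["B", "C", "A"]  # earlier sources win by default

# record of manual decisions: (entity, property) -> winning source,
# stored once so each collision only has to be validated one time
manual_choices = {
    ("Broccoli 'Purple Sprouting'", "spacing_cm"): "A",
}

def resolve_value(entity, prop, values_by_source):
    """Pick a value: manual decision if recorded, else highest-priority source."""
    source = manual_choices.get((entity, prop))
    if source is None or source not in values_by_source:
        source = next(s for s in priority if s in values_by_source)
    return values_by_source[source]
```

The manual record overrides the ranking only where a human has already resolved that specific collision; everything else falls through to the priority list.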