Define goals [meta]

andru commented 8 years ago

Seems to me we might discuss a little what our aims are for the project in order to inform the technical discussion and plans for implementation...

Aside from OpenFarm, who is the database for?
What needs might users have beyond the needs of OpenFarm?
What's the scope of the project... e.g. common agricultural crops? cultivated edible plants? non edible crops? etc.

It would be good to try and get some interested parties together to go over some of this, to make sure we're building something which serves the broadest base possible within the confines of something achievable.

simonv3 commented 8 years ago

Thanks for taking this initiative @andru!

cc @elf-pavlik?

Here's my attempt at some answers.

I see the database being for anyone who wants access to crop data in machine readable form. There's a lot of "encyclopedia" type resources out there, but their machine readability is not that great. I also hope for an easily traversable graph.
Things I imagine using such a data source for: plant identification and basic research (a la pokedex). Cross referencing it with other sources (so maybe it points to for example wikipedia articles). Integration with search engines.
I know it's called "crops", but for it to be the most useful I don't see why we would limit ourselves to things that are deemed edible by society and whether they're actually plants? Do we include livestock? Especially if they have some sort of significance in how you cultivate other crops (simplistic example - chickens keep bugs down).

elf-pavlik commented 8 years ago

Worth keeping in mind: http://aims.fao.org/agrovoc

tkeifer commented 8 years ago

When thinking about goals, I usually try to answer these questions...

What problem am I trying to solve?
Why do I think it's a problem? (evidence to support need)
What makes my attempt better than others?

Applying this methodology I think @simonv3 is onto a good start. We think there is a need for a more machine-friendly database of (at least) plant data, but we should take it a step further to gauge how much effort to put into it. (and getting answers might involve reaching back out to the community)

Who are the other people looking for this data? for what purpose? (researchers, universities, other projects, etc..)
What troubles have they encountered? (complexity, lack of support for their endpoints)
How is our approach going to provide value? (more modular, more accurate data, etc...)

This is a bit "stream of consciousness", but I think these are important things to ask before diving in too deep. The answer may end up being "there is no good reason, it'd just be fun" - which is fine too. :)

elf-pavlik commented 8 years ago

What makes my attempt better than others?

Also how it fits together with other attempts already present in the current ecosystem. IMO the README should list all the other known and relevant efforts. Besides the previously mentioned AGROVOC, I also recall Practical Plants

tkeifer commented 8 years ago

Great point. That site is a great reference, I had not come across it before. I've been studying it for a couple hours and there are a lot of interesting things we can learn from how they've approached it.

The thing that amazes me is the sheer depth and breadth of characteristics there are for even a narrow sampling of plants. Organizing all of that is going to be a real challenge.

I like the focus on "useful plants" - more broad than vegetables, but less difficult to approach than the entirety of the natural kingdom.

andru commented 8 years ago

@elf-pavlik somehow I've not come across AGROVOC before. Thanks for sharing. Agreed on identifying similar projects early. Edit: see the wiki

@tkeifer I like your approach to identifying a problem and addressing needs. I'll try to be guided by it as I get some thoughts out.

Thanks for mentioning Practical Plants; it's a wiki I developed some time back and it's in large part my experience in pulling that project together which motivates a lot of my desire to see better open data available, particularly in the areas of names and physical and environmental data, and to see better tools for collectively stewarding that data.

It's also my work on a wip agro-tech app Hortomatic and my involvement in the ongoing semantic preparation of the Flora of North America which informs some of the direction I'd like to see on the technical side, but that's for a later conversation.

Here are my thoughts on some problems that need to be solved and some goals and ethos for the project... stream-of-consciousness style while getting some of these thoughts out...

Problems to solve / niches to serve

Lower the barrier that projects like FarmBot, GrowStuff, PracticalPlants, Hortomatic, and I'm sure others have come across: a neat idea for an agrotech project which needs some basic plant data stumbles because the open data to support it is not easily available / project starts building database / every project duplicates effort which could be shared
Today, using plant data usually means forking if you want to make changes; contributing needs to be facile to prevent changes accrued by a data consumer being stuck in a fork. A simplified model to accept contributions both singularly and in batch would be a game changer.
A simple and complete dataset of multilingual plant names is surprisingly hard to come across, even for a subset of common agricultural crops
Likewise for data on timing and environmental tolerances (GDD, short day/long day and max/min hours of light to flowering, min survivable temperatures, etc)

Some ideas for Goals

Identify some common plant data needs in bootstrapping agro-tech; keep the niche small
Build a data-set of openly licensed plant data relevant to agriculture, and the tools necessary to manage it
Identify and engage a community of beneficiaries of the project to steward its creation and maintenance
Where building tools for data management, identify other communities who might benefit from similar tools and collaborate

Related thoughts

Nature is tricky and there's rarely one true answer; the project should allow for the multiplicity of truth (see: wikidata)
Distributed > Centralised
When building tools, go small and modular
A small and complete database > A large and stubby one

roryaronson commented 8 years ago

Great to see this conversation happening all :)

With FarmBot, at the very least, we'd like to access a database of common edible crop names and representative icons. While we could create and host that alone, we see a lot of value in pooling the effort to allow many apps to use/create/maintain the data. We're currently planning to use the OpenFarm API for this need, because eventually we want to use OpenFarm guide data as well.

Regarding centralized vs distributed, I think a hybrid approach would be cool. Distributed in the sense that different apps will exist in order to create special data around the crops/do things with the data for that specific application. And centralized in the sense that we maintain a 'canonical' data set that all apps are using/contributing to. As far as I understand, that's why we want to separate the crops db from OpenFarm guides in the first place. Let OpenFarm do guides, Hortomatic do garden planning/tracking, FarmBot do its thing, etc; and the crops db be the shared resource among them all.

roryaronson commented 8 years ago

As far as the types of data this crop db holds, I think it would be neat to not set out any limitations, but rather allow it to grow in any direction based on the overlapping needs of the beneficiaries. So for example, we might start out with just common names. Then we find out that two beneficiaries (FarmBot and Hortomatic) want icons. So those two communities can spearhead that component of the data set. Then we find out that group X and OpenFarm want to share photos. So those two can spearhead that.

Just spit ballin here :)

mstenta commented 8 years ago

Hey everyone! Thanks for getting this conversation started!

I am building a farm management platform called farmOS (http://farmos.org), and I have been putting some thought into designing a standardized data type for storing cultivation data for various crops/varieties/species. It would be great if we could all work together on a common data format that is useful to everyone!

I agree with a lot of the things said above. Here are some thoughts I would add (or echo):

Species in general - Plants are important, and maybe it makes sense to focus on them first, but I will also have a need for other taxonomies like animal species, fungal species, etc. Basically anything that can be farmed (and maybe even that would be thinking too small). It probably makes sense to start with a focus on plants, because it seems that is the most common use-case - but it would be nice to keep a wide scope from the beginning so that we have the potential to grow as time goes on. So I agree that the "Crops" namespace might be too narrow.
Machine-readability and storage - My preference would be that all records are stored first and foremost in a machine-readable format (JSON? YAML?). And I think they should be stored in files, as opposed to a database. This would allow them to essentially be databases unto themselves, which could be shared, imported/exported, and managed in source control. Any application that needs to use them could then either reference the files directly, or import them and translate to its own native data formats.
Inheritance - There are potentially unlimited species/varieties, and most will share the traits of their parents or of more general groups. So I would love to develop some form of inheritance, whereby one species can reference another, and thereby inherit the traits of it. Perhaps even multiple-inheritance would be possible, but we would have to decide on how conflicts are resolved.
Distributed - Ultimately I don't see this being stored in one single collection or repository. There are just too many possible varieties out there. I see many many many different repositories, all with their own purposes. OpenFarm could provide one themselves, with the crops that it cares about. Seed companies could provide databases of their own offerings. And if I decided to breed a new variety of tomato, I could make my own! Thus we could also create libraries of species, where one could pick-and-choose the ones that they care about.
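To make the file-based storage and inheritance ideas above a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the field names (`inherits`, `traits`), the example ids and values, and the conflict rule (a record's own traits override inherited ones; with several parents, later parents override earlier ones) are hypothetical, not an agreed format.

```python
import json

# Hypothetical records, as they might be stored in JSON files.
# Field names and values are placeholders for illustration only.
RECORDS_JSON = """
{
  "solanum-lycopersicum": {
    "inherits": [],
    "traits": {"common_name": "tomato", "min_temp_c": 10, "spacing_cm": 60}
  },
  "solanum-lycopersicum-cherry": {
    "inherits": ["solanum-lycopersicum"],
    "traits": {"common_name": "cherry tomato", "spacing_cm": 45}
  }
}
"""


def resolve_traits(record_id, records, _seen=None):
    """Return the traits of a record merged with those of its parents.

    Assumed conflict rule: later parents override earlier parents, and
    the record's own traits override everything it inherits.
    """
    seen = set() if _seen is None else _seen
    if record_id in seen:
        raise ValueError(f"circular inheritance at {record_id!r}")
    # Track only the current path, so shared ancestors are still allowed.
    seen = seen | {record_id}

    record = records[record_id]
    merged = {}
    for parent_id in record.get("inherits", []):
        merged.update(resolve_traits(parent_id, records, seen))
    merged.update(record.get("traits", {}))
    return merged


records = json.loads(RECORDS_JSON)
cherry = resolve_traits("solanum-lycopersicum-cherry", records)
print(cherry)
```

In this sketch the cherry tomato record only states what differs from its parent and picks up `min_temp_c` by inheritance. A different conflict rule for multiple inheritance would only change the order of the `update` calls.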

So really, I think what I am leaning towards is more of a "standard data format" than a specific collection. Perhaps the definition of that data format could be its own Git repository, and OpenFarm's collection of crops could be a separate one, which the OpenFarm web app refers to for information. (I am not very familiar with the OpenFarm architecture, so I'm not sure if that makes sense or not.)

The main challenge in collaborating on something like this is defining what actual data we all need represented, and where that is the same and where that is different. I'll organize my notes and post some of those details soon.

Excited to continue the conversation!

andru commented 8 years ago

Species in general - Plants are important, and maybe it makes sense to focus on them first, but I will also have a need for other taxonomies like animal species, fungal species, etc ... Distributed - Ultimately I don't see this being stored in one single collection or repository. There are just too many possible varieties out there. I see many many many different repositories, all with their own purposes

I agree with your ideas on plurality. There is rarely a one-size-fits-all approach to data. It seems to me that a good dividing line for purpose here is horticulture; keeping the scope, and by extension the schema, restricted to something manageable.

If the project works we could clone the model for similar databases covering related agricultural domains.

And if I decided to breed a new variety of tomato, I could make my own! Thus we could also create libraries of species, where one could pick-and-choose the ones that they care about.

I think there's absolutely scope for someone's new variety of tomato to be in this crop database, but I agree with your general point on the power of a distributed dataset. To function well I think distribution either needs a central authority (e.g. a git origin), as @roryaronson mentioned, or we would need to come up with some good standards to keep the data compatible, because merging data is messy.

@roryaronson I think you talk a lot of sense when it comes to what the database holds. That we grow it organically based on our needs. There's also some overlap with your thoughts here Mike... if I need GDD data for Hortomatic and nobody else does, then I could start a database with just GDD data, but using some shared standards on naming, schema, etc, to make that database compatible.

andru commented 8 years ago

Actually, rolling on from that last comment. Maybe we should be making multiple, distinct databases which share a common naming scheme, each covering a small purpose.

These would be a product of our collective needs but for example...

Names and taxonomy
Environmental data (tolerances, GDD)
Photos

A shared naming scheme is the tricky part. There are often multiple ways to refer to a crop depending on the nomenclature used, so we'd have to come up with some strict rules on which nomenclature gets used where. In a single database this is usually solved by just using an arbitrary id, but I think the id should be human readable in this case.
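As a rough sketch of what a human-readable id shared between separate databases could look like, in Python. The id rule used here (lowercase ASCII, hyphen-separated scientific name plus optional cultivar) is only one hypothetical option, and the dataset contents are placeholder values, not real data.

```python
import re
import unicodedata


def make_crop_id(scientific_name, cultivar=None):
    """Build a human-readable id from a scientific name and optional cultivar.

    Assumed rule (hypothetical): lowercase ASCII, words joined by hyphens,
    cultivar appended after the species name.
    """
    parts = [scientific_name]
    if cultivar:
        parts.append(cultivar)
    text = " ".join(parts)
    # Strip accents so that ids stay plain ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse anything that is not a letter or digit into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


# Separate, single-purpose datasets keyed by the same id (placeholder values).
names = {"brassica-oleracea-early-purple-sprouting": {"en": "purple sprouting broccoli"}}
environment = {"brassica-oleracea-early-purple-sprouting": {"min_temp_c": -10}}
photos = {}  # a dataset is free to have no entry for a given crop


def lookup(crop_id, **datasets):
    """Collect whatever each dataset knows about one crop id."""
    return {label: data[crop_id] for label, data in datasets.items() if crop_id in data}


crop_id = make_crop_id("Brassica oleracea", "Early Purple Sprouting")
print(crop_id)
print(lookup(crop_id, names=names, environment=environment, photos=photos))
```

The point of the sketch is only that each database can stay small and independent as long as they all agree on how the id is derived. It does not settle which nomenclature feeds the id, which is the tricky part.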

pmackay commented 8 years ago

A few questions/comments:

There is some great goals content in this issue; would it be worth extracting it into a wiki page, separate from the conversation? It also helps to separate out possible tech solutions and preferences from user needs IMHO. Could a set of user stories be developed? e.g.

As a food/farming website developer, I need access to a simple API for food plants that gives key information (fill in specifics) and links to other relevant resources, so that there is less duplication of effort in creating crops datasets.

There's also (at least) 2 key groups of needs:

the types of data captured about food/plants
the means of storage/access/revisions that could be used
any others?

What's missing from these sources, such that they don't meet your needs?

The main challenge in collaborating on something like this is defining what actual data we all need represented, and where that is the same and where that is different.

So what about starting to develop a linked data model? Or simply a set of models and their properties, which could be translated into linked data formats later.

Quick background: I'm quite interested in this area, have worked on OpenFoodNetwork a fair bit, a little on an API for Growstuff and explored food data modelling on Freebase before it was eaten by Google.

tkeifer commented 8 years ago

I was thinking about this conversation over the weekend, trying to think about it at a high-level and had some insights... (bear with my explanation)

It seems to me we are struggling with a question of how to efficiently represent what is essentially plant genetics. For any given living thing really, subtle changes in genetic makeup result in traits which we (as humans) then classify into manageable groups. In our scenario - fruits and vegetables, and all their divisions. The result is a near infinite combination of characteristics that we could potentially need to represent in a database - as gene identification technology advances we could find out our fairly myopic view of the diversity in our vegetables is actually tremendously more than we imagined. While "Tomato A" looks exactly like "Tomato B", it may have a single gene difference that makes it more cold-tolerant and, as such, would be referred to by a completely different name.

Obviously, we can't gene sequence every single vegetable and store that data for the average backyard-gardener to search (though that would be cool), so we need to abstract it out a little bit. So looking at it from the opposite perspective, I said "how do two people currently differentiate between different crops?" I realized that we do this by evaluating a very small amount of traits, most of which are visual. I'll use peppers as an example - If we look at one that is round and orange, we all agree - "that is a habanero." In lieu of hard, scientific fact - we generally go with a loose naming convention by majority rule.

So what is my point? I envision some sort of object-storage mechanism, which allows attributes to be applied and then grouped through a crowd-sourced type of mechanism. "Object A" is placed into the database and a very small core set of attributes (height, sun requirements, spacing recommendations, etc.) is fixed. The rest - specifically names, varieties, etc... are left to a kind of crowd-sourced tagging mechanism. If 50 ppl look at a picture of our object and say "Thats a tomato", we go with tomato. If a tomato expert logs in and says "that's a cold-weather, cherry tomato" we apply the tags. There could be some sort of weighting applied to bubble good tags up.
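A minimal sketch of the weighted tagging idea, in Python. The weights are a hypothetical stand-in for contributor reputation, and the numbers (1 per casual visitor, 20 for an expert) are assumptions picked only to show the mechanism.

```python
from collections import defaultdict


def rank_tags(votes):
    """Sum the weights of all votes per tag and sort tags by total weight.

    `votes` is an iterable of (tag, weight) pairs. The weights are a
    hypothetical stand-in for contributor reputation.
    """
    totals = defaultdict(float)
    for tag, weight in votes:
        totals[tag.strip().lower()] += weight
    # Highest total first; ties broken alphabetically for stable output.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# 50 casual visitors (weight 1 each) and one expert (weight 20, assumed).
votes = [("tomato", 1.0)] * 50
votes += [("cherry tomato", 20.0), ("cold-weather", 20.0)]

ranked = rank_tags(votes)
print(ranked)
```

With these assumed weights the majority tag "tomato" stays on top, while the expert's more specific tags still rank well above anything a single casual visitor could add.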

I don't know if such an object-storage type of mechanism exists, but I thought I'd throw this out there and see what you guys thought.

pmackay commented 8 years ago

If the models, properties, data, etc are to be useful to a wide range of groups, I wouldn't use tagging. It would be much more beneficial to define a strongly typed set of models and properties. However, a system that allows people to enter the information like a wiki, based on those models, could be good.

tkeifer commented 8 years ago

Could you explain "define a strongly typed set of models and properties" in more detail? I'm not sure I follow.

pmackay commented 8 years ago

Basically what's now being debated in #5. So define a model, e.g. Crop, and the properties it can have, e.g. the list started here https://github.com/openfarmcc/Crops/wiki/Crop-data-needs.
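For illustration, a strongly typed model could look roughly like the Python sketch below. The property names here are hypothetical examples and are not taken from the wiki list or from #5.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Crop:
    """A hypothetical, minimal Crop model with typed properties."""

    scientific_name: str
    common_names: Dict[str, str] = field(default_factory=dict)  # language code -> name
    height_cm: Optional[float] = None
    spacing_cm: Optional[float] = None

    def __post_init__(self):
        # Dataclasses do not enforce annotations at runtime, so check here.
        if not isinstance(self.scientific_name, str) or not self.scientific_name:
            raise TypeError("scientific_name must be a non-empty string")
        for name in ("height_cm", "spacing_cm"):
            value = getattr(self, name)
            if value is not None and not isinstance(value, (int, float)):
                raise TypeError(f"{name} must be a number or None")


broccoli = Crop("Brassica oleracea", {"en": "broccoli"}, spacing_cm=45)
print(broccoli)

try:
    Crop("Brassica oleracea", height_cm="tall")
except TypeError as err:
    print("rejected:", err)
```

Unlike free-form tags, a value that does not fit the declared property (here a height of "tall") is rejected up front, which is what makes the data predictable for other consumers.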

mstenta commented 8 years ago

@tkeifer Take a look at #5 - I'm basically suggesting we provide a very small core set of data and let other third-party datasets extend it.

mstenta commented 8 years ago

... that would allow for the basic set of attributes to be defined, and then other people to define "varieties" or "cultivars" that extend it. I think that jibes with what you're saying, yea?

tkeifer commented 8 years ago

I missed that... it looks close though! My experience has been that even the simplest of assumptions around types, variety, etc.. tends to fail hardcore in the plant world, so I was trying to think of a method of reference that was super flexible and didn't involve many rules. I'm interested to see how that idea evolves.

andru commented 8 years ago

I think taxonomy is a useful compromise. We accept in naming a cultivar that the genetics are variable and that a term like Brassica oleracea 'Early Purple Sprouting' represents an arbitrary genetic community which we choose to label for our own needs.

Taxonomy accepts this because the alternative, attempting to model the huge complexity of genetics, is not only practically impossible, but I don't see why it would be desirable.

When a group of plants has genetically diverged from another enough to have different qualities, and defining that group of genetics as distinct is useful to humans, we assign a new name in order that we can communicate about it. To me the lack of 1-to-1 mapping with genetics is not a flaw we need to figure out, it's a fundamentally useful abstraction we can't do without.

"how do two people currently differentiate between different crops?" I realized that we do this by evaluating a very small amount of traits, most of which are visual.

I'd say visual traits are no more important to food crops than any of the other traits. There is also taste, aroma, texture, life stage timing, shelf life, environmental tolerances and preferences, etc. These things cannot be detected and recorded without rigorous study, and I wouldn't trust a digital crowd sourced methodology to get it right.

The rest - specifically names, varieties, etc... are left to a kind of crowd-sourced tagging mechanism. If 50 ppl look at a picture of our object and say "Thats a tomato", we go with tomato. If a tomato expert logs in and says "that's a cold-weather, cherry tomato" we apply the tags.

I find no fault in your method, only that this is more or less how taxonomy has worked for generations and the result is the taxonomy and nomenclature we currently use.

I might be misunderstanding something in your proposal, but I don't see an improvement over current taxonomy. Could you expand on what problem it solves?

tkeifer commented 8 years ago

@andru - I hadn't yet seen the taxonomy that was started, so it was not meant to be a discussion of how another approach would be better really.

To clarify though... I was not suggesting we represent the plants genetically in the repository, only pointing out that in the absence of hard scientific differentiators (looking at pictures on the internet or browsing a farmers market, for example) people revert to visual indicators.

To use your example - the average person talking to a farmer is probably less likely to look at something and say "that is Brassica oleracea" than "that is Early Purple Sprouting" - so it might make sense to approach building a database in a less classification-intensive way than the traditional Family-Genus-Species model. This also takes into account the fact that a large majority of users may be operating below these levels anyway, discussing cultivars and varieties rather than genus and species.

Hope that helps clarify somewhat...

andru commented 8 years ago

the average person talking to a farmer is probably less likely to look at something and say "that is Brassica oleracea" than "that is Early Purple Sprouting"

Thanks for clarifying. Totally agreed that scientific nomenclature is unfamiliar to most people. I think we need to use the taxonomic model for providing structural relationships to the data and have a very extensive list of common names in all languages so that people can access the data in whatever way is familiar to them
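A small sketch of that combination, in Python: taxonomy provides the structure through parent links, and a common-name index in several languages provides the familiar entry points. The record layout is an assumption, and the handful of names is only an illustrative sample.

```python
# Hypothetical taxon records: each points at its parent by scientific name.
TAXA = {
    "Brassicaceae": {
        "rank": "family",
        "parent": None,
        "common_names": {"en": ["mustard family"]},
    },
    "Brassica": {"rank": "genus", "parent": "Brassicaceae", "common_names": {}},
    "Brassica oleracea": {
        "rank": "species",
        "parent": "Brassica",
        "common_names": {"en": ["wild cabbage"], "fr": ["chou commun"]},
    },
    "Brassica oleracea 'Early Purple Sprouting'": {
        "rank": "cultivar",
        "parent": "Brassica oleracea",
        "common_names": {"en": ["purple sprouting broccoli"]},
    },
}


def build_name_index(taxa):
    """Map every common name, in any language, to its scientific name."""
    index = {}
    for scientific_name, taxon in taxa.items():
        for names in taxon["common_names"].values():
            for name in names:
                index[name.lower()] = scientific_name
    return index


def lineage(scientific_name, taxa):
    """Walk the parent links from a taxon up to the top of the tree."""
    chain = []
    current = scientific_name
    while current is not None:
        chain.append(current)
        current = taxa[current]["parent"]
    return chain


index = build_name_index(TAXA)
found = index["chou commun"]
print(found, "->", lineage(found, TAXA))
```

Someone searching by a French or English common name lands on the same record, and the structural relationships still come from the taxonomy. A real index would also have to cope with one common name pointing at several taxa, which this sketch ignores.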

roryaronson commented 8 years ago

@andru I agree that while the scientific nomenclature will likely not be used by most apps or people, it seems the best thing we have for structuring the data. Each app can then choose to only represent the common names if desired.

Mageistral commented 8 years ago

I had some thoughts after the discussion around climates: websites like this one http://worldweather.wmo.int/en/city.html?cityId=1058 or Wikipedia on "TOWN#climate" provide really good info. Another thing is that seed providers give date information in a country/state context. I think it would not be that hard to link climates to a "gardener's profile" and derive dates for the gardener's context from the generic crop.
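One possible way to sketch this in Python: the generic crop stores its timing relative to a climate marker, and the gardener's profile supplies the local date for that marker. Using the average last frost date as the marker is my assumption, and the week offsets and dates are placeholders, not real growing advice.

```python
from datetime import date, timedelta

# Hypothetical generic crop record: windows in weeks relative to the
# average last frost date (negative = before the last frost).
GENERIC_CROP = {
    "name": "tomato",
    "sow_indoors_weeks": (-8, -6),
    "transplant_weeks": (1, 3),
}


def localise_window(weeks, last_frost):
    """Turn a (start, end) offset in weeks into concrete local dates."""
    start, end = weeks
    return (last_frost + timedelta(weeks=start), last_frost + timedelta(weeks=end))


# Hypothetical gardener profile; the frost date would come from climate data.
profile = {"location": "example town", "last_frost": date(2024, 4, 15)}

sow = localise_window(GENERIC_CROP["sow_indoors_weeks"], profile["last_frost"])
transplant = localise_window(GENERIC_CROP["transplant_weeks"], profile["last_frost"])
print("sow indoors:", sow[0], "to", sow[1])
print("transplant:", transplant[0], "to", transplant[1])
```

The crop record stays generic, and only the profile changes from one gardener to the next.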

I know this is not the priority but I wanted to write it down somewhere.

openfarmcc / Crops

Define goals [meta] #2