kaiiam / mifc

A minimum information standard checklist formalizing the description of food composition data and related metadata.
MIT License

Additional Metadata Fields for Compatibility with FDC and PTFI #19

Open GregLyonJIFSAN opened 1 week ago

GregLyonJIFSAN commented 1 week ago

There are some fields that MIFC would need to store metadata used by Food Data Central and Periodic Table of Foods Initiative. These would fall under the food slots category.

Fields we need that are represented in Food Data Central sample data:

- food_acquisition_store_name
- food_brand_name
- food_sell_by_date
- food_sample_lot_number
- food_acquisition_label_weight
- food_acquisition_label_weight_unit

Fields we need that are in Periodic Table of Foods Initiative metadata:

- food_breed_or_cultivar (e.g. "sockeye")
- food_organism_name (e.g. "Oncorhynchus nerka")

GregLyonJIFSAN commented 6 days ago

Per our discussion this morning, we should also add:

food_country_of_origin

This corresponds to the PTFI field "Food Origin Location" which contains a two-letter country code.

kaiiam commented 2 days ago

Thanks again @GregLyonJIFSAN for submitting this issue! I'll tackle these one at a time.

food_country_of_origin

Yes, agreed, we need a field like this. I also came across it while parsing SR but didn't get around to adding it to the schema. I was thinking of food_origin_country to have fewer characters in the label; perhaps we need to write a style guide for attribute names.

The more important issue is whether to go with ISO 2- or 3-letter country codes. I am pretty convinced by the arguments in https://andrewwhitby.com/2021/01/08/the-right-country-codes/ that the 3-letter codes are better (basically, they are less ambiguous than the 2-letter codes). If anyone thinks the 2-letter code is better, though, please speak up. I'll make a new issue for this. My thinking was to have this be an enum in MIFC with all the ISO 3-letter codes and their names. Perhaps we can also map those to their ISO 2-letter codes, so that MIFC contains all the information needed to map data that uses the 2-letter code (such as the PTFI data).
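A minimal sketch of what such an enum plus a 2-letter mapping could look like, in Python. The names `CountryAlpha3`, `ALPHA3_TO_ALPHA2`, and `from_alpha2` are hypothetical and not part of MIFC, and only a handful of ISO 3166-1 codes are listed for illustration.

```python
from enum import Enum


class CountryAlpha3(Enum):
    """Illustrative subset of ISO 3166-1 alpha-3 codes (hypothetical enum name).

    The member name is the 3-letter code; the value is a country name.
    """
    USA = "United States of America"
    CAN = "Canada"
    MEX = "Mexico"
    DEU = "Germany"
    JPN = "Japan"


# Mapping from the 3-letter code to the corresponding ISO 3166-1 alpha-2 code.
ALPHA3_TO_ALPHA2 = {
    CountryAlpha3.USA: "US",
    CountryAlpha3.CAN: "CA",
    CountryAlpha3.MEX: "MX",
    CountryAlpha3.DEU: "DE",
    CountryAlpha3.JPN: "JP",
}

# Reverse lookup for data sources that only carry the 2-letter code.
ALPHA2_TO_ALPHA3 = {a2: a3 for a3, a2 in ALPHA3_TO_ALPHA2.items()}


def from_alpha2(code: str) -> CountryAlpha3:
    """Convert a 2-letter code (e.g. from PTFI data) to the 3-letter enum member."""
    return ALPHA2_TO_ALPHA3[code.strip().upper()]
```

With this shape, a source that stores "US" could be converted with `from_alpha2("US")`, and the stored value would be the `USA` member.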

kaiiam commented 2 days ago

food_acquisition_store_name

I think we should follow the pattern we have with food_acquisition_location_type, whose FoodAcquisitionLocationTypes enum (taken from PTFI's metadata sheet) includes places other than just grocery stores, e.g. a farm field, biobank, etc. So for this I propose we go with food_acquisition_location_name, which could be the name of a store (e.g. Safeway).

food_brand_name

I think this makes sense as is. One could perhaps argue for company name instead, but I think brand is understandable to many people around the world.

food_sell_by_date

MIFC currently has food_expiration_date, but at least in the USA there are several dates associated with foods; see this post from the USDA Food Safety and Inspection Service, where they describe "Best if Used By/Before", "Sell-By", "Use-By", and "Freeze-By" dates. This publication from UConn describes more. I think getting all of these in could be its own issue too. For now I can add food_sell_by_date.

kaiiam commented 1 day ago

food_sample_lot_number

Let's go with the label food_lot_number.

Description: A string denoting the identifying lot number assigned by a manufacturer to a particular quantity or lot of the sampled primary food material.

Comment: Can also be referred to as lot code.

kaiiam commented 1 day ago

food_acquisition_label_weight

Let's go with food_label_weight, as we're using acquisition for when a primary food material is sampled from a point of collection (e.g. bought by researchers at a store).

Description: A float denoting the weight (or mass on Earth), as specified on the product label, of a sampled primary food material.

Comment: This field should be used along with food_label_weight_unit to express the units the food_label_weight was measured in.

food_acquisition_label_weight_unit

Let's go with food_label_weight_unit.

Description: A unit code representing the unit of measurement in which a food_label_weight is measured.
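As a rough illustration of how the two label-weight slots depend on each other, here is a small Python sketch. The class name `FoodLabel` is hypothetical, and the thread does not say which unit-code vocabulary MIFC would use, so the `"g"` in the usage note is only an assumed example value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FoodLabel:
    """Hypothetical container for the two proposed label-weight slots."""
    food_label_weight: Optional[float] = None
    food_label_weight_unit: Optional[str] = None  # unit code; vocabulary is an assumption

    def __post_init__(self) -> None:
        has_weight = self.food_label_weight is not None
        has_unit = self.food_label_weight_unit is not None
        # A weight without a unit (or a unit without a weight) is not interpretable.
        if has_weight != has_unit:
            raise ValueError(
                "food_label_weight and food_label_weight_unit must be given together"
            )
        if has_weight and self.food_label_weight < 0:
            raise ValueError("food_label_weight must not be negative")
```

For example, `FoodLabel(454.0, "g")` would be accepted, while `FoodLabel(food_label_weight=454.0)` would raise a `ValueError` because the unit is missing.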

kaiiam commented 11 hours ago

Looking into the last two requests:

food_breed_or_cultivar (e.g. "sockeye")

food_organism_name (e.g. "Oncorhynchus nerka")

For the organism name, I believe it's usually referred to as binomial nomenclature or a scientific name (of an organism), e.g., Brassica oleracea. We also might want to be able to annotate the scientific name of a food which is primarily that organism but with some additives or minimal processing, e.g., dried, salted Nori seaweed, or filleted anchovies canned in oil. Hence I think we should start the attribute name with "food_primary_type". Perhaps we can go with something like food_primary_type_scientific_name.

For cultivar we might want to separate this out more. For example, the International Code of Nomenclature for Cultivated Plants (ICNCP) governs the rules for naming cultivated versions of plants (including algae and fungi). From my understanding of the ICNCP system, it has three possible designations: Cultivar, Group, and Grex.

According to this post, https://biologydictionary.net/cultivar/, the literature can express a plant cultivar name as: Scientific name (Group name) ‘Cultivar name’, an example of which is: Brassica oleracea (Capitata) ‘King Cole’. Note that's a wild cabbage cultivar that I couldn't find elsewhere online.

For MIFC to incorporate the ICNCP system, using a more common cultivar as an example, the Savoy King cultivar of "Savoy Cabbage", we could do the following:

- food_primary_type_scientific_name -> "Brassica oleracea"
- food_primary_type_icncp_group_name -> "Capitata"
- food_primary_type_icncp_cultivar_name -> "Savoy King"
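To show how the three proposed slots could be recombined into the "Scientific name (Group name) 'Cultivar name'" form described above, here is a minimal Python sketch. The function name `format_cultivar_name` is hypothetical, and it uses plain single quotes around the cultivar name as a simplification.

```python
from typing import Optional


def format_cultivar_name(
    scientific_name: str,
    group_name: Optional[str] = None,
    cultivar_name: Optional[str] = None,
) -> str:
    """Join the scientific name, optional Group, and optional cultivar into one string."""
    parts = [scientific_name]
    if group_name:
        parts.append(f"({group_name})")  # Group name goes in parentheses
    if cultivar_name:
        parts.append(f"'{cultivar_name}'")  # cultivar name goes in single quotes
    return " ".join(parts)


# Values taken from the Savoy King example in this comment.
print(format_cultivar_name("Brassica oleracea", "Capitata", "Savoy King"))
# prints: Brassica oleracea (Capitata) 'Savoy King'
```

Keeping the slots separate and only joining them for display means a record with no Group or cultivar still works, since the function then returns just the scientific name.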

It should be noted, though, that "variety" and "form" are sometimes used in scientific names, which, as I understand it, the ICNCP system doesn't use. E.g. Savoy cabbage can have "Brassica oleracea var. sabauda L." as its species name, and also "Brassica oleracea var. capitata f. sabauda". In that case one of those scientific names could go in food_primary_type_scientific_name.

It could also make sense to have a link to an NCBITaxon identifier. In this case "Brassica oleracea var. sabauda" has Taxonomy ID: 1216010, which can be found from the taxon browser https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1216010 as well as in an ontology format from http://purl.obolibrary.org/obo/NCBITaxon_1216010. FoodOn is adding these linkages within their terms, but for MIFC, if not using FoodOn, perhaps having a slot for the NCBITaxon ID could be useful as well, e.g., food_primary_type_scientific_name_taxon_id.

The example of sockeye salmon actually points to a different issue. Oncorhynchus nerka is a unique species, and sockeye is one of several synonyms/common names for that species. At the moment this could be handled in MIFC by annotating the food_primary_type with a vocabulary/ontology system like the FoodOn ontology (FOODON), such as "FOODON:02021848", which has a corresponding food_primary_type_label of "sockeye salmon". Generic food composition tables might not always make use of a controlled vocabulary (but whether we force this to be required in MIFC is a separate conversation).

For animals there exists the International Code of Zoological Nomenclature (ICZN), which in addition to the binomial species name can also include a subspecific or subspecies name, e.g., Giraffa camelopardalis rothschildi. In zoology the subspecies is the only rank below species and can be added to the scientific name as in the example above. So for MIFC, food_primary_type_scientific_name should be sufficient to capture the trinomial name if known.

However, this still doesn't address the issue of breeds. From what I'm gathering in my research, unlike plants, which have a system for cultivar names, animals like dogs or cattle seem to just have names of breeds, e.g., Wagyu cattle, from resources like https://breeds.okstate.edu/cattle/wagyu-cattle.html, that are managed by national registries and breed associations. So perhaps we just need to add a MIFC attribute like food_primary_type_breed_name.

kaiiam commented 10 hours ago

I'm not sure, however, to what extent the ICNCP system is used in food science or botany generally. We would need input from someone with more expertise in the field. Although the content of the previous post probably needs more discussion, for now we could add the following slots:

- food_primary_type_scientific_name
- food_primary_type_icncp_group_name
- food_primary_type_icncp_cultivar_name
- food_primary_type_animal_breed_name

I'd be open to arguments to remove the ICNCP reference if it's not useful but will wait to get more feedback before doing so.

Although more discussion may be helpful, we could also add a slot for the NCBITaxon identifier or ontology CURIE, which could be something like food_primary_type_ncbi_taxon_id.
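If the slot holds a CURIE, converting it to the kind of OBO PURL mentioned earlier in this thread is straightforward. The sketch below assumes the usual OBO pattern of replacing the colon with an underscore after the base http://purl.obolibrary.org/obo/; the function name `curie_to_obo_purl` is hypothetical.

```python
import re

OBO_PURL_BASE = "http://purl.obolibrary.org/obo/"

# A CURIE here is "<prefix>:<local id>", e.g. "NCBITaxon:1216010".
_CURIE_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z][A-Za-z0-9]*):(?P<local>[A-Za-z0-9]+)$")


def curie_to_obo_purl(curie: str) -> str:
    """Expand an OBO-style CURIE into its PURL by swapping ':' for '_'."""
    match = _CURIE_PATTERN.match(curie.strip())
    if match is None:
        raise ValueError(f"Not a recognisable CURIE: {curie!r}")
    return f"{OBO_PURL_BASE}{match['prefix']}_{match['local']}"


print(curie_to_obo_purl("NCBITaxon:1216010"))
# prints: http://purl.obolibrary.org/obo/NCBITaxon_1216010
```

Storing the CURIE rather than the bare number keeps the prefix with the identifier, so the same slot pattern would also work for other OBO ontologies.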