traitecoevo / austraits.build

Source for AusTraits

Reworking value_type and observation_id; Adding observation_number and method_number #576

Closed ehwenk closed 2 years ago

ehwenk commented 2 years ago

This "issue" represents some structural updates we're discussing to (1) better capture the entity being measured; (2) better indicate value type; (3) provide a workflow for situations where a trait is repeatedly measured using different methods; and (4) provide a workflow for situations where an entity (i.e. plant) is repeatedly measured across time or under different contextual conditions.

It addresses 5 issues: Issue #552, Issue #501, Issue #510, Issue #568, and Issue #567.

In summary:

  1. Our current field value_type will be split into 3 fields, entity_type, value_type and basis_of_value.

    • entity_type captures whether the measurement is on an individual, population (1 site), multiple populations (multiple sites; still deciding if meta-population can/should be used), or species. Other projects using the AusTraits workflow might also want to have genus as an allowable value.
    • value_type indicates the mathematical manipulation/numeric value category. Allowable value_types will be: raw_value, min, max, mean, mode (for categorical traits), range, and bin (for situations where an author has created explicit bins and trait values are assigned by bin, not as an explicit value or range; this is common for life history trait data). Ranges and bins will be formatted as 3--5, with the double-dash a way to circumvent Excel reformatting a single dash into a date.
    • basis_of_value will capture how the measurement was determined. We are still finalising the list of allowable values, but suggestions are: measurement (trait is measured), model (trait value is a model output), assumed/implied (for circumstances where there is a small amount of guesswork or assumption made to assign the trait value), synthesis (instances where the author is assigning a trait value [usually categorical], based on a synthesis of multiple lines of information; a number of "tolerance-style" and "fire response" life history traits would fit this), and expert_score (for categorical traits where someone makes a declaration about the trait value, such as "this is a tree" or "there is a taproot" or "the guard cells have hairs").
    • This would solve Issue #568
    • It would also solve Issue #567, in those circumstances where there are separate min and max entries for a trait
  2. Our current observation_id will become entity_id.

  3. We will add columns observation_number and method_number. These will default to 1 (or .na (?)), with the ability to add an alternative for specific rows of data (for observation_number). If AusTraits reads in the same trait name twice, it will simply assign the second set of values for that trait to method_2. Having an observation_number and method_number will make it possible to properly spread all of AusTraits; observation_number and method_number will be retained as rows. Within a single dataset, if the data are spread into wide format, our function would reformat the trait_names as specific_leaf_area_method_1 and specific_leaf_area_method_2.
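
To make the method_number idea concrete, here is a minimal Python sketch (not the AusTraits R code) of how repeated trait names could be numbered in read order and then turned into wide-format column names. The function names `assign_method_numbers` and `wide_column_names` are hypothetical, and the sketch assumes every trait gets a suffix, including traits measured with only one method.

```python
from collections import defaultdict


def assign_method_numbers(trait_names):
    """Return a method_number for each trait name, counting repeats in read order."""
    seen = defaultdict(int)
    numbers = []
    for name in trait_names:
        seen[name] += 1  # first occurrence -> 1, second -> 2, ...
        numbers.append(seen[name])
    return numbers


def wide_column_names(trait_names):
    """Build wide-format names such as specific_leaf_area_method_1 (assumed naming)."""
    numbers = assign_method_numbers(trait_names)
    return [f"{name}_method_{n}" for name, n in zip(trait_names, numbers)]


# The same trait read in twice gets method numbers 1 and 2.
traits = ["specific_leaf_area", "leaf_area", "specific_leaf_area"]
columns = wide_column_names(traits)
```

Whether single-method traits should keep a bare name in wide format is left open here; that would be a one-line change to `wide_column_names`.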
ehwenk commented 2 years ago

Some tricky studies that will benefit from various aspects of these changes:

dfalster commented 2 years ago

Hi @ehwenk, I've fixed the issue with not discarding values, but this raises a new issue: it fails when running unit conversions.

ehwenk commented 2 years ago

@dfalster We didn't think about that - this suggests there has to be a special process for all range values as they are read in, to separate, convert units, and reunite as a range. I'll stop adding custom R code to create ranges (needed in 31 studies) until we have more of a plan.
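
A rough Python sketch of the "separate, convert units, and reunite" step might look like the following. It is only an illustration, not the AusTraits implementation: `convert_range` and `convert_value` are hypothetical names, and the conversion is assumed to be a simple multiplicative factor (conversions that need an offset would require more than this).

```python
def convert_range(value, factor):
    """Split a 'low--high' string, scale both ends, and rejoin with '--'."""
    low, high = value.split("--", 1)
    # ':g' drops trailing zeros, so 30.0 is written as '30'
    return f"{float(low) * factor:g}--{float(high) * factor:g}"


def convert_value(value, value_type, factor):
    """Convert one value; ranges and bins are handled end by end (assumed rule)."""
    if value_type in ("range", "bin"):
        return convert_range(value, factor)
    return f"{float(value) * factor:g}"


# Example: converting cm to mm with a factor of 10
converted_range = convert_value("3--5", "range", 10)
converted_mean = convert_value("4", "mean", 10)
```

Keeping the range as a single "low--high" string after conversion preserves the double-dash format, so the value still survives a round trip through Excel.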

ehwenk commented 2 years ago

We should take the opportunity of making these changes to add a trigger to "detach" the site data from certain measurements. I am wondering if there should be a basis_of_value that indicates literature, so that these values do not have the site & context data linked to them.

ehwenk commented 2 years ago

A suggestion for how we change the current observation_id numbering format when we now call it entity_id:

  • species-level observations would be: Falster_2003_sp001 (with the number of leading zeros determined using your make_id function)
  • population-level observations would number from this: Falster_2003_sp001_pop01
  • individual-level observations would number from this: Falster_2003_sp001_pop01_ind01

(metapopulations would be keyed as part of populations or species; we can choose, but that is so rarely used)

This way it would be

The downside is that it is a longer identifier than what we currently have.

For instance, if all data are at the individual or species level, it would be Falster_2003_sp001_ind01; or, if all data are at the individual level, regardless of context or site, it would be Falster_2003_ind001, since there is no mapping to population required. Personally, I think I'd drop pop when a study doesn't have any population-level traits, but I'd always retain sp as a container, so you'd get down to Falster_2003_sp001_ind01.
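
The hierarchical scheme, including dropping pop when a study has no population-level traits, could be sketched in Python as below. This is an assumption-laden illustration: `make_entity_id` is a hypothetical name, and the zero-padding widths are fixed parameters here, whereas in AusTraits the number of leading zeros would come from the existing make_id function, whose behaviour is not reproduced.

```python
def make_entity_id(dataset_id, sp, pop=None, ind=None,
                   sp_width=3, pop_width=2, ind_width=2):
    """Build a hierarchical entity_id; pop and ind levels are optional."""
    parts = [dataset_id, f"sp{sp:0{sp_width}d}"]  # sp is always kept as a container
    if pop is not None:
        parts.append(f"pop{pop:0{pop_width}d}")
    if ind is not None:
        parts.append(f"ind{ind:0{ind_width}d}")
    return "_".join(parts)


species_id = make_entity_id("Falster_2003", 1)
population_id = make_entity_id("Falster_2003", 1, pop=1)
individual_id = make_entity_id("Falster_2003", 1, pop=1, ind=1)
individual_no_pop = make_entity_id("Falster_2003", 1, ind=1)
```

Because each level is appended only when supplied, a dataset with a mix of species-level and population-level rows would naturally produce a mix of shorter and longer identifiers.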

I don't think population would have any meaning across species in a study (i.e. wouldn't be explicitly linked to context or site); it would just be populations for the species for that study.

There are lots of studies with 1 row of data per species, but a mix of species-level and population-level measurements. Do we have an override under dataset that indicates "don't re-number", or is it fine that these are a mix of Falster_2003_sp001 and Falster_2003_sp001_pop1?

ehwenk commented 2 years ago

I've been thinking more about how to ensure population and individual level measurements are aligned.

As we're now using it, population represents a group of species in a location and under a specific context. However, context also includes temporal context, which generally represents repeat observations on an individual and isn't relevant for defining what the meaning of population is for that study.

Which made me go another step - when we are recording a column that is the entity_id (old observation_id), if we use an entity_id scheme that is hierarchical, we also need to specify whether that column is specifying individuals, populations or species. Therefore, I suggest that instead of calling the field under dataset a generic entity_id, it can have 1 of 2 names, population_id or individual_id. No need for species_id because that is always based on the taxon_name.

As we do now, nothing is specified if there is no need to specify it.

But if the dataset is a mix of individual and population, then you need to specify a population_id to indicate the lumping level; this could be 1 column or several (e.g. site and context, or in the case of Cernusak_2006, site + 1 component of context)

If the dataset has different measures on the same individual in different rows (different traits or repeat measurements) you need to specify an individual_id.

There are only 28 studies with both population and individual level measurements, and I've checked the first 6 and they all fit nicely with this scheme.

ehwenk commented 2 years ago

closed with 4dab6ab