Reworking value_type and observation_id; Adding observation_number and method_number

ehwenk commented 2 years ago

This "issue" describes some structural updates we're discussing to (1) better capture the entity being measured; (2) better indicate value type; (3) provide a workflow for situations where a trait is repeatedly measured using different methods; and (4) provide a workflow for situations where an entity (e.g. a plant) is repeatedly measured across time or under different contextual conditions.

It addresses 5 issues: Issue #552, Issue #501, Issue #510, Issue #568, and Issue #567.

In summary:

Our current field value_type will be split into 3 fields, entity_type, value_type and basis_of_value.
- entity_type captures whether the measurement is on an individual, population (1 site), multiple populations (multiple sites; still deciding if meta-population can/should be used), or species. Other projects using the AusTraits workflow might also want to have genus as an allowable value.
- value_type indicates the mathematical manipulation/numeric value category. Allowable value_types will be: raw_value, min, max, mean, mode (for categorical traits), range, and bin (for situations where an author has created explicit bins and trait values are assigned by bin, not as an explicit value or range; this is common for life history trait data). Ranges and bins will be formatted as 3--5, with the double-dash a way to circumvent Excel reformatting a single dash into a date.
- basis_of_value will capture how the measurement was determined. We are still finalising the list of allowable values, but suggestions are: measurement (trait is measured), model (trait value is a model output), assumed/implied (for circumstances where a small amount of guesswork or assumption is made to assign the trait value), synthesis (instances where the author is assigning a trait value [usually categorical], based on a synthesis of multiple lines of information; a number of "tolerance-style" and "fire response" life history traits would fit this), and expert_score (for categorical traits where someone makes a declaration about the trait value, such as "this is a tree" or "there is a taproot" or "the guard cells have hairs").
- This would solve Issue #568
- It would also solve Issue #567, in those circumstances where there are separate min and max entries for a trait
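
As a rough illustration of the three-way split, the sketch below checks a record against the allowable values listed above. It is written in Python only for illustration; the names (`ENTITY_TYPES`, `VALUE_TYPES`, `BASIS_OF_VALUE`, `validate_record`) are hypothetical and not part of austraits.build, and the `entity_type` and `basis_of_value` sets are placeholders since those lists are still being decided.

```python
# Hypothetical sketch: names and allowable sets are illustrative only.
ENTITY_TYPES = {"individual", "population", "multiple_populations", "species"}  # placeholder labels
VALUE_TYPES = {"raw_value", "min", "max", "mean", "mode", "range", "bin"}
BASIS_OF_VALUE = {"measurement", "model", "assumed", "synthesis", "expert_score"}  # not yet final


def validate_record(record):
    """Return a list of problems found in one trait record (empty if none)."""
    problems = []
    checks = [
        ("entity_type", ENTITY_TYPES),
        ("value_type", VALUE_TYPES),
        ("basis_of_value", BASIS_OF_VALUE),
    ]
    for field, allowed in checks:
        if record.get(field) not in allowed:
            problems.append(f"{field}: {record.get(field)!r} is not an allowable value")
    # Ranges and bins use a double dash (e.g. 3--5) so Excel does not turn them into dates.
    if record.get("value_type") in {"range", "bin"}:
        parts = str(record.get("value", "")).split("--")
        if len(parts) != 2 or not all(part.strip() for part in parts):
            problems.append("value: ranges and bins must be formatted like '3--5'")
    return problems
```

A record such as `{"entity_type": "individual", "value_type": "range", "basis_of_value": "measurement", "value": "3--5"}` passes, while the same record with `"3-5"` is flagged.
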
Our current observation_id will become entity_id.

We chose the word entity (not individual) since, as described above, an entity can be an individual, a population, or a species
The reason for this change is that there are a few studies where repeat measurements have been made on an individual; in these circumstances entity needs to represent the individual across all points in time and contexts, while observation would be the values measured at a single point in time or under a single context. It follows, from a data-structure perspective, that entity and observation are separate concepts that represent different levels in our data-structure hierarchy.
In addition to changing the name, there are two follow-on changes.
First, for wide datasets, AusTraits currently can't read in a column that indicates core "entity" identifiers. This needs to be fixed, so that if there are multiple rows of data for the same entity those trait measurements are assigned the same entity_id
Second, a single dataset currently often contains a mix of individual-level and species-level trait values. For instance, there are 5 replicate individuals per species on which specific leaf area is measured, but all individuals of a species have the same photosynthetic pathway, since that is a species-level score. The species-level trait values are "deduplicated", but currently a value is simply linked to the first row of data for the species. We will change the workflow so that the species-level value is given a separate entity_id, and it will also have a correspondingly different entity_type. For each species, all species-level traits within the dataset will share an entity_id.
This would solve Issue #501

We will add columns observation_number and method_number. These will default to 1 (or .na (?)), with the ability to add an alternative for specific rows of data (for observation_number). If AusTraits reads in the same trait name twice, it will simply assign the second set of values for that trait to method_2. Having an observation_number and method_number will make it possible to properly spread all of AusTraits; observation_number and method_number will be retained as rows. Within a single dataset, if the data are spread into wide format, our function would reformat the trait_names as specific_leaf_area_method_1 and specific_leaf_area_method_2.
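
To make the proposed wide-format naming concrete, here is a minimal Python sketch of spreading long rows into wide rows. The function `spread_traits` and the row dictionaries are hypothetical; the real workflow is in R and may differ. The sketch assumes every trait gets a `_method_N` suffix, which is one possible reading of the proposal.

```python
from collections import defaultdict


def spread_traits(rows):
    """Spread long-format rows into one wide row per (entity_id, observation_number).

    Assumption: every trait column is suffixed with its method_number,
    e.g. specific_leaf_area_method_1 and specific_leaf_area_method_2.
    """
    wide = defaultdict(dict)
    for row in rows:
        key = (row["entity_id"], row["observation_number"])
        column = f"{row['trait_name']}_method_{row['method_number']}"
        wide[key][column] = row["value"]
    return dict(wide)


# Example: the same trait read in twice with two methods for one entity.
rows = [
    {"entity_id": "e1", "observation_number": 1, "method_number": 1,
     "trait_name": "specific_leaf_area", "value": 10.0},
    {"entity_id": "e1", "observation_number": 1, "method_number": 2,
     "trait_name": "specific_leaf_area", "value": 12.0},
]
wide = spread_traits(rows)
```

Here `entity_id` and `observation_number` identify the wide row, so two methods for the same trait no longer collide when spreading.
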

This would solve Issue #510, because for instances where context applies across rows, we'd reformat the data into a longer format using custom_R_code, then assign both separate contexts and separate observation numbers to the two sets of trait values.
It would also solve Issue #552 - it would replace the need for having a full measurement_ID
And it would solve the instances described in Issue #567 where the same trait is read in twice with multiple methods, by assigning two separate method_numbers.
Alternatively, we could have a single very long identifier that combines all possible information, in the format: dataset_id_entity_id_observation_1_trait_name_method_1 and then it would be decomposed to its various components depending on how you're querying AusTraits.

ehwenk commented 2 years ago

Some tricky studies that will benefit from various aspects of these changes:

[x] Choat_2006 - repeat measurements on the same individual at different times of day. Currently this information is listed in context, but we don't have any way to indicate that these are repeat measurements on the same individual.

dfalster commented 2 years ago

Hi @ehwenk , I've fixed the issue with not discarding values, but this raises a new issue: it fails when running unit conversions

ehwenk commented 2 years ago

@dfalster We didn't think about that - this suggests there has to be a special process for all range values as they are read in, to separate, convert units, and reunite as a range. I'll stop adding custom R code to create ranges (needed in 31 studies) until we have more of a plan.
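
A minimal sketch of the "separate, convert units, reunite" step, in Python for illustration only. The function name `convert_range` is hypothetical, and it assumes the unit conversion is a simple multiplication by a factor, which will not cover every conversion.

```python
def convert_range(value, factor):
    """Split a '3--5' range, convert each bound, and rejoin with a double dash.

    Assumption: the unit conversion is linear (multiply by `factor`).
    The 'g' format keeps at most 6 significant digits.
    """
    low, high = (float(part) for part in value.split("--"))
    return f"{low * factor:g}--{high * factor:g}"


converted = convert_range("3--5", 10)  # e.g. cm to mm
```

Values that are not ranges would still go through the existing conversion path; only rows with value_type range (or bin) would need this extra step.
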

ehwenk commented 2 years ago

We should take the opportunity of making these changes to add a trigger to "detach" the site data from certain measurements. I am wondering if there should be a basis_of_value which indicates literature and therefore these values do not have the site & context data linked to them.

ehwenk commented 2 years ago

A suggestion for how we change the current observation_id numbering format when we now call it entity_id

- species-level observations would be: Falster_2003_sp001 (with the number of leading zeros determined using your make_id function)
- population-level observations would number from this: Falster_2003_sp001_pop01
- individual-level observations would number from this: Falster_2003_sp001_pop01_ind01

(Metapopulations would be keyed as part of populations or species; we can choose, but that is so rarely used.)

This way:

- it would be easy to write a function that creates a wide dataset, because linkages between different entity resolutions in a study are retained, not numbered separately/sequentially
- for studies where some traits are a single bulked or mean population value and others are individual replicates, the link to being part of the same "population" (same site or context) would be retained.

The downside is that it is a longer identifier than what we currently have.

We could drop underscores and just use s, p, i instead of sp, pop, ind.
We could drop certain pieces of the name if they don't pertain to a study.

For instance, if all data are at the individual or species level it would be Falster_2003_sp001_ind01, or if all data are at the individual level, regardless of context or site, it would be Falster_2003_ind001, since there is no mapping to population required. Personally, I think I'd drop pop when a study doesn't have any population-level traits, but I'd always retain sp as a container, so you'd get down to Falster_2003_sp001_ind01.
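
The hierarchical numbering could be sketched as below. This is a Python illustration only; `make_entity_id` and `parent_entity_id` are hypothetical helpers (not the existing make_id function), and the padding widths are assumptions taken from the examples in this comment.

```python
def make_entity_id(dataset_id, species=None, population=None, individual=None,
                   widths=(3, 2, 2)):
    """Build a hierarchical entity_id, skipping any level that is None."""
    parts = [dataset_id]
    levels = zip(("sp", "pop", "ind"), (species, population, individual), widths)
    for prefix, number, width in levels:
        if number is not None:
            parts.append(f"{prefix}{number:0{width}d}")  # zero-pad to the given width
    return "_".join(parts)


def parent_entity_id(entity_id):
    """Return the id one level up, or None for ids that end at species or dataset level."""
    head, _, tail = entity_id.rpartition("_")
    return head if tail.startswith(("pop", "ind")) else None


full_id = make_entity_id("Falster_2003", species=1, population=1, individual=1)
no_pop_id = make_entity_id("Falster_2003", species=1, individual=1)
```

Because each id contains the ids of its parents, the population or species a measurement belongs to can be recovered by trimming the last component.
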

I don't think population would have any meaning across species in a study (i.e. wouldn't be explicitly linked to context or site); it would just be populations for the species for that study.

There are lots of studies with 1 row of data per species, but a mix of species-level and population-level measurements. Do we have an override under dataset that indicates "don't re-number", or is it fine that these are a mix of Falster_2003_sp001 and Falster_2003_sp001_pop1?

ehwenk commented 2 years ago

I've been thinking more about how to ensure population and individual level measurements are aligned.

As we're now using it, population represents a group of species in a location and under a specific context. However, context also includes temporal context, which generally represents repeat observations on an individual and isn't relevant for defining what the meaning of population is for that study.

Which made me go another step - when we are recording a column that is the entity_id (old observation_id), if we use an entity_id scheme that is hierarchical, we also need to specify whether that column identifies individuals, populations or species. Therefore, I suggest that instead of calling the field under dataset a generic entity_id it can have 1 of 2 names, population_id or individual_id. No need for species_id because that is always based on the taxon_name.

As we do now, nothing is specified if there is no need to specify it.

As per now, long datasets default to species or species x site.
If in a wide dataset each row represents a single species, population or individual measure, there is also no need to specify a column.

But if the dataset is a mix of individual and population, then you need to specify a population_id to indicate the lumping level; this could be 1 column or several (e.g. site and context, or in the case of Cernusak_2006, site + 1 component of context)

If the dataset has different measures on the same individual in different rows (different traits or repeat measurements) you need to specify an individual_id.
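
As a sketch of how a population_id built from one or several columns might group rows, consider the Python illustration below. The function `assign_population_ids` and the column names are hypothetical; numbering restarts within each taxon_name, on the assumption that populations only have meaning within a species in a study.

```python
def assign_population_ids(rows, id_columns):
    """Number populations within each species by the unique values of id_columns."""
    seen = {}  # taxon_name -> {tuple of id column values: population number}
    out = []
    for row in rows:
        per_species = seen.setdefault(row["taxon_name"], {})
        key = tuple(row[col] for col in id_columns)
        # Numbers are assigned in order of first appearance within the species.
        number = per_species.setdefault(key, len(per_species) + 1)
        out.append({**row, "population_id": number})
    return out


rows = [
    {"taxon_name": "Acacia aneura", "site": "A", "treatment": "wet"},
    {"taxon_name": "Acacia aneura", "site": "A", "treatment": "dry"},
    {"taxon_name": "Acacia aneura", "site": "A", "treatment": "wet"},
    {"taxon_name": "Banksia serrata", "site": "A", "treatment": "dry"},
]
grouped = assign_population_ids(rows, ["site", "treatment"])
```

Passing `["site"]` alone would lump both treatments into one population, which is the kind of choice the population_id field would record per dataset.
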

There are only 28 studies with both population- and individual-level measurements; I've checked the first 6 and they all fit nicely with this scheme.

ehwenk commented 2 years ago

closed with 4dab6ab

traitecoevo / austraits.build

Reworking value_type and observation_id; Adding observation_number and method_number #576