Closed ehwenk closed 2 years ago
Some tricky studies that will benefit from various aspects of these changes:
Hi @ehwenk, I've fixed the issue with not discarding values, but this raises a new issue: it fails when running unit conversions.
@dfalster We didn't think about that - this suggests there has to be a special process for all range values as they are read in, to separate, convert units, and reunite as a range. I'll stop adding custom R code to create ranges (needed in 31 studies) until we have more of a plan.
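A minimal sketch of the "separate, convert, reunite" step, in Python for illustration only (the AusTraits workflow itself is in R). It assumes ranges are written as `low--high`, as proposed elsewhere in this thread, and that the unit conversion is a simple multiplication; `convert_range` and its parameters are hypothetical names, not existing AusTraits functions.

```python
def convert_range(value: str, factor: float, sep: str = "--") -> str:
    """Convert a single value or a 'low--high' range by a multiplicative factor.

    Assumption: the conversion is purely multiplicative (e.g. cm to mm);
    conversions with an offset would need a different function.
    """
    # Separate: a plain value yields one part, a range yields two.
    parts = value.split(sep)
    # Convert each end of the range independently.
    converted = [float(part) * factor for part in parts]
    # Reunite using the same separator, dropping trailing zeros.
    return sep.join(f"{v:g}" for v in converted)


print(convert_range("3--5", 10))   # range in cm converted to mm
print(convert_range("2.5", 0.1))   # single values pass through the same path
```

Treating a single value as a one-element range keeps one code path for both cases, so range rows do not need to be filtered out before conversion.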
We should take the opportunity of making these changes to add a trigger to "detach" the site data from certain measurements. I am wondering if there should be a `basis_of_value` which indicates `literature`, and therefore these values do not have the site & context data linked to them.
A suggestion for how we change the current `observation_id` numbering format when we now call it `entity_id`:
species-level observations would be: `Falster_2003_sp001` (with the number of leading zeros determined using your make_id function)
population-level observations would number from this: `Falster_2003_sp001_pop01`
individual-level observations would number from this: `Falster_2003_sp001_pop01_ind01`
(metapopulations would be keyed as part of populations or species; we can choose, but that is so rarely used)
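The numbering scheme above could be sketched as follows. This is an illustration in Python, not the project's R code: `make_entity_id` is a hypothetical helper, and the fixed zero-padding widths are assumptions (in the proposal the number of leading zeros would come from the existing make_id function).

```python
from typing import Optional


def make_entity_id(dataset_id: str, sp: int, pop: Optional[int] = None,
                   ind: Optional[int] = None, sp_width: int = 3,
                   pop_width: int = 2, ind_width: int = 2) -> str:
    """Build a hierarchical entity_id such as Falster_2003_sp001_pop01_ind01.

    The padding widths are assumptions made for this sketch.
    """
    parts = [dataset_id, f"sp{sp:0{sp_width}d}"]
    # Each lower level is appended only when it is supplied.
    if pop is not None:
        parts.append(f"pop{pop:0{pop_width}d}")
    if ind is not None:
        parts.append(f"ind{ind:0{ind_width}d}")
    return "_".join(parts)


print(make_entity_id("Falster_2003", 1))
print(make_entity_id("Falster_2003", 1, pop=1))
print(make_entity_id("Falster_2003", 1, pop=1, ind=1))
```

Because each level is optional, the same helper also produces the shorter form without a population component, e.g. `make_entity_id("Falster_2003", 1, ind=1)`.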
This way it would be
The downside is that it is a longer identifier than what we currently have.
`s`, `p`, `i` instead of `sp`, `pop`, `ind`. For instance, if all data is at the individual or species level it would be `Falster_2003_sp001_ind01`, or if all data is at the individual level, regardless of context or site, it would be `Falster_2003_ind001`, since there is no mapping to population required. Personally, I think I'd drop `pop` when a study doesn't have any population-level traits, but I'd always retain `sp` as a container, so you'd get down to `Falster_2003_sp001_ind01`.
I don't think population would have any meaning across species in a study (i.e. wouldn't be explicitly linked to context or site); it would just be populations for the species for that study.
There are lots of studies with 1 row of data per species, but a mix of species-level and population-level measurements. Do we have an override under dataset that indicates "don't re-number" or is it fine that these are a mix of `Falster_2003_sp001` and `Falster_2003_sp001_pop1`?
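One consideration for the mixed case: if the identifier stays hierarchical, the level of each row can still be recovered from the ID itself. The sketch below (Python, for illustration) assumes the `sp` component is always retained and that `pop` and `ind` are optional; `entity_level` is a hypothetical helper, not an existing function.

```python
import re

# Assumed ID layout: <dataset_id>_sp<digits>[_pop<digits>][_ind<digits>]
_ID_PATTERN = re.compile(
    r"^(?P<dataset>.+?)_sp(?P<sp>\d+)(?:_pop(?P<pop>\d+))?(?:_ind(?P<ind>\d+))?$"
)


def entity_level(entity_id: str) -> str:
    """Return 'species', 'population' or 'individual' for a hierarchical ID."""
    match = _ID_PATTERN.match(entity_id)
    if match is None:
        raise ValueError(f"unrecognised entity_id: {entity_id!r}")
    # The most specific component present determines the level.
    if match.group("ind") is not None:
        return "individual"
    if match.group("pop") is not None:
        return "population"
    return "species"


for eid in ["Falster_2003_sp001", "Falster_2003_sp001_pop1", "Falster_2003_sp001_ind01"]:
    print(eid, entity_level(eid))
```

An ID that drops the `sp` component entirely would not match this pattern and raises an error, which is one argument for always keeping `sp` as the container.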
I've been thinking more about how to ensure population and individual level measurements are aligned.
As we're now using it, population represents a group of species in a location and under a specific context. However, `context` also includes temporal context, which generally represents repeat observations on an individual and isn't relevant for defining what the meaning of `population` is for that study.
Which made me go another step - when we are recording a column that is the `entity_id` (old `observation_id`), if we use an `entity_id` scheme that is hierarchical, we also need to specify whether that column identifies individuals, populations or species. Therefore, I suggest that instead of calling the field under dataset a generic `entity_id`, it can have 1 of 2 names, `population_id` or `individual_id`. No need for `species_id` because that is always based on the taxon_name.
As now, nothing is specified when there is no need to specify it.
But if the dataset is a mix of `individual` and `population`, then you need to specify a `population_id` to indicate the lumping level; this could be 1 column or several (e.g. site and context, or in the case of Cernusak_2006, site + 1 component of context).
If the dataset has different measures on the same individual in different rows (different traits or repeat measurements) you need to specify an `individual_id`.
There are only 28 studies with both population and individual level measurements, and I've checked the first 6 and they all fit nicely with this scheme.
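To make the "lumping level" idea concrete, here is a rough Python sketch of deriving a population number from one or several columns. The column names (`taxon_name`, `site`, `treatment`) and the helper `assign_population_ids` are hypothetical, and populations are numbered within each species on the assumption that a population only has meaning within a species for a given study.

```python
def assign_population_ids(rows, lumping_columns, width=2):
    """Number distinct combinations of the lumping columns within each species.

    rows: list of dicts, each with a 'taxon_name' key plus the lumping columns.
    Returns one 'popNN' label per row, in first-appearance order.
    """
    per_species = {}  # taxon_name -> {lumping key: population number}
    labels = []
    for row in rows:
        seen = per_species.setdefault(row["taxon_name"], {})
        key = tuple(row[col] for col in lumping_columns)
        # A new combination receives the next number for that species.
        number = seen.setdefault(key, len(seen) + 1)
        labels.append(f"pop{number:0{width}d}")
    return labels


rows = [
    {"taxon_name": "Species A", "site": "site_1", "treatment": "wet"},
    {"taxon_name": "Species A", "site": "site_1", "treatment": "wet"},
    {"taxon_name": "Species A", "site": "site_2", "treatment": "wet"},
    {"taxon_name": "Species B", "site": "site_2", "treatment": "wet"},
]
print(assign_population_ids(rows, ["site", "treatment"]))
```

Passing a single column (e.g. `["site"]`) or several reproduces the "1 column or several" choice described above.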
This "issue" represents some structural updates we're discussing to (1) better capture the entity being measured; (2) better indicate value type; (3) provide a workflow for situations where a trait is repeatedly measured using different methods; (4) provide a workflow for situations where an entity (i.e. plant) is repeatedly measured across time or under different contextual conditions.
It addresses 5 issues: Issue #552, Issue #501, Issue #510, Issue #568, and Issue #567.
In summary:
Our current field `value_type` will be split into 3 fields, `entity_type`, `value_type` and `basis_of_value`.
`entity_type` captures whether the measurement is on an individual, population (1 site), multiple populations (multiple sites; still deciding if meta-population can/should be used), or species. Other projects using the AusTraits workflow might also want to have `genus` as an allowable value.
`value_type` indicates the mathematical manipulation/numeric value category. Allowable value_types will be: raw_value, min, max, mean, mode (for categorical traits), range, and bin (for situations where an author has created explicit bins and trait values are assigned by bin, not as an explicit value or range; this is common for life history trait data). Ranges and bins will be formatted as `3--5`, with the double-dash a way to circumvent Excel reformatting a single dash into a date.
`basis_of_value` will capture how the measurement was determined. We are still finalising the list of allowable values, but suggestions are: `measurement` (trait is measured), `model` (trait value is a model output), `assumed/implied` (for circumstances where there is a small amount of guesswork or assumption made to assign the trait value), `synthesis` (instances where the author is assigning a trait value [usually categorical], based on a synthesis of multiple lines of information; a number of "tolerance-style" and "fire response" life history traits would fit this), and `expert_score` (for categorical traits where someone makes a declaration about the trait value, such as "this is a tree" or "there is a taproot" or "the guard cells have hairs").
Our current `observation_id` will become `entity_id`.
`entity` (not individual) since, as described above, an entity can be an individual, a population, or a species.
`entity` needs to represent the individual across all points in time & contexts, while `observation` would be the points measured at a single point in time or under a single context. It follows, from a data-structure perspective, that `entity` and `observation` are separate concepts that represent different levels in our data-structure hierarchy.
`entity` those trait measurements are assigned the same `entity_id`
`observation_number` and `method_number`. These will default to `1` (or `.na` (?)), with the ability to add an alternative for specific rows of data (for `observation_number`). If AusTraits reads in the same trait name twice, it will simply assign the second set of values for that trait to `method_2`. Having an `observation_number` and `method_number` will make it possible to properly spread all of AusTraits; observation_number and method_number will be retained as rows. Within a single dataset, if the data are spread into wide format, our function would reformat the trait_names as `specific_leaf_area_method_1` and `specific_leaf_area_method_2`.
This would solve Issue #510, because for instances where context applies across rows, we'd reformat the data into a longer format using custom_R_code, then assign both separate contexts and separate observation numbers to the two sets of trait values.
It would also solve Issue #552 - it would replace the need for having a full measurement_ID.
And it would solve the instances described in Issue #567 where the same trait is read in twice with multiple methods, by assigning two separate method_numbers.
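As a rough illustration of how `method_number` could make the wide format unambiguous, the Python sketch below numbers repeated readings of the same trait for an entity and observation, then builds column names of the form `specific_leaf_area_method_1`. The row layout and the helper `spread_wide` are assumptions for this sketch (the real workflow is in R), and it appends the method suffix to every trait, which may not be what is wanted when only one method exists.

```python
from collections import defaultdict


def spread_wide(rows):
    """Spread long-format rows into one dict per (entity_id, observation_number).

    A trait that appears again for the same entity and observation is
    assigned the next method number, in reading order.
    """
    counts = defaultdict(int)   # (entity, observation, trait) -> readings so far
    wide = defaultdict(dict)    # (entity, observation) -> {column: value}
    for row in rows:
        observation = row.get("observation_number", 1)  # default of 1, as proposed
        key = (row["entity_id"], observation)
        counts[(key, row["trait_name"])] += 1
        method = counts[(key, row["trait_name"])]
        column = f'{row["trait_name"]}_method_{method}'
        wide[key][column] = row["value"]
    return dict(wide)


rows = [
    {"entity_id": "Falster_2003_sp001", "trait_name": "specific_leaf_area", "value": 12.0},
    {"entity_id": "Falster_2003_sp001", "trait_name": "specific_leaf_area", "value": 14.5},
    {"entity_id": "Falster_2003_sp001", "trait_name": "leaf_length", "value": 30.0},
]
print(spread_wide(rows))
```

Keeping `observation_number` in the row key rather than in the column name mirrors the idea that observation and method numbers are retained as rows when all datasets are spread together.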
Alternatively, we could have a single very long identifier that combines all possible information, in the format `dataset_id_entity_id_observation_1_trait_name_method_1`, and then it would be decomposed into its various components depending on how you're querying AusTraits.
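A sketch of what composing and decomposing such an identifier might look like, in Python for illustration. It assumes the `entity_id` already begins with the dataset_id (as in `Falster_2003_sp001`), that the literal markers `_observation_` and `_method_` never occur inside an entity_id or trait name, and that both numbers are integers; the function names are hypothetical.

```python
import re


def compose_measurement_id(entity_id, observation_number, trait_name, method_number):
    """Join the components into one long identifier."""
    return (f"{entity_id}_observation_{observation_number}"
            f"_{trait_name}_method_{method_number}")


# The markers '_observation_<n>_' and '_method_<n>' delimit the components.
_MEASUREMENT_PATTERN = re.compile(
    r"^(?P<entity_id>.+?)_observation_(?P<observation_number>\d+)"
    r"_(?P<trait_name>.+)_method_(?P<method_number>\d+)$"
)


def decompose_measurement_id(identifier):
    """Split a long identifier back into its components."""
    match = _MEASUREMENT_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognised identifier: {identifier!r}")
    parts = match.groupdict()
    parts["observation_number"] = int(parts["observation_number"])
    parts["method_number"] = int(parts["method_number"])
    return parts


long_id = compose_measurement_id("Falster_2003_sp001_ind01", 1, "specific_leaf_area", 2)
print(long_id)
print(decompose_measurement_id(long_id))
```

The round trip only works because of the marker assumption above; underscores are already used inside dataset, entity and trait names, so a plain split on `_` would not be enough.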