Closed ehwenk closed 2 years ago
@dfalster
Talking this through and looking back at the ontology for our database structure, it is clear that context as defined represents contexts that apply to different entities (species, population, individual, point-in-time) and also at the measurement level. I managed to partially unwind this within individual metadata files by adding extra temporary identifiers. But this is neither a clear solution nor one that is sufficiently transparent for others to implement easily. I therefore recommend that we split context into 4 sections, reflecting types of context:
location_contexts:
Contexts that are linked to individuals that represent a sub-population within a site_name. For instance, there are individuals with different seed provenance, vegetation type, natural fire history, or plant sex.
treatment_contexts:
All contexts reflecting explicit, researcher-enacted experimental treatments across individuals will be captured under this context. These include temperature, CO2, and light manipulations.
observation_contexts:
To capture repeat measurements made on a single individual or on a population across time. Such measurements are most often linked to sampling season or a point in a plant's life cycle, but in some experimental datasets they could be repeat measurements at different times of day or at different points in a watering cycle.
measurement_contexts:
To capture multiple measurements made as part of a single observation, such as measurements at different canopy positions, on different leaf ages, on different branch thicknesses, or using subtly different methods.
location_contexts and treatment_contexts are both contexts that are linked either to individuals or to populations, depending on the entity_type. The format for these will remain the same as before, with invented context_names that reflect a merging of all factorial context combinations, and a linked table with details about the individual properties. (NOTE: these could remain merged - and maybe should - let's discuss pros & cons of each approach.)
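As a rough illustration of "invented context_names that reflect a merging of all factorial context combinations", here is a minimal Python sketch (not the AusTraits R code). The function name, the underscore separator, and the example treatment values are all hypothetical.

```python
from itertools import product

def build_context_table(properties):
    """Return {context_name: {property: value}} for every factorial combination.

    `properties` maps a context property to its list of values.
    The "_" separator used to merge values is an assumption.
    """
    names = list(properties)
    table = {}
    for combo in product(*(properties[name] for name in names)):
        context_name = "_".join(str(value) for value in combo)
        # The linked table keeps the individual properties behind each merged name.
        table[context_name] = dict(zip(names, combo))
    return table

# Hypothetical 2 x 2 experiment
treatments = {"temperature": ["ambient", "elevated"], "CO2": ["400ppm", "640ppm"]}
context_table = build_context_table(treatments)
```

The merged name goes in the data, and the table is what lets a user recover the individual properties later.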
observation_contexts: These will be identified by sequential numbers (observation_number) that are assigned separately via a column, custom_R_code, or using the observation_number field linked to individual traits. Because sometimes observation numbers are sequential columns and other times rows of data, this flexibility is required. Then there will be a field within dataset called observation_contexts where information will be documented in this way:
observation_contexts:
  1: summer observation
  2: winter observation
--or--
observation_contexts:
  1: Well-watered plant measurement; Measurements made on the morning following a watering event (wet cycle) when the plants were at their least water-limited.
  2: Droughted plant measurement; Measurements made on the final day of a watering cycle (dry cycle) when the low water plants were at the driest point in the cycle.
method_contexts: These will be identified by sequential numbers (method_number) that are assigned separately via a column, custom_R_code, or using the method_number field linked to individual traits. They will be documented as per observation_contexts:
method_contexts:
  1: fresh leaves (indicating amount of leaf moisture)
  2: oven-dried leaves (indicating amount of leaf moisture)
  3: senesced leaves (indicating amount of leaf moisture)
--or--
method_contexts:
  1: Measurements made on young leaves.
  2: Measurements made on expanding leaves.
  3: Measurements made on old leaves.
Happy to think through other solutions! (Will post a separate issue for unwinding entity_id)
I'm not quite sure how to code in information for contexts where some rows lack a value. For now, I'm doing the following for NA's, simply so we can find them quickly:
- find: .na
  replace: .na
  description: .na
In order to make it possible for austraits to "pivot wider" by some combination of columns, I used "method number" to capture instances where a dataset had two rows of data for the same "entity". There are a number of datasets with species-level data with multiple rows of information for the same species. These now have sequential "method numbers", but the numbers have no inherent meaning. They are literally just repeat measurements, usually from a compilation with multiple reported measurements per species and no dates. They really aren't contexts. Just multiple "observations" in the sense of OBOE.
measurement remarks, and method_number was simply a "counter" to allow pivoting. This means we need to be able to recognise multiple method_context columns created within the traits. For now I'm calling them method_context, then method_context2. I'm not sure how the script is set up right now - is there a controlled list of column names that are recognised within the traits section of the metadata file?
Studies to look back at:
original_dataset_id; for many of these studies the same species has records from different datasets, so one needs to create sequential observation_id values/capture original dataset_id as a context. But we don't want to have to list all the values.
repeated observations. Applies to: Buckton_2019, Cooper_2013, Crous_2019, Kew_2019_2, Kew_2019_4, Lewis_2015, NSWFRD_2014, Richards_2008, Schmidt_1997, White_2020
There are a number of studies where there are repeat observations on the same entity, without any context to divide the different observations into specific groups. They are really just repeat observations across time. In some, the date column captures this, but this isn't always the case.
In order to make it possible for AusTraits to "pivot wider" by some combination of columns, I used observation number or method number to capture instances where a dataset had two rows of data for the same entity. I would group on all available contexts, sites, taxon names, individual id's and then simply add row numbers within each group. This creates sequential observation numbers, but the numbers have no inherent meaning.
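The "group, then add row numbers within each group" step can be sketched as follows. This is plain Python for illustration rather than the R used in the actual pipeline, and the column names and example rows are hypothetical.

```python
from collections import defaultdict

def add_observation_numbers(rows, group_keys):
    """Number rows sequentially within each group defined by `group_keys`.

    The numbers only make otherwise identical rows distinguishable
    (so a later pivot is possible); they carry no inherent meaning.
    """
    counters = defaultdict(int)
    numbered = []
    for row in rows:
        key = tuple(row.get(k) for k in group_keys)
        counters[key] += 1
        numbered.append({**row, "observation_number": counters[key]})
    return numbered

# Hypothetical species-level compilation with repeat rows for one taxon
rows = [
    {"taxon_name": "Acacia aneura", "site_name": "site_A", "leaf_area": 12.1},
    {"taxon_name": "Acacia aneura", "site_name": "site_A", "leaf_area": 13.4},
    {"taxon_name": "Eucalyptus saligna", "site_name": "site_A", "leaf_area": 40.2},
]
numbered_rows = add_observation_numbers(rows, ("taxon_name", "site_name"))
```

Because the counter follows row order, the same input order is needed to reproduce the same numbers.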
For now repeat observations are read in as a context, but it doesn't make sense to list all the values in the metadata file. They are contexts of an observationCollection in the sense of OBOE, defining distinct observations.
It would be nice to not have to list the following, sometimes up to much higher numbers:
repeat observations:
  var_in: observation_number
  category: observation
  values:
  - find: 1
    replace: 1
    description: 1
  - find: 2
    replace: 2
    description: 2
but instead simply:
repeat observations:
  var_in: observation_number
  category: observation
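One way the shorter form could work is for the reader to fill in the values list itself whenever it is omitted, using the unique values found in the var_in column. The sketch below is a Python illustration of that idea only. The function name and the dictionary representation of a context entry are assumptions, not the existing AusTraits code.

```python
def expand_context_values(context, column):
    """Return `context` with a `values` list, generated if it was omitted.

    When no values are listed, each unique non-missing entry of `column`
    (the data read from `var_in`) is used as find, replace and description.
    """
    if context.get("values"):
        return context  # explicit values win; nothing to generate
    unique = []
    for value in column:
        if value is not None and value not in unique:
            unique.append(value)
    values = [{"find": v, "replace": v, "description": v} for v in unique]
    return {**context, "values": values}

short_form = {"var_in": "observation_number", "category": "observation"}
expanded = expand_context_values(short_form, [1, 2, 1, 3])
```

The same fallback would cover a column with 50+ entries without anyone listing them by hand.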
original_dataset_id. Applies to: Stephens_2021, Roderick_1999, Choat_2012, Groom_2012, Zanne_2009, Richards_2008, all Kew studies
There are several compilations where authors have indicated the primary reference as a column. I generally read this into measurement remarks, which works for some studies. But in other datasets, there are a few/many species measured by multiple sources. I'd previously mutated "method numbers" for these, usually simply row numbers, ignoring the actual source. One possibility is to list the "original_dataset_id" as an "observation context". But for some of the studies there are many (50+) individual references and it doesn't make sense to have to list these all under context values.
Or do we want to find a different way to deal with these original references? Some way to capture original_dataset_id separately from dataset_id?
population_id, but of course exceptions arise. In Firn_2019 and Crous_2019 (both experimental) there are "blocks". For both we need to create population_id values that merge site x block x treatment contexts, or we need to create replicate "sites" that represent just a single "block". I prefer designating a population_id, but we should discuss.
New issues arising from new context code:
find value is numeric, matching is (mostly) not happening to create link_IDs - see Duan_2015 as one example. .... no, the problem is broader than just numeric. Also true for Firn_2019.
id's for various categories, because otherwise this information gets dropped. The id's are the only way to retain observation numbers. I think it would be easiest to simply use the value from the original column as the replace value if find is NA (i.e. doesn't exist) - ONE SOLUTION is to fill in find only, and then add this line at line 214: mutate(replace = ifelse(is.na(replace), find, replace)).
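The fallback in the mutate line above (use find wherever replace is missing) can be sketched in Python to make the intended behaviour explicit. This is an illustration only. The list-of-dicts representation and the function name are assumptions, and the real fix would stay in the R script.

```python
def fill_replace(values):
    """Copy `find` into `replace` wherever `replace` is missing (None).

    Mirrors mutate(replace = ifelse(is.na(replace), find, replace)).
    """
    filled = []
    for entry in values:
        replace = entry.get("replace")
        if replace is None:
            replace = entry["find"]  # keep the original column value
        filled.append({**entry, "replace": replace})
    return filled

# Hypothetical context values where only `find` was filled in for the first entry
values = [
    {"find": 1, "replace": None},
    {"find": "dry", "replace": "drought treatment"},
]
filled_values = fill_replace(values)
```

Numeric find values pass through unchanged here. Whether they then match during linking is a separate question from this fallback.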
One of the more complex parts of my entity_id coding is how to partition context information between properties that belong in population_id and contextual properties that broadly reflect shifts in time or variations in sampling method. My scheme works, but it is fussy to implement for studies where unique combinations of context_name and site_name are not the basis of creating population_id values.
Most contextual properties capture experimental treatments or some aspect of population that isn’t reflected in site (most often some stratified sampling at each site, so different “populations” of plants are under different contexts). However, there are a number of contexts that represent time (sampling season) or some component of method (old vs young leaves; sun vs shade leaves).
I have started remapping any context that is effectively a sampling strategy or method to method_number matched with information in measurement_remarks. It is much less elegant than having a context property (e.g. leaf position) and the different values (e.g. canopy, understory) mapped as contexts, but simpler for building population_id. Much harder is how to treat contexts that reflect shifts in time, because it generally isn't the same individuals being resampled. And measurement_remarks is meant to pair with method information, not time information.
We should brainstorm some more about this, but one option is to have a code, listed with the context properties, that reflects whether a context property captures different populations, treatments, time or methods. The first two would indicate variation that needs to be captured in population_id, while time-related context properties would auto-generate a sequential observation_number, and method-related ones a sequential method_number. Then context could continue to capture all curated contextual information.
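To make that option concrete, here is a minimal Python sketch of a category code attached to each context property, deciding which identifier the property feeds into. The mapping follows the four categories named above. The function name and the example properties are hypothetical.

```python
# Which identifier each category of context property contributes to
CATEGORY_TARGETS = {
    "populations": "population_id",
    "treatments": "population_id",
    "time": "observation_number",
    "methods": "method_number",
}

def route_context_properties(properties):
    """Group context properties by the identifier their category feeds into.

    `properties` maps a context property name to its category code.
    An unrecognised category raises KeyError rather than being dropped silently.
    """
    routed = {"population_id": [], "observation_number": [], "method_number": []}
    for name, category in properties.items():
        routed[CATEGORY_TARGETS[category]].append(name)
    return routed

# Hypothetical context properties for one dataset
routed = route_context_properties({
    "fire history": "populations",
    "CO2 level": "treatments",
    "sampling season": "time",
    "leaf position": "methods",
})
```

With this split, only the properties routed to population_id enter the site x context combinations, and the time and method properties simply drive their own counters.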