Some contexts reflect varying methods or time

ehwenk commented 2 years ago

One of the more complex parts of my entity_id coding is deciding which context information feeds into population_id and which contextual properties merely reflect shifts in time or variations in sampling method. My scheme works, but it is fussy to implement for studies where unique combinations of context_name and site_name are not the basis for creating population_id values.

Most contextual properties capture experimental treatments or some aspect of population that isn’t reflected in site (most often some stratified sampling at each site, so different “populations” of plants are under different contexts). However, a number of contexts represent time (sampling season) or some component of method (old vs young leaves; sun vs shade leaves).

I have started remapping any context that is effectively a sampling strategy or method to method_number matched with information in measurement_remarks. It is much less elegant than having a context property (e.g. leaf position) and the different values (e.g. canopy, understory) mapped as contexts, but simpler for building population_id. Much harder is how to treat contexts that reflect shifts in time, because it generally isn't the same individuals being resampled. And measurement_remarks is meant to pair with method information, not time information.

We should brainstorm some more about this, but one option is to add a code, alongside the listed context properties, that indicates whether each property captures different populations, treatments, times or methods. The first two would indicate variation that needs to be captured in population_id, while time-related context properties would auto-generate a sequential observation_number, and method-related ones a sequential method_number. Then context could continue to capture all curated contextual information.
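
A minimal sketch of how such a code could drive the numbering, assuming a hypothetical lookup (`context_category`) that tags each context property, plus made-up column names and data. None of this is existing austraits.build code.

```r
# Hypothetical tags: the kind of variation each context property captures.
context_category <- c(
  seed_provenance = "population",
  co2_treatment   = "treatment",
  sampling_season = "time",
  leaf_age        = "method"
)

# Hypothetical raw data with one column per context property.
data <- data.frame(
  seed_provenance = c("coastal", "coastal", "inland", "inland"),
  co2_treatment   = c("ambient", "ambient", "ambient", "ambient"),
  sampling_season = c("summer", "winter", "summer", "winter"),
  leaf_age        = c("young", "young", "old", "old"),
  stringsAsFactors = FALSE
)

# Number the distinct value combinations of `cols` in order of first appearance.
sequential_id <- function(df, cols) {
  key <- do.call(paste, c(df[cols], sep = "|"))
  match(key, unique(key))
}

time_cols   <- names(context_category)[context_category == "time"]
method_cols <- names(context_category)[context_category == "method"]

# Time-related properties generate observation_number, method-related ones method_number.
data$observation_number <- sequential_id(data, time_cols)
data$method_number      <- sequential_id(data, method_cols)
```

Properties tagged "population" or "treatment" would be left for building population_id, and every column could still be recorded as a context.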

ehwenk commented 2 years ago

@dfalster

Talking this through and looking back at the ontology for our database structure, it is clear that context as defined represents contexts that apply to different entities (species, population, individual, point-in-time) and also at the measurement-level. I managed to partially unwind this within individual metadata files by adding extra temporary identifiers. But this is neither a clear solution nor one that is sufficiently transparent for others to implement easily. I therefore recommend that we split context into 4 sections, reflecting types of context:

location_contexts: Contexts that are linked to individuals that represent a sub-population within a site_name. For instance, there are individuals with different seed provenance, vegetation type, natural fire history, plant sex.

treatment_contexts: All contexts reflecting explicit, researcher-enacted experimental treatments across individuals will be captured under this context. These include temperature, CO2, and light manipulations.

observation_contexts: To capture repeat measurements made on a single individual or at a population across time. Such measurements are most often linked to sampling season or point in a plant's life cycle, but in some experimental datasets, it could be repeat measurements at different times of day or different points in a watering cycle.

measurement_contexts: To capture multiple measurements made as part of a single observation, such as measurements at different canopy positions, on different leaf ages, on different branch thicknesses, or using subtly different methods.

location_contexts and treatment_contexts are both contexts that are alternatively linked to individuals (if that is the entity_type) or populations (if that is the entity_type). The format for these will remain the same as before, with invented context_names that reflect a merging of all factorial context combinations, and a linked table with details about the individual properties. (NOTE: these could remain merged - and maybe should - let's discuss pros & cons of each approach.)

observation_contexts: These will be identified by sequential numbers (observation_number) that are assigned separately via a column, custom_R_code, or using the observation_number field linked to individual traits. Because sometimes observation numbers are sequential columns and other times rows of data, this flexibility is required. Then there will be a field within dataset called observation_contexts where information will be documented in this way:

observation_contexts: 
  1: summer observation
  2: winter observation

--or--

observation_contexts:
  1: Well-watered plant measurement; Measurements made on the morning following a watering event (wet cycle) when the plants were at their least water-limited.
  2: Droughted plant measurement; Measurements made on the final day of a watering cycle (dry cycle) when the low water plants were at the driest point in the cycle.

method_contexts: These will be identified by sequential numbers (method_number) that are assigned separately via a column, custom_R_code, or using the method_number field linked to individual traits. They will be documented as per observation_contexts:

method_contexts:
  1: fresh leaves (indicating amount of leaf moisture)
  2: oven-dried leaves (indicating amount of leaf moisture)
  3: senesced leaves (indicating amount of leaf moisture)

-- or --

method_contexts:
  1: Measurements made on young leaves. 
  2: Measurements made on expanding leaves. 
  3: Measurements made on old leaves.

Happy to think through other solutions! (Will post a separate issue for unwinding entity_id)

ehwenk commented 2 years ago

I'm not quite sure how to code in information for contexts where some rows lack a value. For now, I'm doing the following for NA's, simply so we can find them quickly:

    - find: .na
      replace: .na
      description: .na

ehwenk commented 2 years ago

In order to make it possible for austraits to "pivot wider" by some combination of columns, I used "method number" to capture instances where a dataset had two rows of data for the same "entity". There are a number of datasets with species-level data with multiple rows of information for the same species. These now have sequential "method numbers", but the numbers have no inherent meaning. They are literally just repeat measurements, usually from a compilation with multiple reported measurements per species and no dates. They really aren't contexts. Just multiple "observations" in the sense of OBOE.

ehwenk commented 2 years ago

[x] There are some studies with two different "method contexts" that were stuffed into measurement remarks, and method_number was simply a "counter" to allow pivoting. This means we need to be able to recognise multiple method_context columns created within the traits section. For now I'm calling them method_context, then method_context2. I'm not sure how the script is set up right now - is there a controlled list of column names that are recognised within the traits section of the metadata file?

ehwenk commented 2 years ago

Studies to look back at:

[x] Buckton_2019 - just repeat observations - work out how to format contexts
[x] Crous_2019 - repeat observations (up to 11) - work out how to format contexts
[ ] Kew_2019_2 - work out what to do about repeat observations; original_dataset_id
[ ] Kew_2019_4 - work out what to do about repeat observations; original_dataset_id
[x] Sams_2017 - removed method number, because it was separating different entity_type values; don't think this is necessary
[x] Wright_2001 - want to discuss how to partition site/context info
[x] Stephens_2021, Roderick_1999, Choat_2012, Groom_2012, Zanne_2009, Richards_2008, all Kew studies - need to capture original_dataset_id; for many of these studies the same species has records from different datasets, so one needs to create sequential observation_id values or capture original_dataset_id as a context. But we don't want to have to list all the values.

ehwenk commented 2 years ago

[x] repeated observations. Applies to: Buckton_2019, Cooper_2013, Crous_2019, Kew_2019_2, Kew_2019_4, Lewis_2015, NSWFRD_2014, Richards_2008, Schmidt_1997, White_2020

There are a number of studies where there are repeat observations on the same entity, without any context to divide the different observations into specific groups. They are really just repeat observations across time. In some, the date column captures this, but this isn't always the case.

In order to make it possible for AusTraits to "pivot wider" by some combination of columns, I used observation number or method number to capture instances where a dataset had two rows of data for the same entity. I would group on all available contexts, sites, taxon names, individual id's and then simply add row numbers within each group. This creates sequential observation numbers, but the numbers have no inherent meaning.
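
Roughly, the grouping step looks like the sketch below. The data frame and its column names are invented for illustration; in a real dataset the grouping columns would be whatever contexts, sites, taxon names and individual ids are available.

```r
library(dplyr)

# Hypothetical data: two unlabelled repeat rows for the first species.
traits <- data.frame(
  taxon_name = c("Acacia aneura", "Acacia aneura", "Banksia serrata"),
  site_name  = c("site_1", "site_1", "site_1"),
  leaf_area  = c(10.2, 11.5, 30.1),
  stringsAsFactors = FALSE
)

traits <- traits %>%
  # Group on every column that identifies the entity.
  group_by(taxon_name, site_name) %>%
  # Row position within the group becomes the (meaningless) sequential counter.
  mutate(observation_number = row_number()) %>%
  ungroup()
```

The counter depends only on row order within each group, which is why the numbers carry no meaning beyond keeping the rows distinct when pivoting.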

For now, repeat observations are read in as a context, but it doesn't make sense to list all the values in the metadata file. They are contexts of an observationCollection in the sense of OBOE, defining distinct observations.

It would be nice not to have to list the following, where the numbers sometimes run much higher:

  repeat observations:
    var_in: observation_number
    category: observation
    values:
    - find: 1
      replace: 1
      description: 1
    - find: 2
      replace: 2
      description: 2

but instead simply:

  repeat observations:
    var_in: observation_number
    category: observation
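
One way the shorter form could be supported is to fill in the values from the data whenever none are listed. The helper below is a hypothetical sketch (`default_context_values` does not exist in austraits.build) and assumes the find, replace and description entries should all simply repeat the observed value.

```r
# Hypothetical helper: when a context property lists no `values`,
# build the find/replace/description entries from the data column itself.
default_context_values <- function(data, var_in) {
  # sort() drops NA by default, so missing observation numbers are skipped.
  found <- as.character(sort(unique(data[[var_in]])))
  data.frame(
    find        = found,
    replace     = found,
    description = found,
    stringsAsFactors = FALSE
  )
}

# Hypothetical data with repeat observations numbered 1 to 3.
data <- data.frame(observation_number = c(1, 2, 3, 1, 2, NA))
values <- default_context_values(data, "observation_number")
```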

ehwenk commented 2 years ago

[ ] original_dataset_id. Applies to: Stephens_2021, Roderick_1999, Choat_2012, Groom_2012, Zanne_2009, Richards_2008, all Kew studies

There are several compilations where authors have indicated the primary reference as a column. I generally read this into measurement remarks, which works for some studies. But in other datasets, there are a few or many species measured by multiple sources. I'd previously mutated "method numbers" for these, usually simply row numbers, ignoring the actual source. One possibility is to list the "original_dataset_id" as an "observation context". But for some of the studies there are many (50+) individual references and it doesn't make sense to have to list these all under context values.

Or do we want to find a different way to deal with these original references? Some way to capture original_dataset_id separately from dataset_id?

ehwenk commented 2 years ago

[x] I had thought there would be no need to ever specify population_id, but of course exceptions arise. In Firn_2019 and Crous_2019 (both experimental) there are "blocks" - For both we need to create population_id values that merge site x block x treatment contexts - or we need to create replicate "sites" that represent just a single "block". I prefer designating a population_id, but we should discuss.
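
A minimal sketch of the first option, assuming made-up column names and values and an arbitrary separator. The actual format of population_id is still to be decided.

```r
library(dplyr)

# Hypothetical experimental layout: one site, two blocks, two treatments.
plots <- data.frame(
  site_name = c("site_A", "site_A", "site_A", "site_A"),
  block     = c("block_1", "block_1", "block_2", "block_2"),
  treatment = c("control", "fertilised", "control", "fertilised"),
  stringsAsFactors = FALSE
)

plots <- plots %>%
  mutate(
    # Merge site x block x treatment into a single key ...
    population_key = paste(site_name, block, treatment, sep = "_x_"),
    # ... and number the distinct keys in order of first appearance.
    population_id = match(population_key, unique(population_key))
  )
```

The alternative of replicate "sites" would instead fold block into site_name and leave only treatment as a context.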

ehwenk commented 2 years ago

[ ] Richards_2008: This is a compilation of many studies, and several have contexts that apply just to them. Trying to work out how best to deal with this. It seems likely that some should be split out into their own studies.

ehwenk commented 2 years ago

New issues arising from new context code:

[x] Studies without context don't build
[x] Because we now run context functions before parsing traits, we're not picking up contexts defined in the traits section (i.e. most of the methods ones) - see Bloomfield_2018, Wright_2019
[x] If the context description is more than 1 line, and has then been split into 2 line numbers by read/write_yml, this breaks the workflow - SOLVED by adding quotes (") around entire text.
[ ] In instances where the find value is numeric, matching is (mostly) not happening to create link_IDs - see Duan_2015 as one example. .... no, the problem is broader than just numeric. Also true for Firn_2019
[ ] There is going to have to be a mechanism to use the repeat observation numbers (or similar) as part of the id's for various categories, because otherwise this information gets dropped. The id's are the only way to retain observation numbers. I think it would be easiest to simply use the value from the original column (find) as the replace value if replace is NA (i.e. doesn't exist) - ONE SOLUTION is to fill in find only, and then add this line at line 214, mutate(replace = ifelse(is.na(replace),find,replace)).
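
To show what that mutate line does in isolation, here is a small sketch on a hypothetical table of context values where only find was filled in for the repeat observation numbers. The table contents are invented.

```r
library(dplyr)

# Hypothetical context values: replace is missing for the numeric counters.
context_values <- data.frame(
  find    = c("1", "2", "summer"),
  replace = c(NA, NA, "summer observation"),
  stringsAsFactors = FALSE
)

context_values <- context_values %>%
  # Fall back to the original value wherever no replacement was supplied.
  mutate(replace = ifelse(is.na(replace), find, replace))
```

Rows that already have a replace value are left untouched, so curated contexts and bare counters can sit in the same table.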

traitecoevo / austraits.build

Some contexts reflect varying methods or time #610