traitecoevo / austraits.build

Source for AusTraits

Some contexts reflect varying methods or time #610

Closed ehwenk closed 2 years ago

ehwenk commented 2 years ago

One of the more complex parts of my entity_id coding is how to partition context information to population_id, vs contextual properties that broadly reflect shifts in time or variations in sampling method. My scheme works, but it is fussy to implement for studies where unique combinations of context_name and site_name are not the basis of creating population_id values.

Most contextual properties capture experimental treatments or some aspect of population that isn't reflected in site (most often stratified sampling at each site, so that different "populations" of plants fall under different contexts). However, a number of contexts represent time (sampling season) or some component of method (old vs young leaves; sun vs shade leaves).

I have started remapping any context that is effectively a sampling strategy or method to method_number matched with information in measurement_remarks. It is much less elegant than having a context property (e.g. leaf position) and the different values (e.g. canopy, understory) mapped as contexts, but simpler for building population_id. Much harder is how to treat contexts that reflect shifts in time, because it generally isn't the same individuals being resampled. And measurement_remarks is meant to pair with method information, not time information.

We should brainstorm some more about this, but one option is to add a code alongside each listed context property that indicates whether the property captures different populations, treatments, times or methods. The first two would indicate variation that needs to be captured in population_id, while time-related context properties would auto-generate a sequential observation_number, and method-related properties a sequential method_number. Then context could continue to capture all curated contextual information.
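As a rough illustration of the category-code idea, here is a minimal Python sketch (the metadata mentions custom_R_code, so the real build presumably happens in R; Python is used only to show the logic). The `CONTEXT_CATEGORIES` table, the property names, and the `assign_ids` helper are all hypothetical, and numbering by order of first appearance is an assumption rather than anything decided in this thread.

```python
# Hypothetical category code for each context property (illustrative names only).
CONTEXT_CATEGORIES = {
    "fire history": "population",
    "CO2 treatment": "treatment",
    "sampling season": "time",
    "leaf age": "method",
}


def _number(key, seen):
    """Return a 1-based number for key, assigned in order of first appearance."""
    return seen.setdefault(key, len(seen) + 1)


def _key(row, categories, wanted):
    """Collect (property, value) pairs for properties whose category is wanted."""
    return tuple(
        (prop, row.get(prop))
        for prop, cat in sorted(categories.items())
        if cat in wanted
    )


def assign_ids(rows, categories=CONTEXT_CATEGORIES):
    """Route each context property to the identifier its category implies."""
    populations, observations, methods = {}, {}, {}
    out = []
    for row in rows:
        new = dict(row)
        # Population and treatment contexts, together with site, define population_id.
        pop_key = (row["site_name"],) + _key(row, categories, ("population", "treatment"))
        new["population_id"] = _number(pop_key, populations)
        # Time contexts generate observation_number; method contexts generate method_number.
        new["observation_number"] = _number(_key(row, categories, ("time",)), observations)
        new["method_number"] = _number(_key(row, categories, ("method",)), methods)
        out.append(new)
    return out
```

With this layout the context table could keep every curated property, while only the category code decides which identifier a property feeds into.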

ehwenk commented 2 years ago

@dfalster

Talking this through and looking back at the ontology for our database structure, it is clear that context as defined represents contexts that apply to different entities (species, population, individual, point-in-time) and also at the measurement-level. I managed to partially unwind this within individual metadata files by adding extra temporary identifiers. But this is neither a clear solution nor one that is sufficiently transparent for others to implement easily. I therefore recommend that we split context into 4 sections, reflecting types of context:

location_contexts: Contexts that are linked to individuals that represent a sub-population within a site_name. For instance, there are individuals with different seed provenance, vegetation type, natural fire history, plant sex.

treatment_contexts: All contexts reflecting explicit, researcher-enacted experimental treatments across individuals will be captured under this context. These include temperature, CO2, and light manipulations.

observation_contexts: To capture repeat measurements made on a single individual or at a population across time. Such measurements are most often linked to sampling season or point in a plant's life cycle, but in some experimental datasets, it could be repeat measurements at different times of day or different points in a watering cycle.

measurement_contexts: To capture multiple measurements made as part of a single observation, such as measurements at different canopy positions, on different leaf ages, on different branch thicknesses, or using subtly different methods.

location_contexts and treatment_contexts are both contexts that are alternatively linked to individuals (if that is the entity_type) or populations (if that is the entity_type). The format for these will remain the same as before, with invented context_names that reflect a merging of all factorial context combinations, and a linked table with details about the individual properties. (NOTE: these could remain merged - and maybe should - let's discuss pros & cons of each approach.)
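To make the "merged context_name plus linked table" format concrete, here is a small Python sketch. It assumes a full factorial design (real datasets may only contain some combinations), and the function name, the underscore-joined naming scheme, and the example values are all hypothetical.

```python
from itertools import product


def build_context_names(properties):
    """Merge factorial combinations of context properties into context_names.

    properties maps each context property to its list of values.
    Returns (names, details): the invented context_name strings, and a linked
    table with one row per (context_name, property, value).
    """
    props = list(properties)
    names, details = [], []
    for combo in product(*(properties[p] for p in props)):
        # Invented name: the values of one combination joined with underscores.
        name = "_".join(str(v) for v in combo)
        names.append(name)
        details.extend(
            {"context_name": name, "context_property": p, "value": v}
            for p, v in zip(props, combo)
        )
    return names, details


# Illustrative values only, not taken from any AusTraits dataset.
names, details = build_context_names(
    {"temperature": ["ambient", "warmed"], "CO2": ["400ppm", "640ppm"]}
)
```

Keeping location and treatment properties in separate calls, or passing them together, corresponds to the split-versus-merged choice raised in the note above.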

observation_contexts: These will be identified by sequential numbers (observation_number) that are assigned separately via a column, custom_R_code, or using the observation_number field linked to individual traits. Because sometimes observation numbers are sequential columns and other times rows of data, this flexibility is required. Then there will be a field within dataset called observation_contexts where information will be documented in this way:

observation_contexts: 
  1: summer observation
  2: winter observation

--or--

observation_contexts:
  1: Well-watered plant measurement; Measurements made on the morning following a watering event (wet cycle) when the plants were at their least water-limited.
  2: Droughted plant measurement; Measurements made on the final day of a watering cycle (dry cycle) when the low water plants were at the driest point in the cycle.

method_contexts: These will be identified by sequential numbers (method_number) that are assigned separately via a column, custom_R_code, or using the method_number field linked to individual traits. They will be documented as per observation_contexts:

method_contexts:
  1: fresh leaves (indicating amount of leaf moisture)
  2: oven-dried leaves (indicating amount of leaf moisture)
  3: senesced leaves (indicating amount of leaf moisture)

-- or --

method_contexts:
  1: Measurements made on young leaves. 
  2: Measurements made on expanding leaves. 
  3: Measurements made on old leaves. 
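One way the numbered observation_contexts and method_contexts lookups might be attached to data rows is sketched below in Python. The dictionaries mirror the examples above; the `describe_row` helper and the output column names are hypothetical, and returning `None` for rows without a number is an assumption.

```python
# Lookups copied from the examples above, keyed by sequential number.
OBSERVATION_CONTEXTS = {
    1: "summer observation",
    2: "winter observation",
}
METHOD_CONTEXTS = {
    1: "Measurements made on young leaves.",
    2: "Measurements made on expanding leaves.",
    3: "Measurements made on old leaves.",
}


def describe_row(row, observation_contexts, method_contexts):
    """Return a copy of row with the context descriptions attached.

    Rows lacking a number (or carrying an unknown one) get None, so they
    remain easy to find later.
    """
    return {
        **row,
        "observation_context": observation_contexts.get(row.get("observation_number")),
        "method_context": method_contexts.get(row.get("method_number")),
    }


row = {"taxon_name": "Acacia sp.", "observation_number": 2, "method_number": 1}
described = describe_row(row, OBSERVATION_CONTEXTS, METHOD_CONTEXTS)
```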

Happy to think through other solutions! (Will post a separate issue for unwinding entity_id)

ehwenk commented 2 years ago

I'm not quite sure how to code in information for contexts where some rows lack a value. For now, I'm doing the following for NA's, simply so we can find them quickly:

    - find: .na
      replace: .na
      description: .na
ehwenk commented 2 years ago

In order to make it possible for austraits to "pivot wider" by some combination of columns, I used "method number" to capture instances where a dataset had two rows of data for the same "entity". There are a number of datasets with species-level data with multiple rows of information for the same species. These now have sequential "method numbers", but the numbers have no inherent meaning. They are literally just repeat measurements, usually from a compilation with multiple reported measurements per species and no dates. They really aren't contexts. Just multiple "observations" in the sense of OBOE.

ehwenk commented 2 years ago

Studies to look back at:

ehwenk commented 2 years ago

There are a number of studies where there are repeat observations on the same entity, without any context to divide the different observations into specific groups. They are really just repeat observations across time. In some, the date column captures this, but this isn't always the case.

In order to make it possible for AusTraits to "pivot wider" by some combination of columns, I used observation number or method number to capture instances where a dataset had two rows of data for the same entity. I grouped on all available contexts, sites, taxon names and individual IDs, and then simply added row numbers within each group. This creates sequential observation numbers, but the numbers have no inherent meaning.
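The group-then-number step described above can be sketched in a few lines of Python (shown only to illustrate the logic; the function and the column names in the example are hypothetical):

```python
from collections import defaultdict


def add_observation_number(rows, group_cols):
    """Number rows sequentially within each group defined by group_cols.

    The numbers only distinguish repeat rows for the same entity so that the
    data can later be pivoted wider; they carry no other meaning.
    """
    counts = defaultdict(int)
    out = []
    for row in rows:
        key = tuple(row.get(col) for col in group_cols)
        counts[key] += 1
        out.append({**row, "observation_number": counts[key]})
    return out


rows = [
    {"taxon_name": "A", "site_name": "X", "value": 1.2},
    {"taxon_name": "A", "site_name": "X", "value": 1.5},
    {"taxon_name": "B", "site_name": "X", "value": 0.9},
]
numbered = add_observation_number(rows, ["taxon_name", "site_name"])
```

After this step, the grouping columns plus observation_number identify each row uniquely, which is what a wide pivot needs.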

For now, repeat observations are read in as a context, but it doesn't make sense to list all the values in the metadata file. They are contexts of an observationCollection in the sense of OBOE, defining distinct observations.

It would be nice to not have to list the following, sometimes up to much higher numbers:

  repeat observations:
    var_in: observation_number
    category: observation
    values:
    - find: 1
      replace: 1
      description: 1
    - find: 2
      replace: 2
      description: 2

but instead simply:

  repeat observations:
    var_in: observation_number
    category: observation
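
A minimal Python sketch of how the shorter form might be expanded at build time: when a context has no values, the find/replace/description entries are generated from the unique entries found in the var_in column. The `expand_context_values` function is hypothetical, and it assumes the column holds values of a single sortable type.

```python
def expand_context_values(context, data_rows):
    """Fill in `values` for a context that only declares var_in and category.

    context is one parsed metadata entry; data_rows is the dataset as a list
    of dicts. Contexts that already list values are returned unchanged.
    """
    if context.get("values"):
        return context
    seen = []
    for row in data_rows:
        value = row.get(context["var_in"])
        if value is not None and value not in seen:
            seen.append(value)
    # Assumption: values are mutually comparable (e.g. all integers).
    values = [{"find": v, "replace": v, "description": v} for v in sorted(seen)]
    return {**context, "values": values}


context = {"var_in": "observation_number", "category": "observation"}
data_rows = [
    {"observation_number": 2},
    {"observation_number": 1},
    {"observation_number": 2},
]
expanded = expand_context_values(context, data_rows)
```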
ehwenk commented 2 years ago

There are several compilations where authors have indicated the primary reference as a column. I generally read this into measurement remarks, which works for some studies. But in other datasets, some or many species are measured by multiple sources. I'd previously mutated "method numbers" for these, usually simply row numbers, ignoring the actual source. One possibility is to list the "original_dataset_id" as an "observation context". But for some of the studies there are many (50+) individual references and it doesn't make sense to have to list these all under context values.

Or do we want to find a different way to deal with these original references, some way to capture original_dataset_id separately from dataset_id?

ehwenk commented 2 years ago

New issues arising from new context code: