Instances when a `trait_name` is read in multiple times in the metadata.yml file

ehwenk commented 2 years ago

There are a handful of studies where multiple columns of data are mapped onto the same trait_name. This is a problem any time you want to take AusTraits data and spread/pivot_wider it, because entries are duplicated that "shouldn't be" (e.g. observation_id, methods for a single trait_name x taxon_name x dataset_id). The affected dataset_id x trait_name combinations can be listed with:

austraits$methods %>% select(dataset_id, trait_name) %>% group_by(dataset_id, trait_name) %>% count() %>% filter(n>1)
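To make the pivoting problem concrete, here is a minimal sketch on made-up data. The tibble `traits_toy`, its column values and the dataset name `Study_A` are hypothetical and are not taken from AusTraits; only the counting pattern mirrors the query above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: leaf_length appears twice for the same observation,
# mimicking two metadata entries mapped onto one trait_name.
traits_toy <- tibble(
  dataset_id     = c("Study_A", "Study_A", "Study_A"),
  observation_id = c("Study_A_001", "Study_A_001", "Study_A_001"),
  trait_name     = c("leaf_length", "leaf_length", "leaf_width"),
  value          = c(10, 25, 4)
)

# Same idea as the query above: find key combinations that occur more than once.
duplicated_traits <- traits_toy %>%
  count(dataset_id, observation_id, trait_name) %>%
  filter(n > 1)

# Pivoting wider cannot place two values in one cell, so tidyr warns and
# falls back to list-columns instead of plain numeric columns.
wide <- suppressWarnings(
  traits_toy %>%
    pivot_wider(names_from = trait_name, values_from = value)
)
```

In this sketch `duplicated_traits` flags the repeated leaf_length entry, and `wide$leaf_length` ends up as a list-column holding both values, which is the symptom described above.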

In some instances, this is because the same individual has been measured at different points in time and the repeated measurements are presented as different columns in a wide data format. These files will be made "longer" once we have individual_id to link multiple observation_id values on the same individual (for example, Geange_2017; see issue #510).
In other instances, the same trait is measured using two different methods, with data presented across multiple columns (e.g. Wright_2006, where Huber value was measured for different branch lengths). We'd been ambivalent about creating a measurement_id, but this suggests it would be useful. The measurement_id should simply be a combination of observation_id, trait_name and a sequential counter. In this case: Wright_2006_001_huber_value_measurement1 vs Wright_2006_001_huber_value_measurement2.
Third, there are many cases with separate entries for expert_min vs expert_max values. For example, in Barlow_1981:
```
unit_in: mm
trait_name: leaf_length
value_type: expert_min
replicates: .na
methods: unknown
```
```
var_in: leaf length maximum
unit_in: mm
trait_name: leaf_length
value_type: expert_max
replicates: .na
methods: unknown
```

This study, like the vast majority of the others, is a flora or taxonomic treatment where the leaf, seed, etc. dimensions have been programmatically split into a minimum and a maximum that are read in as separate entries for the same trait. I'm not sure what the best solution is here, because it depends on the use case.
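The measurement_id suggested above for Wright_2006 (observation_id, trait_name and a sequential counter) could be built roughly as follows. This is only a sketch: the `measurements` tibble and its values are hypothetical, and only the ID pattern comes from the example in this comment.

```r
library(dplyr)

# Hypothetical rows: two Huber value measurements on one observation,
# plus a single measurement on another.
measurements <- tibble(
  observation_id = c("Wright_2006_001", "Wright_2006_001", "Wright_2006_002"),
  trait_name     = c("huber_value", "huber_value", "huber_value"),
  value          = c(0.00021, 0.00034, 0.00027)
)

# Number the rows within each observation_id x trait_name group and
# paste the pieces together into the proposed ID.
measurements <- measurements %>%
  group_by(observation_id, trait_name) %>%
  mutate(
    measurement_id = paste0(
      observation_id, "_", trait_name, "_measurement", row_number()
    )
  ) %>%
  ungroup()
```

Because the counter restarts within each group, observations with a single measurement still get a well-formed ID ending in `measurement1`.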

ehwenk commented 2 years ago

Two separate enhancements to the AusTraits workflow will address this issue.

For circumstances where there is currently a min and max, the data will be recoded as a range.
For circumstances where a trait is being measured using two different methods, we will code them as method_1 and method_2.
For circumstances where two sets of trait measurements are made on an entity (i.e. repeat measurements), we will code them as observation_1 and observation_2.
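As an illustration of the first point, the sketch below collapses an expert_min and an expert_max row into a single range row. It is one possible reading, not the format AusTraits adopted: the `flora` tibble, the taxon name, the `"min--max"` string layout and the value_type label `"range"` are all assumptions made for this example.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format rows for one taxon in a flora.
flora <- tibble(
  taxon_name = c("Hypothetical species", "Hypothetical species"),
  trait_name = c("leaf_length", "leaf_length"),
  value_type = c("expert_min", "expert_max"),
  value      = c(12, 40)
)

# Put the minimum and maximum side by side, then merge them into one row.
# The "min--max" string and the "range" label are assumptions.
flora_range <- flora %>%
  pivot_wider(names_from = value_type, values_from = value) %>%
  mutate(
    value      = paste0(expert_min, "--", expert_max),
    value_type = "range"
  ) %>%
  select(taxon_name, trait_name, value_type, value)
```

After this step each taxon_name x trait_name combination has a single row, so it no longer produces duplicates when the data are pivoted wider.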

ehwenk commented 2 years ago

closed with 4dab6ab

traitecoevo / austraits.build

Instances when a `trait_name` is read in multiple times in the metadata.yml file #567