A handful of studies have multiple columns of data mapped onto the same trait_name. This causes a problem any time you want to take AusTraits data and spread/pivot_wider it, because rows are duplicated where they "shouldn't be" (e.g. the same observation_id, or methods for a single trait_name x taxon_name x dataset_id).
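To make the problem concrete, here is a minimal sketch on a toy table. The column names follow AusTraits, but the values and the observation_id are invented for illustration. When two rows share the same identifying columns and trait_name, tidyr's pivot_wider() cannot fit them into a single cell, so it warns and returns list-columns.

```r
library(dplyr)
library(tidyr)

# Toy data (invented values): one observation where two source columns
# were mapped onto the same trait_name.
traits <- tibble(
  dataset_id     = "Wright_2006",
  observation_id = "Wright_2006_001",
  trait_name     = c("huber_value", "huber_value", "leaf_area"),
  value          = c(0.0004, 0.0006, 1250)
)

# Count rows per identifying combination; anything with n > 1
# will not pivot cleanly.
duplicated_traits <- traits %>%
  count(dataset_id, observation_id, trait_name) %>%
  filter(n > 1)

# pivot_wider() warns that values are not uniquely identified and
# returns list-columns instead of plain numeric columns.
wide <- traits %>%
  pivot_wider(names_from = trait_name, values_from = value)
```

Here `duplicated_traits` flags the huber_value rows, and `wide$huber_value` is a list holding both measurements rather than a single number.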
In some instances, this is because the same individual has been measured at different points in time and the replicated measurements are presented as different columns in a wide data format. These files will be made "longer" once we have individual_id to link multiple observation_id values on the same individual (for example, Geange_2017; see issue #510).
In other instances, the same trait is measured using two different methods, with data presented across multiple columns (e.g. Wright_2006, where Huber value was measured for different branch lengths). We'd been ambivalent about creating a measurement_id, but this suggests it would be useful. The measurement_id should simply be a combination of observation_id, trait_name, and a sequential counter. In this case: Wright_2006_001_huber_value_measurement1 vs Wright_2006_001_huber_value_measurement2.
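A minimal sketch of how such a measurement_id could be built with dplyr, assuming a long table that already has observation_id and trait_name columns. The toy values are invented, and the `_measurement` suffix simply mirrors the example above; none of this is an existing AusTraits function.

```r
library(dplyr)

# Toy data (invented values): the first observation has two
# huber_value measurements, the second has one.
traits <- tibble(
  observation_id = c("Wright_2006_001", "Wright_2006_001", "Wright_2006_002"),
  trait_name     = "huber_value",
  value          = c(0.0004, 0.0006, 0.0005)
)

# Number the rows within each observation_id x trait_name group and
# paste the counter onto the two identifiers.
traits_with_id <- traits %>%
  group_by(observation_id, trait_name) %>%
  mutate(
    measurement_id = paste0(
      observation_id, "_", trait_name, "_measurement", row_number()
    )
  ) %>%
  ungroup()
```

Because the counter restarts within each group, every row gets a unique measurement_id, so the table can be pivoted on that column without collisions.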
Third, there are many cases where there are separate entries for expert_min vs expert_max values. For example, in Barlow_1981:
```
unit_in: mm
trait_name: leaf_length
value_type: expert_min
replicates: .na
methods: unknown

var_in: leaf length maximum
unit_in: mm
trait_name: leaf_length
value_type: expert_max
replicates: .na
methods: unknown
```
This study, like the vast majority of the others, is a flora or taxonomic treatment where the leaf, seed, etc. dimensions have been programmatically deconstructed into a minimum and a maximum and are read in as separate traits. I'm not sure what the best solution is here, because it depends on the use case.
Two separate enhancements to the AusTraits workflow will address this issue.
For circumstances where there is currently a min and max, the data will be recoded as a range.
For circumstances where a trait is being measured using two different methods, we will code them as method_1 and method_2.
For circumstances where two sets of trait measurements are made on an entity (i.e. repeat measurements), we will code them as observation_1 and observation_2.
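As a rough illustration of the first case, the sketch below collapses an expert_min / expert_max pair into a single range row. The taxon name, the values, the `--` separator and the `range` value_type label are all assumptions made for this example; the actual encoding is not specified here.

```r
library(dplyr)
library(tidyr)

# Toy data (invented values and placeholder taxon name).
traits <- tibble(
  dataset_id = "Barlow_1981",
  taxon_name = "Taxon A",
  trait_name = "leaf_length",
  value_type = c("expert_min", "expert_max"),
  value      = c("20", "60"),
  unit       = "mm"
)

# Put min and max side by side, then paste them into one range value.
ranges <- traits %>%
  pivot_wider(names_from = value_type, values_from = value) %>%
  mutate(
    value      = paste(expert_min, expert_max, sep = "--"),  # assumed separator
    value_type = "range"                                     # hypothetical label
  ) %>%
  select(-expert_min, -expert_max)
```

After this step there is one row per dataset_id x taxon_name x trait_name, so the min/max pair no longer collides when the data are pivoted wider.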
The affected dataset_id x trait_name combinations can be listed with: `austraits$methods %>% select(dataset_id, trait_name) %>% group_by(dataset_id, trait_name) %>% count() %>% filter(n > 1)`