Errors reading in `entity_type`, `value_type` from columns

traitecoevo / austraits.build

Source for AusTraits

Errors reading in `entity_type`, `value_type` from columns #725

Closed ehwenk closed 1 year ago

ehwenk commented 1 year ago

`value_type` should be readable from a column, but this currently isn't working for AusTraits or for AusInverTraits.
It seems we added an exception for `entity_type` at line 1066 of process.R, which doesn't make sense; but when I remove this exception, `entity_type` still isn't read in from a column.
See also issue #674, which is about adding "unit_in" to the list of fields that can be read in. There are datasets where different rows of data have different units for a trait, especially scraped morphology datasets (AusTraits & AusInverTraits). At the moment custom_R_code is being used to align units, but this is clunky.

ehwenk commented 1 year ago

I've worked out why we excluded `entity_type` from the columns to be read in. The allowable values for `entity_type` specified in the schema are individual, population, metapopulation, species, genus, family, and order. Many studies have columns with these same names, so the values in such a column end up being used as the `entity_type` instead of the fixed value. Therefore, we need to specify that `entity_type` can only be read in from a column whose name is NOT one of the allowable `entity_type` values specified in the schema.
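A minimal sketch of the rule described above, assuming the metadata supplies a single string that is either a fixed value or a column name. `resolve_field` and the example data are hypothetical and are not the code in process.R; only the list of allowable values comes from the schema as quoted above.

```r
# Allowable entity_type values, as listed in the schema.
allowed_entity_types <- c("individual", "population", "metapopulation",
                          "species", "genus", "family", "order")

# Hypothetical helper: decide whether `value` names a column to read from
# or is a fixed value to apply to every row.
resolve_field <- function(data, value, reserved = allowed_entity_types) {
  # Treat `value` as a column reference only if such a column exists
  # AND its name is not itself one of the reserved fixed values.
  if (value %in% names(data) && !(value %in% reserved)) {
    return(as.character(data[[value]]))
  }
  rep(value, nrow(data))
}

# Example data with a column called "species", which is also an allowable value.
data <- data.frame(
  species = c("Acacia aneura", "Banksia serrata"),
  level = c("individual", "population"),
  stringsAsFactors = FALSE
)

fixed <- resolve_field(data, "species")     # fixed value; the "species" column is ignored
from_column <- resolve_field(data, "level") # read row by row from the "level" column
```

With this check, a dataset that has a `species` column and sets `entity_type: species` gets the fixed value on every row rather than the taxon names.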

ehwenk commented 1 year ago

With `value_type`, the instances that weren't reading in correctly were long datasets where `value_type` was specified for each trait rather than once for the dataset. When the column was instead specified at the dataset level, it worked properly. It seems like it should work either way, so that the values for a single trait can be read in from a column? But I realise this might be hard with long datasets, and for now the problem is solved.
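To illustrate what per-trait column lookup in a long dataset might look like, here is a hedged sketch. The data frame, the `trait_config` list, and `resolve_value_type` are all hypothetical and do not reflect how the package stores or processes trait settings.

```r
# Hypothetical long-format data: one row per measurement.
long_data <- data.frame(
  trait = c("leaf_length", "leaf_length", "plant_height"),
  value = c("12", "30", "2.5"),
  type_col = c("minimum", "maximum", NA),
  stringsAsFactors = FALSE
)

# Hypothetical per-trait settings: value_type is either a column name
# (leaf_length) or a fixed value (plant_height).
trait_config <- list(
  leaf_length = list(value_type = "type_col"),
  plant_height = list(value_type = "mean")
)

# Resolve value_type separately for the rows belonging to each trait.
resolve_value_type <- function(data, trait_config, trait_col = "trait") {
  out <- rep(NA_character_, nrow(data))
  for (trait in names(trait_config)) {
    rows <- data[[trait_col]] == trait
    setting <- trait_config[[trait]]$value_type
    if (setting %in% names(data)) {
      # The setting names a column: copy that column's values for these rows.
      out[rows] <- as.character(data[[setting]][rows])
    } else {
      # Otherwise treat the setting as a fixed value.
      out[rows] <- setting
    }
  }
  out
}

long_data$value_type <- resolve_value_type(long_data, trait_config)
```

The point of the sketch is that the column lookup happens on the subset of rows for each trait, rather than once for the whole dataset.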

ehwenk commented 1 year ago

And while the problems are solved for AusTraits, AusInverTraits is still having problems with some of these fields. I'll check whether this is because all their data are in long format.

dfalster commented 1 year ago

Ok, this commit should ensure that entity_type can only be read in from a column that is NOT one of the allowable entity_type or value_type values specified in schema.

To test

```r
# Load the package and the project-level configuration
devtools::load_all()
source("scripts/custom.R")
resource_metadata <- get_schema("config/metadata.yml", "metadata")
definitions <- get_schema("config/traits.yml", "traits")
unit_conversions <- get_unit_conversions("config/unit_conversions.csv")
taxon_list <- read_csv_char("config/taxon_list.csv")
schema <- get_schema()

# Configure and process a single dataset
v <- "Brock_1993"
config <- dataset_configure(file.path("data", v, "metadata.yml"), definitions, unit_conversions)
raw <- dataset_process(file.path("data", v, "data.csv"), config, schema, resource_metadata)

# Inspect the resulting traits table
raw$traits
```

[Before and after views of `raw$traits` were not preserved.]

dfalster commented 1 year ago

@ehwenk also pointed out that, for datasets in long format, we might want to read in columns of data for fields like `entity_type` and `value_type`, and that this currently isn't possible.

dfalster commented 1 year ago

Moved to https://github.com/traitecoevo/traits.build/issues/6