traitecoevo / austraits.build

Source for AusTraits

Need better scheme for `method contexts` #654

Closed ehwenk closed 1 year ago

ehwenk commented 1 year ago

Currently, there are instances where the same trait is measured in the same dataset using multiple methods. Each of these instances has a method_context assigned that describes how the methods differ. However, this method_context is captured only in austraits$methods (all information) and austraits$traits (method_id), so austraits$methods still holds 2 entries for a given trait_name x dataset_id combination, with no indication of which row goes with which measurements. As a result, the trait measurements for those trait_name x dataset_id combinations are duplicated, with every measurement assigned all of the methods.
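
To make the duplication concrete, here is a minimal sketch in base R. The two toy tables are hypothetical: the column names follow this issue, but the dataset name, values and method descriptions are invented, and `merge()` stands in for whatever join is actually used when the methods table is merged in.

```r
# Hypothetical toy data: three measurements of one trait, taken with two methods
traits <- data.frame(
  dataset_id = "Study_2020",
  trait_name = "leaf_area",
  value      = c(10.2, 11.5, 9.8),
  method_id  = c("01", "01", "02"),
  stringsAsFactors = FALSE
)

# Two rows for the same dataset_id x trait_name, with nothing linking a row to a method_id
methods <- data.frame(
  dataset_id = "Study_2020",
  trait_name = "leaf_area",
  methods    = c("scanner", "leaf area meter"),
  stringsAsFactors = FALSE
)

# Joining on the only shared keys pairs every measurement with every method row
joined <- merge(traits, methods, by = c("dataset_id", "trait_name"))

nrow(traits)  # 3 measurements
nrow(joined)  # 6 rows after the join
```

Each of the 3 measurements appears twice in `joined`, once per method row, which is the duplication described above.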

For now we are going to standardise the methods entered and move more information to method context. This solves the duplication, but isn't a good long-term fix.

A better solution is to separate method_context from the other contextual properties, since method context is a specialisation of method, not a context of an observation. Method context would be better mapped through the methods table, but this is a much bigger fix.
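
A rough sketch of what mapping through the methods table could look like, again in base R with hypothetical toy data. It assumes the methods table carries a method_id column matching the one in the traits table; that column is the proposal here, not something the current build guarantees.

```r
# Hypothetical toy data: three measurements of one trait, taken with two methods
traits <- data.frame(
  dataset_id = "Study_2020",
  trait_name = "leaf_area",
  value      = c(10.2, 11.5, 9.8),
  method_id  = c("01", "01", "02"),
  stringsAsFactors = FALSE
)

# Assumed structure: each method row is identified by its own method_id
methods_keyed <- data.frame(
  dataset_id = "Study_2020",
  trait_name = "leaf_area",
  method_id  = c("01", "02"),
  methods    = c("scanner", "leaf area meter"),
  stringsAsFactors = FALSE
)

# With method_id in the join key, each measurement matches exactly one method row
joined <- merge(traits, methods_keyed,
                by = c("dataset_id", "trait_name", "method_id"))

nrow(joined)  # 3 rows, one per measurement
```

Because method_id is part of the key, the row count of `joined` equals the row count of `traits`, and each measurement keeps only the method it was taken with.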

Below is code, not quite finished, that adds method_ids to the methods table, although only for those method contexts that are generated in metadata[["traits"]]. It is clunky, but a possible direction to expand upon.


  # identify sources as being `primary`, `secondary` or `original`
  # secondary datasets are additional publications associated with the primary citation
  # original dataset keys are used for compilations indicating the original data sources
  citation_types <- 
    tibble::tibble(
      source_key = names(metadata$source),
      type = stringr::str_replace_all(.data$source_key, "_[:digit:]+", ""),
      source_id = metadata$source %>%
        util_list_to_df2() %>%
        dplyr::select("key")
    ) 

  source_primary_key <- metadata$source$primary$key
  source_secondary_keys <- citation_types %>% dplyr::filter(.data$type == "secondary") %>% dplyr::select("source_id") %>% as.vector() 
  source_secondary_keys <- source_secondary_keys$source_id$key %>% as.vector()
  source_original_dataset_keys <- citation_types %>% dplyr::filter(.data$type == "original") %>% dplyr::select("source_id") %>% as.vector()
  source_original_dataset_keys <- source_original_dataset_keys$source_id$key %>% as.vector()

  # combine collectors to add into the methods table
  collectors_tmp <-
    stringr::str_c(contributors$given_name, " ",
                   contributors$last_name,
                   ifelse(!is.na(contributors$additional_role),
                          paste0(" (", contributors$additional_role, ")"),
                          ""))  %>% paste(collapse = ", ")

  # contexts that are of category = method
  # "XX" is a placeholder used when a dataset has no method contexts
  if(nrow(contexts) > 0 && any(contexts$category == "method", na.rm = TRUE)) {
    method_contexts_tmp <-
      contexts %>%
        filter(category == "method") %>% 
        distinct(var_in)
  } else {
    method_contexts_tmp <- tibble::tibble(var_in = "XX")
  }

  # identify which method contexts come from metadata[["traits"]]
  # dplyr::any_of() accepts the whole character vector, so every method context
  # variable is checked here, not just method_contexts_tmp$var_in[[1]]

  if(nrow(contexts) > 0) { 
    traits_columns <-  metadata[["traits"]] %>%
      util_list_to_df2() %>% dplyr::select(dplyr::any_of(method_contexts_tmp$var_in))
  } else {
    traits_columns <- tibble::tibble()
  }

  # TRUE/FALSE variable, which is TRUE if there are method contexts keyed in through metadata[["traits"]] and FALSE if all method contexts come from columns
  # if FALSE, then there is no need for those method_ids to be in the methods table, and in fact they can't be in the methods table, because they are likely
  # to vary across rows of data, not simply across trait entries

  traits_columns_tmp <- !is.null(traits_columns) && ncol(traits_columns) > 0

  methods <-
    dplyr::full_join( by = "dataset_id",
      # methods used to collect each trait
      metadata[["traits"]] %>%
        util_list_to_df2() %>%
        dplyr::filter(!is.na(.data$trait_name)) %>%
        dplyr::mutate(dataset_id = dataset_id) %>%
  #      dplyr::select(dataset_id, .data$trait_name, .data$methods, dplyr::any_of("method_context"))
        dplyr::select("dataset_id", "trait_name", "methods", dplyr::any_of(method_contexts_tmp$var_in))
      ,
      # study methods
      metadata$dataset %>%
        util_list_to_df1() %>%
        tidyr::spread(.data$key, .data$value) %>%
        dplyr::select(dplyr::any_of(names(metadata$dataset))) %>%
        dplyr::mutate(dataset_id = dataset_id) %>%
        dplyr::select(-dplyr::any_of(c("original_file", "notes", "data_is_long_format", "taxon_name", 
                                         "trait_name", "population_id", "individual_id",
                                         "location_name", "source_id", "value", "entity_type", 
                                         "collection_date", "custom_R_code", "replicates", "measurement_remarks",
                                         "taxon_name", "basis_of_value", "basis_of_record", "life_stage")))
      )  %>%
      full_join( by = "dataset_id",
      # references
        tibble::tibble(
          dataset_id = dataset_id,
          source_primary_key = source_primary_key,
          source_primary_citation = bib_print(sources[[source_primary_key]]),
          source_secondary_key = source_secondary_keys %>% paste(collapse = "; "),
          source_secondary_citation = ifelse(length(source_secondary_keys) == 0, NA_character_,
            purrr::map_chr(sources[source_secondary_keys], bib_print) %>% paste(collapse = "; ") %>%
              stringr::str_replace_all("\\.;", ";")
            ),                    
          source_original_dataset_key = source_original_dataset_keys %>% paste(collapse = "; "),
          source_original_dataset_citation = ifelse(length(source_original_dataset_keys) == 0, NA_character_,
            purrr::map_chr(sources[source_original_dataset_keys], bib_print) %>% paste(collapse = "; ") %>%
            stringr::str_replace_all("\\.;", ";")
          )
        )
      ) %>%
    dplyr::mutate(
      data_collectors = collectors_tmp,
      assistants = ifelse(is.null(metadata$contributors$assistants), NA_character_,
                                      metadata$contributors$assistants),
      austraits_curators = metadata$contributors$austraits_curators
    )

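  # keep only the method contexts whose variable is a column of metadata[["traits"]];
  # compare against the column names, not against the tibble itself
  method_contexts <- 
    context_ids$contexts %>%
      filter(category == "method", link_id == "method_id", var_in %in% names(traits_columns)) %>%
      rename(method_context = value, method_id = link_vals) %>%
      select(method_context, method_id)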

  if(nrow(method_contexts) > 0 ) {
    methods <-
      methods %>%
      left_join(by = "method_context", method_contexts) %>% 
      select(c("dataset_id", "trait_name", "method_id", "method_context", "methods"), everything())
  } else {
    methods <-
      methods %>%
      mutate(method_id = NA_character_)
  }

  methods %>% 
  select(-any_of(c("method_context"))) %>%
  select(c("dataset_id", "trait_name", "method_id", "methods"), everything())
}

ehwenk commented 1 year ago

The following studies currently have much of their methods moved to method contexts for traits that have 2 entries, to avoid duplication:

dfalster commented 1 year ago

@ehwenk has this been addressed now?

ehwenk commented 1 year ago

This is still an issue that needs to be solved and is the trickiest of the issues I know of, because it requires method context to be more separated from other contexts. The current fix is not really satisfactory, not least because it would be easy to enter a trait twice with different methods and not realise why rows of data are being duplicated when you merge in the methods table.
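
One stopgap would be a check that flags these cases before anyone merges in the methods table. This is only a sketch: `find_ambiguous_methods()` is a hypothetical helper, not an existing austraits.build function, and the toy methods table is invented for illustration.

```r
# Hypothetical helper: list dataset_id x trait_name combinations
# that have more than one row in a methods table
find_ambiguous_methods <- function(methods) {
  key_cols <- c("dataset_id", "trait_name")
  keys <- methods[key_cols]
  # flag every row whose key occurs more than once, including the first occurrence
  dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
  unique(methods[dup, key_cols, drop = FALSE])
}

# Invented example: one study with two method rows for a trait, one study with a single row
methods <- data.frame(
  dataset_id = c("Study_2020", "Study_2020", "Study_2021"),
  trait_name = c("leaf_area", "leaf_area", "leaf_area"),
  methods    = c("scanner", "leaf area meter", "scanner"),
  stringsAsFactors = FALSE
)

ambiguous <- find_ambiguous_methods(methods)

if (nrow(ambiguous) > 0) {
  message("More than one method row for: ",
          paste(ambiguous$dataset_id, ambiguous$trait_name, collapse = "; "))
}
```

Any combination this returns would duplicate trait rows on a join by dataset_id and trait_name, so it could be reported to whoever is entering the metadata.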

yangsophieee commented 1 year ago

Moving issue to traits.build.