Open Rekyt opened 2 months ago
@Rekyt Thank you for documenting these! I'll make the changes on a branch tomorrow.
It is sometimes intentional to have multiple close matches within a single cell in the csv files. The code that builds the formal ontology splits those into multiple lines, each assigned as an example of type "close_match". But I'll double-check the one you mentioned to make sure nothing else is wrong.
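To illustrate the splitting described above, here is a minimal base-R sketch. It is not the actual ontology-building code: the function name `split_matches`, the trait identifier, and the cell content are all hypothetical, and the `;` separator is an assumption taken from the matching script in this issue.

```r
# Hypothetical helper: turn one csv cell holding several matches
# into one row per match, all with the same match type.
split_matches <- function(trait_id, cell, match_type = "close_match") {
  # Assumes matches within a cell are separated by ";"
  parts <- trimws(strsplit(cell, ";", fixed = TRUE)[[1]])
  parts <- parts[nzchar(parts)]
  data.frame(
    trait_id = rep(trait_id, length(parts)),
    match_type = rep(match_type, length(parts)),
    matched_trait = parts,
    stringsAsFactors = FALSE
  )
}

# Made-up identifier and codes, for illustration only
example_rows <- split_matches(
  "trait_example",
  "example_a [GIFT:9.9.1]; example_b [GIFT:9.9.2]"
)
print(example_rows)
```

Each match ends up on its own row with the same `trait_id`, which is the shape the checks in the issue rely on after unnesting.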
Hi @ehwenk & @dfalster,
while building the trait correspondence network, I noticed some issues in the correspondence with GIFT traits (I still have to do the same for BIEN and TRY). I checked that the GIFT trait codes and trait names provided by AusTraits exist in GIFT, and that each provided GIFT trait name matches its provided GIFT trait code.
The script I used is below. But I'll first detail my findings.
- `trait_0030020` & `trait_0030015`: the `GIFT_close` entry contains multiple traits on a single line. Is it on purpose? Other matched traits span multiple lines.
- `trait_0030215`: there is a typo in the `GIFT_exact` name, as it is referenced as "Fuiting time", missing an "r".
- `trait_0030020`: there is a typo in the GIFT code 'leaf_thorns_1 [GIFT:4:14.1]', which should be 'leaf_thorns_1 [GIFT:4.14.1]'.
- `trait_0030060`: `GIFT_close` matches with GIFT 1.4.1 (Climber_1) while it should match with GIFT 3.4.1 (Reproduction_sexual_1). This was the trait that triggered my systematic search for potential mismatches, as I obtained in the correspondence network a much larger connected component than expected, with traits that shouldn't be matching.

Maybe you could use an adaptation of the script below to perform semi-automated quality checks when updating the APD?
For the sake of completeness, I'll try performing the same checks for TRY and BIEN.
Matching script
```r
library("dplyr")

gift_trait_meta = GIFT::GIFT_traits_meta()

apd_gift_detailed = tibble::as_tibble(read.csv("APD_traits_input.csv")) |>
  select(identifier:label, starts_with("GIFT")) |>
  rename(trait_id = identifier) |>
  tidyr::pivot_longer(
    starts_with("GIFT"),
    names_to = "match_type",
    values_to = "matched_trait"
  ) |>
  filter(matched_trait != "") |>
  mutate(
    # Split for traits that have multiple matches on one line
    split_traits = purrr::map(stringr::str_split(matched_trait, ";"), trimws),
    # Extract GIFT trait name
    extracted_trait = purrr::map(
      split_traits,
      \(x) stringr::str_extract(x, "^(.*)\\s\\[", group = 1)
    ),
    # Extract GIFT trait code
    extracted_code = purrr::map(
      split_traits,
      \(x) stringr::str_extract(x, "\\[GIFT:(.+)\\]", group = 1)
    ),
    # Get level
    gift_lvl = purrr::map(
      extracted_code,
      \(x) stringr::str_count(x, stringr::fixed(".")) + 1L
    )
  ) |>
  # Put everything in a tidy format
  tidyr::unnest(split_traits:gift_lvl)

## Level 2 traits
# Matching code at level 2
apd_gift_lvl2 = apd_gift_detailed |>
  filter(gift_lvl == 2) |>
  left_join(
    gift_trait_meta |> distinct(Lvl2, Trait1),
    by = c(extracted_code = "Lvl2")
  )

# Problematic traits
apd_gift_lvl2 |>
  filter((extracted_trait != Trait1) | is.na(Trait1))

## Level 3 traits
apd_gift_lvl3 = apd_gift_detailed |>
  filter(gift_lvl == 3) |>
  left_join(
    gift_trait_meta |> distinct(Lvl3, Trait2),
    by = c(extracted_code = "Lvl3")
  )

# Problematic traits
apd_gift_lvl3 |>
  filter((extracted_trait != Trait2) | is.na(Trait2))
```
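As a possible complement to the name/code join, a purely syntactic check could flag malformed entries such as the `4:14.1` case before any lookup in GIFT. The sketch below uses base R only. The helper name `is_well_formed_gift` is hypothetical, and the pattern assumes that entries always look like `name [GIFT:<dot-separated integers>]`, which is inferred from the examples in this issue and not from any GIFT specification.

```r
# Hypothetical helper: TRUE when an entry looks like "name [GIFT:1.2.3]".
# Assumption: GIFT codes are integers separated by dots.
is_well_formed_gift <- function(x) {
  grepl("^\\S.*\\s\\[GIFT:[0-9]+(\\.[0-9]+)*\\]$", x, perl = TRUE)
}

# The two variants of the entry reported above
entries <- c(
  "leaf_thorns_1 [GIFT:4:14.1]",
  "leaf_thorns_1 [GIFT:4.14.1]"
)
check_result <- is_well_formed_gift(entries)
print(data.frame(entry = entries, well_formed = check_result))
```

The entry with the colon is reported as not well formed, the corrected one passes. A check like this only validates the syntax of the code, so it would not catch a wrong but valid code such as the `trait_0030060` mismatch. That case still needs the join against `GIFT::GIFT_traits_meta()`.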