traitecoevo / APD

The Australian Plant Traits Dictionary
https://traitecoevo.github.io/APD/

Typos in correspondence with GIFT #28

Open Rekyt opened 2 months ago

Rekyt commented 2 months ago

Hi @ehwenk & @dfalster,

While building the trait correspondence network, I noticed some issues in the correspondence with GIFT traits (I still have to do the same for BIEN and TRY). I checked that the GIFT trait codes and trait names provided by AusTraits both exist in GIFT, and that each provided GIFT trait name matches its provided GIFT trait code.

The script I used is below, but first I'll detail my findings.

  1. For trait_0030020 & trait_0030015, GIFT_close contains multiple traits on a single line. Is this on purpose? Other matched traits span multiple lines.
  2. For trait_0030215, there is a typo in the GIFT_exact name: it is written "Fuiting time", missing an r.
  3. For trait_0030020, there is a typo in the GIFT code 'leaf_thorns_1 [GIFT:4:14.1]', which should be 'leaf_thorns_1 [GIFT:4.14.1]'.
  4. Several GIFT trait names are written following AusTraits' convention rather than GIFT's, e.g. 'seed_height' instead of 'Seed height'.
  5. Capitalization of trait names doesn't follow GIFT's: APD tends to use snake_case while GIFT uses Camel_snake_case. For example, the GIFT name referenced in APD is 'flower_colour' [APD:trait_0012417], while in GIFT the trait is 'Flower_colour' [GIFT:3.21.1].
  6. There is an error in the GIFT match for trait_0030060: GIFT_close matches GIFT 1.4.1 (Climber_1), whereas it should match GIFT 3.4.1 (Reproduction_sexual_1). This was the trait that triggered my systematic search for potential mismatches, because the correspondence network contained a much larger connected component than expected, linking traits that shouldn't match.
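The malformed code in item 3 is the kind of problem a simple format check can catch before any lookup against GIFT. Below is a minimal base R sketch, assuming a well-formed GIFT code is made of digits separated by dots (as in the codes quoted above). The `matched` vector is illustrative input built from strings quoted in this issue, not the contents of the APD file.

```r
# Illustrative input: entries in the "name [GIFT:code]" form quoted in this issue
matched <- c(
  "leaf_thorns_1 [GIFT:4:14.1]",
  "leaf_thorns_1 [GIFT:4.14.1]",
  "Climber_1 [GIFT:1.4.1]"
)

# Pull out whatever sits between "[GIFT:" and the closing "]"
codes <- sub("^.*\\[GIFT:(.+)\\]$", "\\1", matched)

# Assumption: a valid code is dot-separated digits, e.g. "4.14.1"
well_formed <- grepl("^[0-9]+(\\.[0-9]+)*$", codes)

# Entries whose code does not follow that pattern
malformed <- matched[!well_formed]
print(malformed)
```

A check like this only validates the shape of the code; whether the code exists in GIFT and matches the given name still needs a join against the GIFT trait metadata.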

Maybe you could use an adaptation of the script below to perform semi-automated quality checks when updating the APD?

For the sake of completeness, I'll try performing the same checks for TRY and BIEN.

Matching script

```r
library("dplyr")

gift_trait_meta = GIFT::GIFT_traits_meta()

apd_gift_detailed = tibble::as_tibble(read.csv("APD_traits_input.csv")) |>
  select(identifier:label, starts_with("GIFT")) |>
  rename(trait_id = identifier) |>
  tidyr::pivot_longer(
    starts_with("GIFT"), names_to = "match_type", values_to = "matched_trait"
  ) |>
  filter(matched_trait != "") |>
  mutate(
    # Split for traits that have multiple matches on one line
    split_traits = purrr::map(stringr::str_split(matched_trait, ";"), trimws),
    # Extract GIFT trait name
    extracted_trait = purrr::map(
      split_traits, \(x) stringr::str_extract(x, "^(.*)\\s\\[", group = 1)
    ),
    # Extract GIFT trait code
    extracted_code = purrr::map(
      split_traits, \(x) stringr::str_extract(x, "\\[GIFT:(.+)\\]", group = 1)
    ),
    # Get level
    gift_lvl = purrr::map(
      extracted_code, \(x) stringr::str_count(x, stringr::fixed(".")) + 1L
    )
  ) |>
  # Put everything in a tidy format
  tidyr::unnest(split_traits:gift_lvl)

## Level 2 traits
# Matching code at level 2
apd_gift_lvl2 = apd_gift_detailed |>
  filter(gift_lvl == 2) |>
  left_join(
    gift_trait_meta |> distinct(Lvl2, Trait1),
    by = c(extracted_code = "Lvl2")
  )

# Problematic traits
apd_gift_lvl2 |>
  filter((extracted_trait != Trait1) | is.na(Trait1))

## Level 3 traits
apd_gift_lvl3 = apd_gift_detailed |>
  filter(gift_lvl == 3) |>
  left_join(
    gift_trait_meta |> distinct(Lvl3, Trait2),
    by = c(extracted_code = "Lvl3")
  )

# Problematic traits
apd_gift_lvl3 |>
  filter((extracted_trait != Trait2) | is.na(Trait2))
```

ehwenk commented 1 month ago

@Rekyt Thank you for documenting these! I'll make the changes on a branch tomorrow.

It is sometimes intentional to have multiple close matches within a single cell in the csv files. The code that builds the formal ontology splits those into multiple lines, each assigned as an example of type "close_match". But I'll double-check the ones you mentioned to make sure nothing else is wrong.
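
For readers following along, the splitting step described above can be pictured with a small base R sketch. This is not the actual ontology-building code: the data frame, its placeholder identifiers (`trait_A`, `trait_B`) and the placeholder GIFT entries are hypothetical, and the `;` separator is assumed from the matching script earlier in this thread.

```r
# Hypothetical input: one cell holds two close matches separated by ";"
apd <- data.frame(
  identifier = c("trait_A", "trait_B"),
  GIFT_close = c("name_1 [GIFT:1.1]; name_2 [GIFT:1.2]", "name_3 [GIFT:2.1]"),
  stringsAsFactors = FALSE
)

# Split each cell on ";" (assumed separator), giving one character vector per row
parts <- strsplit(apd$GIFT_close, ";", fixed = TRUE)

# Repeat each identifier once per match and label every row as a close match
long <- data.frame(
  identifier = rep(apd$identifier, lengths(parts)),
  match_type = "close_match",
  matched_trait = trimws(unlist(parts)),
  stringsAsFactors = FALSE
)
print(long)
```

Each match ends up on its own row with the same identifier, so a multi-match cell and several single-match lines become equivalent after this step.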