traitecoevo / APD

The Australian Plant Traits Dictionary
https://traitecoevo.github.io/APD/

Typos in correspondence with TRY #29

Open Rekyt opened 3 months ago

Rekyt commented 3 months ago

Similarly to #28. Let's look at the correspondence with TRY.

I've performed a similar matching of codes and names in TRY, and found a few typos (see the detailed script below).

  1. Same remark as for GIFT: some APD traits have several matching traits on the same line for TRY, e.g., trait_0030810 has two traits matching on GIFT_close.
  2. Names match overall, but some name correspondences are off because TRY silently modified the names of some traits. The names can be updated by matching on TraitID in an updated TRY traits table (downloadable through the TRY website: https://www.try-db.org/de/DnldTraitList.php).
  3. More serious are the non-corresponding codes, which make some matches wrong. For example, 'leaf_cell_wall_N_per_cell_wall_dry_mass' [APD:trait_0001511] is listed in APD as a close match with 'Leaf cell wall nitrogen (N) per unit cell wall dry mass' [TRY:96]; however, in TRY this code corresponds to 'Seed oil content per seed mass'. Given the matched name, it should instead be matched with [TRY:3377]. See the script for more examples.
Matching script

```r
library(dplyr)

# TRY trait list downloaded from the TRY website
# (skip the first 3 lines of the file, drop column 6)
try_traits = readr::read_delim("tde2024422162351.txt", skip = 3, col_select = -6)

apd_try_detailed = tibble::as_tibble(read.csv("APD_traits_input.csv")) |>
  select(identifier:label, starts_with("TRY")) |>
  rename(trait_id = identifier) |>
  tidyr::pivot_longer(
    starts_with("TRY"), names_to = "match_type", values_to = "matched_trait"
  ) |>
  filter(matched_trait != "") |>
  mutate(
    # Split for traits that have multiple matches on one line
    split_traits = purrr::map(stringr::str_split(matched_trait, ";"), trimws),
    # Extract TRY trait name
    extracted_trait = purrr::map(
      split_traits, \(x) stringr::str_extract(x, "^(.*)\\s\\[", group = 1)
    ),
    # Extract TRY trait code
    extracted_code = purrr::map(
      split_traits,
      \(x) stringr::str_extract(x, "\\[TRY:(.+)\\]", group = 1) |>
        as.numeric()
    )
  ) |>
  tidyr::unnest(split_traits:extracted_code)

apd_try_smaller = apd_try_detailed |>
  # Match names based on trait code
  left_join(
    try_traits |>
      distinct(TraitID, name_matched_on_code = Trait),
    by = c(extracted_code = "TraitID")
  ) |>
  # Match code based on trait name
  left_join(
    try_traits |>
      distinct(code_matched_on_name = TraitID, Trait),
    by = c(extracted_trait = "Trait")
  ) |>
  select(trait, extracted_trait, extracted_code, name_matched_on_code,
         code_matched_on_name)

## Potentially problematic traits
# non-matching names according to code
apd_try_smaller |>
  filter(extracted_trait != name_matched_on_code)

# non-matching code according to name
apd_try_smaller |>
  filter(extracted_code != code_matched_on_name)
```

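For point (2), refreshing the stored names from a current TRY table could look roughly like the sketch below. It is only an illustration: the two data frames are made-up stand-ins (not real APD or TRY rows), the column names `try_code`, `try_name`, `current_name` and `status` are hypothetical, and it uses base R only so it runs without extra packages.

```r
# Toy stand-ins for the APD matches and the current TRY trait list.
# All rows are invented for illustration; they are NOT real APD/TRY entries.
apd_matches <- data.frame(
  trait_id = c("trait_A", "trait_B", "trait_C"),
  try_code = c(1, 2, 3),
  try_name = c("Old name one", "Name two", "Name three"),
  stringsAsFactors = FALSE
)
try_current <- data.frame(
  TraitID = c(1, 2, 4),
  Trait   = c("New name one", "Name two", "Name four"),
  stringsAsFactors = FALSE
)

# Look up the current TRY name for each stored code (NA if the code is absent)
idx <- match(apd_matches$try_code, try_current$TraitID)
apd_matches$current_name <- try_current$Trait[idx]

# Classify each row: unchanged, renamed in TRY, or code not found at all
apd_matches$status <- ifelse(
  is.na(idx), "code not found",
  ifelse(apd_matches$try_name == apd_matches$current_name, "ok", "renamed")
)

# Overwrite the stored name only where the code was found
refreshed <- apd_matches
refreshed$try_name <- ifelse(is.na(idx), refreshed$try_name, refreshed$current_name)

print(refreshed[, c("trait_id", "try_code", "try_name", "status")])
```

Overwriting names by code is only safe once the codes themselves are trusted, so rows affected by point (3) would need to be checked by hand first.
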
ehwenk commented 2 weeks ago

Addressed (2) and (3) above with https://github.com/traitecoevo/APD/commit/b69f40c800be44de1101bf806e6b79de151d9633

(1) is intentional.

ehwenk commented 2 weeks ago

@Rekyt Can you check if the changes on this branch look good to you? There are ~3 traits that won't match because we have to change ";" to "," in the names.

Thank you for pointing out the inconsistencies, especially those places where we have an incorrect TRY number-name match.

Rekyt commented 1 week ago

With the updated APD_traits_input.csv file, I only get mismatches for the traits you mention, because of the substitution of semicolons by commas and also because three dots were converted to an actual ellipsis character, so it should be fine!
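
If those remaining mismatches are a nuisance in future checks, one option is to normalise both names before comparing them. A minimal sketch, assuming the only differences are ";" versus "," and "..." versus the ellipsis character; `normalise_name` is a hypothetical helper and the example strings are invented:

```r
# Hypothetical helper: bring trait names to a common form before comparison.
normalise_name <- function(x) {
  x <- gsub(";", ",", x, fixed = TRUE)         # semicolons -> commas
  x <- gsub("\u2026", "...", x, fixed = TRUE)  # ellipsis character -> three dots
  trimws(x)
}

# Invented example names, not real APD or TRY entries
apd_name <- "Some trait (a, b\u2026)"
try_name <- "Some trait (a; b...)"

apd_name == try_name                                   # raw comparison
normalise_name(apd_name) == normalise_name(try_name)   # normalised comparison
```
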

Also, I haven't mentioned it elsewhere, but as you may have guessed, I didn't find any issues with the traits matched on BIEN. It's simpler, of course, because it has only 53 traits.