traitecoevo / APD

The Australian Plant Traits Dictionary
https://traitecoevo.github.io/APD/

Typos in some correspondence with TRY traits? #25

Closed Rekyt closed 3 months ago

Rekyt commented 4 months ago

Hi @ehwenk and @dfalster 👋

As mentioned in PR #24, I'm using the raw APD_traits_input.csv to get trait correspondence across databases.

I noticed that some entries in the TRY columns look wrong (or are at least non-standard). I'm unsure how to tackle these myself, so I'd rather open an issue about them.

My routine is the following:

library(dplyr)

apd_try_traits = read.csv("APD_traits_input.csv") |>
  tibble::as_tibble() |>
  select(
    trait_id = identifier, trait, label, contains("BIEN"), contains("GIFT"),
    contains("TRY")
  ) |>
  # Get all traits for which there is an equivalent in TRY
  filter(if_any(contains("TRY"), \(x) x != "")) |>
  select(trait_id, trait, label, contains("TRY")) |>
  # Making data tidy
  tidyr::pivot_longer(
    contains("TRY"), names_to = "match_type", values_to = "match_value"
  ) |>
  filter(match_value != "") |>
  # Extract TRY TraitIDs
  mutate(
    extracted_trait = match_value |>
      stringr::str_extract_all("\\[TRY:\\d+\\]") |>
      purrr::map(stringr::str_remove, "\\[TRY:") |>
      purrr::map(stringr::str_remove, "\\]"),
    match_type = stringr::str_extract(match_type, "[:alpha:]+"),
    # Count number of match traits
    length_extracted = purrr::map_int(extracted_trait, length)
  )

Counting the number of TRY IDs extracted from each non-empty cell gives the following:

> apd_try_traits |>
+     count(length_extracted)
# A tibble: 4 × 2
  length_extracted     n
             <int> <int>
1                0     5
2                1   316
3                2     7
4                3     1

So 5 rows with non-empty TRY columns yield 0 matches with my extraction of TRY IDs.

Looking at the strings in those columns, I get:

> apd_try_traits |>
+     filter(length_extracted == 0) |>
+     pull(match_value)
[1] "specific leaf area [TO:0000562] (https://www.try-db.org/de/de.php)"                                  
[2] "Leaf epidermis cell area; Leaf mesophyll cell area [TRY:338; 573] (https://www.try-db.org/de/de.php)"
[3] "Bark thickness [TRY:24, TRY:3355, TRY:3356] (https://www.try-db.org/de/de.php)"                      
[4] "Bark thickness [TRY:24, 3355, 3356] (https://www.try-db.org/de/de.php)"                              
[5] "plant lifespan and age of first flowering [LEDA:1.3] (https://www.try-db.org/de/de.php)"          

The first string points to a Trait Ontology (TO) term, not to a TRY trait. For leaf epidermis cell area and bark thickness, the problem is how the TRY IDs are written: several IDs are packed into a single pair of brackets, separated by "," or ";", instead of one [TRY:id] per ID. Bark thickness is also written in two different ways. For plant lifespan, the string refers to a LEDA trait. Is this relevant here?

I've checked, and these issues propagate to the RDF file.
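
In case it is useful while the table is being corrected, here is a minimal sketch of a more tolerant extraction. It is only an illustration: extract_try_ids is a hypothetical helper (it is not part of APD or of my routine above), it uses base R only, and it assumes that every run of digits inside a bracket starting with "TRY:" is a TRY TraitID.

```r
# Hypothetical helper, not part of APD: extract all TRY TraitIDs from
# match strings, tolerating "[TRY:24, 3355]" and "[TRY:338; 573]" styles.
extract_try_ids <- function(x) {
  # Keep only bracketed blocks that start with "TRY:"
  blocks <- regmatches(x, gregexpr("\\[TRY:[^]]*\\]", x))
  lapply(blocks, function(b) {
    # Assumption: every run of digits inside such a block is a TraitID
    ids <- unlist(regmatches(b, gregexpr("[0-9]+", b)))
    if (is.null(ids)) character(0) else ids
  })
}

# The five strings reported above
problem_strings <- c(
  "specific leaf area [TO:0000562] (https://www.try-db.org/de/de.php)",
  "Leaf epidermis cell area; Leaf mesophyll cell area [TRY:338; 573] (https://www.try-db.org/de/de.php)",
  "Bark thickness [TRY:24, TRY:3355, TRY:3356] (https://www.try-db.org/de/de.php)",
  "Bark thickness [TRY:24, 3355, 3356] (https://www.try-db.org/de/de.php)",
  "plant lifespan and age of first flowering [LEDA:1.3] (https://www.try-db.org/de/de.php)"
)

try_ids <- extract_try_ids(problem_strings)
lengths(try_ids)
# 0 2 3 3 0
```

With this, the TO and LEDA strings still give no TRY ID, which seems right to me, while the two bark thickness variants give the same three IDs.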

ehwenk commented 4 months ago

@Rekyt Thank you for looking so closely at the table! We recently changed how the matches/examples in other databases were being read in to better deal with multiple values in a cell. I've added the necessary corrections to fix these particular mistakes to that branch - https://github.com/traitecoevo/APD/commit/6301aa0fdbe0cf84f621158875a55948f8ffa10b

(It is hard to merge multiple branches simultaneously editing a csv file, and the columns have been edited on this branch as well.)

dfalster commented 4 months ago

Thanks for the detailed example @Rekyt, very helpful!