traitecoevo / APCalign

R package for accessing, matching and updating species names of Australian flora
https://traitecoevo.github.io/APCalign/

taxonomic_splits should have a `keep_taxonomic_splits` option #130

Closed ehwenk closed 1 year ago

ehwenk commented 1 year ago

The parameter taxonomic_splits requires an additional option, keep_taxonomic_splits.

This option would maintain duplicate rows for a canonical name only where there is genuine ambiguity about which current canonical name a given aligned name refers to. This contrasts with return_all, which would return all rows where a synonym (or other taxonomic status) exists.

For instance, the once-upon-a-time taxon concept Acacia aneura has been split into 3 taxa, Acacia aneura, Acacia minyura and Acacia paraneura. However, there is one additional entry under the canonical name Acacia quadrimarginea, which is a misapplied use. There are also many synonyms where there is no ambiguity.

These should be separate outputs. We need to think about how to structure them, and probably change how `update_taxonomy` works, based on the desired output.
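
As a rough illustration of the distinction, here is a minimal base-R sketch on a made-up lookup table. Nothing here is real APC content or APCalign code: the data frame `toy_apc`, the placeholder names "Old name X" and "Accepted name Y", and the column `is_split` are all assumptions for illustration. It treats an aligned name as a split only when it could refer to more than one accepted name, so an unambiguous synonym is never duplicated.

```r
# Toy lookup table, invented for illustration; NOT real APC content.
# "Acacia aneura" maps to three accepted names (the split described above);
# "Old name X" is a placeholder for an unambiguous synonym.
toy_apc <- data.frame(
  aligned_name  = c("Acacia aneura", "Acacia aneura", "Acacia aneura",
                    "Old name X"),
  accepted_name = c("Acacia aneura", "Acacia minyura", "Acacia paraneura",
                    "Accepted name Y"),
  stringsAsFactors = FALSE
)

# Number of distinct accepted names each aligned name could refer to.
n_accepted <- tapply(
  toy_apc$accepted_name,
  toy_apc$aligned_name,
  function(x) length(unique(x))
)

# An aligned name counts as a split only if it is ambiguous (> 1 candidate).
toy_apc$is_split <- as.vector(n_accepted[toy_apc$aligned_name]) > 1

# Rows that a hypothetical `keep_taxonomic_splits` option would keep as duplicates.
split_rows <- toy_apc[toy_apc$is_split, ]
print(split_rows)
```

Under this reading, a return_all-style output would keep all four rows, while a splits-only output would keep just the three ambiguous "Acacia aneura" rows.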

ehwenk commented 1 year ago

see also issue #120

ehwenk commented 1 year ago

@falster, I don't think we're missing an option. Splits are one-to-many merges, while alternate taxonomic status values are many-to-one joins. There isn't anything else to return. See the latest comment in issue #120 and also:

# with `alternate taxonomic status`, a single accepted_name_usage matches to multiple canonical names
# this is a one-x-to-one-y join, because each canonical name (i.e. aligned name) has only a single accepted name
# so - I might be wrong - I don't think this should result in any propagation of rows
# so - there isn't a "return_all" option that is separate from "return_splits"
# please poke holes in this argument

# "Selaginella australiensis" is a good example with 9 synonyms
# For this, collapsing into `alternative_taxonomic_status_aligned` must also be performed on the `aligned_name`, not the `accepted_name`
# For instance, if the `aligned_name` is one of the taxonomic synonyms, the `taxonomic_status_aligned` is that synonym's `taxonomic_status`, while `taxonomic_status` is accepted, with no alternatives
# It is only if the `aligned_name` is already the `accepted_name` that it is appropriate to report alternate taxonomic status values (I think, happy to be told I'm wrong)
# So I think, really, this is almost a mutate on `resources$APC` before/as it is being joined during update_taxonomy

collapsed_taxonomic_status <-
  resources$APC %>%
  dplyr::select(canonical_name, accepted_name_usage, accepted_name_usage_ID, taxon_ID, taxonomic_status) %>%
  dplyr::group_by(accepted_name_usage_ID) %>%
  dplyr::arrange(taxonomic_status) %>% ## XX replace with proper function with `my_order`
  dplyr::mutate(alternative_taxonomic_status_aligned = 
                  taxonomic_status %>% 
                  unique() %>% 
                  subset(., . != "accepted") %>% 
                  paste0(collapse = " | ") %>% 
                  dplyr::na_if("")
  ) %>%
  dplyr::slice(1) %>%
  dplyr::ungroup()

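# attach the collapsed status to each aligned name; one lookup row per name, so no rows are added
data %>%
  dplyr::left_join(
    collapsed_taxonomic_status %>%
      dplyr::rename(aligned_name = canonical_name) %>%
      dplyr::select(
        aligned_name,
        alternative_taxonomic_status_aligned
      ),
    by = "aligned_name"
  )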
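
To make the argument easier to poke holes in, here is a minimal base-R sketch of the same idea on a made-up table, so it runs without `resources$APC` or dplyr. The table `toy_apc`, the placeholder names and IDs, and the helper `collapse_status` are assumptions for illustration, not APCalign code. It also simplifies one step: rather than relying on `arrange()` plus `slice(1)`, it picks the accepted row of each group explicitly.

```r
# Made-up rows standing in for `resources$APC`; names and IDs are placeholders.
toy_apc <- data.frame(
  canonical_name         = c("Name A", "Old name A1", "Old name A2", "Name B"),
  accepted_name_usage_ID = c("id_A", "id_A", "id_A", "id_B"),
  taxonomic_status       = c("accepted", "taxonomic synonym",
                             "misapplied", "accepted"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse all non-accepted statuses in a group into one
# string, or NA if the group only contains the accepted name.
collapse_status <- function(status) {
  alternatives <- sort(unique(status[status != "accepted"]))
  if (length(alternatives) == 0) {
    NA_character_
  } else {
    paste(alternatives, collapse = " | ")
  }
}

# One collapsed value per accepted_name_usage_ID.
collapsed <- tapply(
  toy_apc$taxonomic_status,
  toy_apc$accepted_name_usage_ID,
  collapse_status
)

# Keep only the accepted row of each group, keyed by its canonical name.
accepted_rows <- toy_apc[toy_apc$taxonomic_status == "accepted", ]
lookup <- data.frame(
  aligned_name = accepted_rows$canonical_name,
  alternative_taxonomic_status_aligned =
    as.vector(collapsed[accepted_rows$accepted_name_usage_ID]),
  stringsAsFactors = FALSE
)

# Toy input: an accepted name with alternatives, a synonym, and a plain accepted name.
toy_data <- data.frame(
  aligned_name = c("Name A", "Old name A1", "Name B"),
  stringsAsFactors = FALSE
)

# Left join: each aligned name matches at most one lookup row, so the row count
# is unchanged (note that merge() may reorder rows).
joined <- merge(toy_data, lookup, by = "aligned_name", all.x = TRUE)
print(joined)
```

In this toy case only "Name A" receives a collapsed value; the synonym "Old name A1" and the unambiguous "Name B" get `NA`, and the join returns exactly as many rows as went in.
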
ehwenk commented 1 year ago

@dfalster @wcornwell Can you run the code at the bottom of the comment and think about the following questions:

resources$APC %>%
  dplyr::mutate(
    accepted_name = resources$`APC list (accepted)`$canonical_name[match(accepted_name_usage_ID, resources$`APC list (accepted)`$accepted_name_usage_ID)]
  ) %>%
  dplyr::filter(species_and_infraspecific(taxon_rank)) %>%
  dplyr::filter(taxonomic_status != "excluded") %>%
  dplyr::select(canonical_name, accepted_name, accepted_name_usage_ID, taxon_ID, taxonomic_status, taxon_rank) %>% 
  dplyr::filter(canonical_name %in% c("Acacia aneura", "Acacia minyura", "Acacia paraneura") | 
                  accepted_name_usage_ID %in% c("https://id.biodiversity.org.au/node/apni/6707550","https://id.biodiversity.org.au/node/apni/2915027","https://id.biodiversity.org.au/node/apni/2914546")) %>% View()

wcornwell commented 1 year ago

[Screenshot 2023-08-25 at 11 05 04 am]

wcornwell commented 1 year ago

*Is there a reason that the row for Selaginella australiensis should document the taxonomic status of all the synonyms (and the like) of names for which Selaginella australiensis is the accepted name?*

You would want to do this if you were building a webpage for the species. ALA and POWO must have done something like that to get these: https://powo.science.kew.org/taxon/urn:lsid:ipni.org:names:90399-3#synonyms or https://bie.ala.org.au/species/https://id.biodiversity.org.au/node/apni/2915027#names

But I'd argue it's beyond the scope of 99.99% (possibly 100%) of use cases for APCalign. I can't really think why you'd want to do this for more than one name at once if you're not building a flora or flora-like resource. I will sometimes look up ALA or POWO to see the synonyms of a single name, but I can't imagine why I'd need to do that for 10 or 100 or 1000 names.

So I'd argue that is beyond (current) scope for this project.

ehwenk commented 1 year ago

> But I'd argue it's beyond the scope of 99.99% (possibly 100%) of use cases for APCalign. I can't really think why you'd want to do this for more than one name at once if you're not building a flora or flora-like resource. I will sometimes look up ALA or POWO to see the synonyms of a single name, but I can't imagine why I'd need to do that for 10 or 100 or 1000 names.

And we wouldn't actually be reporting the synonyms themselves, effectively just the fact that synonyms exist. So I'm leaving it out at this point. It isn't about aligning a name at all.

ehwenk commented 1 year ago

Closed by commit ac799c3