taxonomic_splits should have a `keep_taxonomic_splits` option

ehwenk commented 1 year ago

The parameter taxonomic_splits requires an additional option, keep_taxonomic_splits.

This option would keep duplicate rows for a canonical name only where there is genuine ambiguity about which current canonical name a given aligned name refers to. This contrasts with return_all, which would return all rows where a synonym (or other taxonomic status) exists.

For instance, the once-upon-a-time taxon concept Acacia aneura has been split into 3 taxa, Acacia aneura, Acacia minyura and Acacia paraneura. However, there is one additional entry under the canonical name Acacia quadrimarginea, which is a misapplied use. There are also many synonyms where there is no ambiguity.

These should be separate outputs. We need to think about how to structure them, and probably change how update_taxonomy functions depending on the desired output.
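A minimal sketch of the distinction, using a toy lookup table rather than real APC rows. The column names, the status values and the `splits_only` output are assumptions for illustration only; this is not how update_taxonomy currently works.

```r
library(dplyr)

# Toy lookup: one row per (name as used, accepted name) pair.
# Illustrative values only, not the real APC schema or contents.
toy_apc <- data.frame(
  canonical_name = c("Acacia aneura", "Acacia aneura", "Acacia aneura",
                     "Acacia aneura", "Selaginella leptostachya"),
  accepted_name = c("Acacia aneura", "Acacia minyura", "Acacia paraneura",
                    "Acacia quadrimarginea", "Selaginella australiensis"),
  taxonomic_status = c("accepted", "pro parte misapplied", "pro parte misapplied",
                       "misapplied", "taxonomic synonym"),
  stringsAsFactors = FALSE
)

# A name is ambiguous when it points to more than one accepted name.
ambiguity <- toy_apc %>%
  dplyr::group_by(canonical_name) %>%
  dplyr::summarise(
    n_accepted = dplyr::n_distinct(accepted_name),
    .groups = "drop"
  ) %>%
  dplyr::mutate(is_ambiguous = n_accepted > 1)

# Hypothetical "keep splits" behaviour: keep every row, but only for ambiguous names.
splits_only <- toy_apc %>%
  dplyr::semi_join(
    dplyr::filter(ambiguity, is_ambiguous),
    by = "canonical_name"
  )
```

Here the unambiguous synonym (Selaginella leptostachya) drops out, while all four Acacia aneura rows remain. Whether the misapplied Acacia quadrimarginea row belongs in that output is the open design question.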

ehwenk commented 1 year ago

see also issue #120

ehwenk commented 1 year ago

@falster, I don't think we're missing an option. Splits are one-to-many merges, while alternate taxonomic status values are many-to-one joins. There isn't anything else to return. See the latest comment in issue #120 and also:

# with `alternate taxonomic status`, a single accepted_name_usage matches to multiple canonical names
# this is a one-x-to-one-y join, because each canonical name (i.e. aligned name) has only a single accepted name
# so - I might be wrong - I don't think this should result in any propagation of rows
# so - there isn't a "return_all" option that is separate from "return_splits"
# please poke holes in this argument

# "Selaginella australiensis" is a good example with 9 synonyms
# For this, collapsing `alternate_taxonomic_status_aligned` must also be performed on the `aligned_name`, not the `accepted_name`
# For instance, if the `aligned_name` is one of the taxonomic synonyms, the `taxonomic_status_aligned` is that synonym's `taxonomic_status`, while `taxonomic_status` is accepted, with no alternatives
# It is only if the `aligned_name` is already the `accepted_name` that it is appropriate to report alternate taxonomic status values (I think, happy to be told I'm wrong)
# So I think, really, this is almost a mutate on `resources$APC` before/as it is being joined during update_taxonomy

collapsed_taxonomic_status <-
  resources$APC %>%
  dplyr::select(canonical_name, accepted_name_usage, accepted_name_usage_ID, taxon_ID, taxonomic_status) %>%
  dplyr::group_by(accepted_name_usage_ID) %>%
  dplyr::arrange(taxonomic_status) %>% ## XX replace with proper function with `my_order`
  dplyr::mutate(alternative_taxonomic_status_aligned = 
                  taxonomic_status %>% 
                  unique() %>% 
                  subset(., . != "accepted") %>% 
                  paste0(collapse = " | ") %>% 
                  dplyr::na_if("")
  ) %>%
  dplyr::slice(1) %>%
  dplyr::ungroup()

data %>%
  dplyr::left_join(
    by = "aligned_name",
    collapsed_taxonomic_status %>%
      dplyr::rename(aligned_name = canonical_name) %>%
      dplyr::select(
        aligned_name,
        alternative_taxonomic_status_aligned
      )
  )
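The row-propagation argument above can be checked on toy data. This is a sketch only: the `aligned`, `synonyms` and `splits` tables are hypothetical stand-ins, not APCalign objects.

```r
library(dplyr)

# Three aligned names, one row each.
aligned <- data.frame(
  aligned_name = c("Selaginella australiensis", "Selaginella leptostachya",
                   "Acacia aneura"),
  stringsAsFactors = FALSE
)

# Many-to-one: several names share one accepted name,
# but each aligned name appears only once in the lookup.
synonyms <- data.frame(
  aligned_name = c("Selaginella australiensis", "Selaginella leptostachya"),
  accepted_name = c("Selaginella australiensis", "Selaginella australiensis"),
  stringsAsFactors = FALSE
)

# One-to-many: a single aligned name has several candidate accepted names.
splits <- data.frame(
  aligned_name = rep("Acacia aneura", 3),
  accepted_name = c("Acacia aneura", "Acacia minyura", "Acacia paraneura"),
  stringsAsFactors = FALSE
)

joined_synonyms <- dplyr::left_join(aligned, synonyms, by = "aligned_name")
joined_splits <- dplyr::left_join(aligned, splits, by = "aligned_name")

# Compare row counts before and after each join.
c(before = nrow(aligned),
  after_synonyms = nrow(joined_synonyms),
  after_splits = nrow(joined_splits))
```

Only the join against `splits` adds rows, which is consistent with the claim that synonyms alone should not propagate rows.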

ehwenk commented 1 year ago

@dfalster @wcornwell Can you run the code at the bottom of the comment and think about the following questions:

What are we actually trying to document with the field alternative_taxonomic_status_aligned that is different to what we're documenting with splits/most likely species/collapses? I'm going in circles, seeing them as distinct vs near-identical concepts. The only place they differ is that alternative_taxonomic_status_aligned includes misapplied & excluded.
With Selaginella australiensis, if the aligned_name is Selaginella australiensis, there is no ambiguity in the taxonomic_status of the aligned_name; it is accepted. Same with all the synonyms - Selaginella leptostachya simply is a taxonomic_synonym of Selaginella australiensis, which is accepted.
[ ] Is there a reason that the row for Selaginella australiensis should document the taxonomic status of all the synonyms (& like) of names for which Selaginella australiensis is the accepted name??
With Acacia aneura, if the aligned_name is Acacia aneura, the ambiguity in whether this is truly Acacia aneura or instead Acacia paraneura or Acacia minyura is documented with the columns about alternative accepted names (i.e. splits). And these also document the taxonomic status of the alternative names. It is true there is also Acacia aneura that has been misapplied to Acacia quadrimarginea. Maybe this is part of the alternative_accepted_names column for the most_likely_species option, but Acacia quadrimarginea is excluded from the list of `return_all`.
With Acacia minyura, if the aligned_name is Acacia minyura, there is no ambiguity in the taxonomic_status of the aligned_name; it is accepted. It isn't pro parte misapplied, so the row indicating that a plant identified as Acacia aneura might actually be Acacia minyura should not cause Acacia minyura to have pro parte misapplied added as an alternative_taxonomic_status_aligned. As for its synonyms, that is the same as the Selaginella example above. So no alternative_taxonomic_status_aligned values should be added to Acacia minyura.
[ ] What am I missing??

resources$APC %>%
  dplyr::mutate(
    accepted_name = resources$`APC list (accepted)`$canonical_name[match(accepted_name_usage_ID, resources$`APC list (accepted)`$accepted_name_usage_ID)]
  ) %>%
  dplyr::filter(species_and_infraspecific(taxon_rank)) %>%
  dplyr::filter(taxonomic_status != "excluded") %>%
  dplyr::select(canonical_name, accepted_name, accepted_name_usage_ID, taxon_ID, taxonomic_status, taxon_rank) %>% 
  dplyr::filter(canonical_name %in% c("Acacia aneura", "Acacia minyura", "Acacia paraneura") | 
                  accepted_name_usage_ID %in% c("https://id.biodiversity.org.au/node/apni/6707550","https://id.biodiversity.org.au/node/apni/2915027","https://id.biodiversity.org.au/node/apni/2914546")) %>% View()

wcornwell commented 1 year ago

*Is there a reason that the row for Selaginella australiensis should document the taxonomic status of all the synonyms (& like) of names for which Selaginella australiensis is the accepted name??*

You would want to do this if you were building a webpage for the species. ALA and POWO must have done something like that to get this: https://powo.science.kew.org/taxon/urn:lsid:ipni.org:names:90399-3#synonyms or https://bie.ala.org.au/species/https://id.biodiversity.org.au/node/apni/2915027#names

But I'd argue it's beyond the scope of 99.99% (possibly 100%) of use cases for APCalign. I can't really think why you'd want to do this for more than one name at once if you're not building a flora or flora-like resource. I sometimes will look up ALA or POWO to see the synonyms of a single name, but I can't imagine why I'd need to do that for 10 or 100 or 1000 names.

So I'd argue that is beyond (current) scope for this project.

ehwenk commented 1 year ago

But I'd argue it's beyond the scope of 99.99% (possibly 100%) of use cases for APCalign. I can't really think why you'd want to do this for more than one name at once if you're not building a flora or flora-like resource. I sometimes will look up ALA or POWO to see the synonyms of a single name, but I can't imagine why I'd need to do that for 10 or 100 or 1000 names.

And we wouldn't actually be reporting the synonyms, only that synonyms exist. So I'm leaving it out at this point. It isn't about aligning a name at all.

ehwenk commented 1 year ago

Closed by commit ac799c3

traitecoevo / APCalign

taxonomic_splits should have a `keep_taxonomic_splits` option #130