mhoban / rainbow_bridge

GNU General Public License v3.0

collapse_taxonomy incorrectly joins in the `last_taxon` value #80

Closed mhoban closed 3 months ago

mhoban commented 3 months ago

When looking up the taxid of the last identified taxonomic level, the code does a one-to-one match on name. In reality a single name can map to several taxids (names in different kingdoms/domains, names of subgenera, etc.), so the join can return more than one row for the same name.
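
To make the failure concrete, here is a minimal sketch with made-up data. The table shapes, the `zotu` column, and the placeholder taxids are assumptions for illustration only, not the pipeline's real tables. It just shows that a name-only `left_join` duplicates a row when the name appears more than once on the lineage side:

```r
library(dplyr)

# Hypothetical input: one sequence whose lowest identified level is a genus name
collapsed <- tibble(zotu = "Zotu1", last_level = "Morus")

# Hypothetical lineage table: the same name listed under two kingdoms.
# The taxids are placeholders, not real NCBI identifiers.
lineage <- tibble(
  taxon   = c("Morus", "Morus"),
  kingdom = c("Metazoa", "Viridiplantae"),
  taxid   = c(1L, 2L)
)

# Joining on the name alone matches both lineage rows,
# so the single input row comes back twice
joined <- left_join(collapsed, lineage, by = c("last_level" = "taxon"))
nrow(joined)
```

With one input row and two matching lineage rows, `joined` ends up with two rows, which is the one-to-one expectation breaking down.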

For now I think we should just remove this join step, but let's keep thinking about how to make it work.

The offending bit in collapse_taxonomy.R looks like this:

```r
collapsed <- filtered %>%
  mutate(
    # first replace NAs with ... and dropped with NA (so coalesce works)
    across(domain:species,~replace(.x,which(is.na(.x)),"...")),
    across(domain:species,~replace(.x,which(.x == dropped),NA))
  ) %>%
  # now get the lowest non-NA taxonomic level
  mutate(last_level = coalesce(species,genus,family,order,class,phylum,kingdom,domain)) %>%
  # there is a problem when multiple names exist for something
  # maybe use the actual level name in the join, if we can somehow
  # <---- this is where it goes bad, because there's actually a
  # many-to-many relationship but we're expecting one-to-one
  left_join(lineage,by=c("last_level" = "taxon")) %>%
  mutate(
    across(ends_with('_other'),~replace_na(.x,"")),
    across(domain:species,~replace(.x,which(.x == "..."),"")),
    across(domain:species,~replace_na(.x,dropped))
  )
```

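
One possible direction for the "use the actual level name in the join" idea, sketched on toy data. Everything here is an assumption: the `rank` column on `lineage`, the reduced set of levels, the invented names `Aus`/`Bus cus`, and the placeholder taxids. The idea is to record which level `coalesce` picked and join on name plus rank rather than name alone:

```r
library(dplyr)

# Toy stand-in for the table at the point of the join: NA marks a dropped level
tax <- tibble(
  zotu    = c("Zotu1", "Zotu2"),
  kingdom = c("Metazoa", "Metazoa"),
  genus   = c("Aus", "Bus"),
  species = c(NA, "Bus cus")
)

# Assumed lineage layout: one row per (taxon, rank); "Aus" is both a genus
# and a subgenus, which is what trips up a name-only join
lineage <- tibble(
  taxon = c("Aus", "Aus", "Bus cus"),
  rank  = c("genus", "subgenus", "species"),
  taxid = c(101L, 102L, 103L)
)

with_taxid <- tax %>%
  mutate(
    # lowest non-NA level, as in the original code
    last_level = coalesce(species, genus, kingdom),
    # remember which column that value came from
    last_rank = case_when(
      !is.na(species) ~ "species",
      !is.na(genus)   ~ "genus",
      !is.na(kingdom) ~ "kingdom"
    )
  ) %>%
  # match on name AND rank so the subgenus row is not pulled in
  left_join(lineage, by = c("last_level" = "taxon", "last_rank" = "rank"))
```

In this toy case each input row gets exactly one taxid. Names that are reused at the same rank in different kingdoms would still match more than once, so a parent level would probably have to be part of the join key as well.
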
mhoban commented 3 months ago

Commented out the join for now.