When looking up the taxid of the last identified taxonomic level, the code does a one-to-one match on the name, but a single name can actually match several entries with different taxids (names in different kingdoms/domains, names of subgenera, etc.).
For now I think we should just remove this join step, but let's keep thinking about how to make it work.
The offending bit in collapse_taxonomy.R looks like this:
    collapsed <- filtered %>%
      mutate(
        # first replace NAs with ... and dropped with NA (so coalesce works)
        across(domain:species, ~replace(.x, which(is.na(.x)), "...")),
        across(domain:species, ~replace(.x, which(.x == dropped), NA))
      ) %>%
      # now get the lowest non-NA taxonomic level
      mutate(last_level = coalesce(species, genus, family, order, class, phylum, kingdom, domain)) %>%
      # there is a problem when multiple names exist for something
      # maybe use the actual level name in the join, if we can somehow
      # <---- the join below is where it goes bad, because there's actually a
      #       many-to-many relationship but we're expecting one-to-one
      left_join(lineage, by = c("last_level" = "taxon")) %>%
      mutate(
        across(ends_with('_other'), ~replace_na(.x, "")),
        across(domain:species, ~replace(.x, which(.x == "..."), "")),
        across(domain:species, ~replace_na(.x, dropped))
      )
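One possible direction, following the "use the actual level name in the join" comment above: record which rank supplied `last_level` and join on rank plus name. This is only a rough sketch on toy data. It assumes `lineage` has `rank` and `taxid` columns (hypothetical names, I haven't checked the real table), uses only three ranks to stay short, and the taxon names and taxids are made up.

```r
library(dplyr)

# Hypothetical lineage table: `rank` and `taxid` are assumed column names,
# and the names and taxids are invented for illustration.
lineage <- data.frame(
  taxon = c("Aus", "Aus", "Busidae"),
  rank  = c("genus", "subgenus", "family"),
  taxid = c(101L, 102L, 200L),
  stringsAsFactors = FALSE
)

# Toy stand-in for `filtered` after dropped levels have been set to NA
filtered <- data.frame(
  family  = c("Busidae", "Busidae"),
  genus   = c("Aus", NA),
  species = c(NA_character_, NA_character_),
  stringsAsFactors = FALSE
)

collapsed <- filtered %>%
  mutate(
    # lowest non-NA taxonomic level, as before
    last_level = coalesce(species, genus, family),
    # which rank that value came from (same order as the coalesce)
    last_rank = case_when(
      !is.na(species) ~ "species",
      !is.na(genus)   ~ "genus",
      !is.na(family)  ~ "family"
    )
  ) %>%
  # join on name AND rank, so the subgenus "Aus" no longer matches
  left_join(lineage, by = c("last_level" = "taxon", "last_rank" = "rank"))
```

On this toy data a name-only join returns two rows for "Aus" (genus and subgenus), while the rank-aware join returns one. It would not help with identical names at the same rank in different kingdoms/domains; that probably needs an extra key from higher up the lineage. With dplyr 1.1.0 or later, passing `relationship = "many-to-one"` to `left_join()` should turn any remaining duplicates into an error instead of silently adding rows.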