msberends / AMR

Functions to simplify and standardise antimicrobial resistance (AMR) data analysis and to work with microbial and antimicrobial properties by using evidence-based methods, as described in https://doi.org/10.18637/jss.v104.i03.
https://msberends.github.io/AMR/

mo_fullname("P. aeroginosa") does not return mo with highest matching score #54

Closed vrognas closed 1 year ago

vrognas commented 2 years ago

Hi,

I have just discovered this package and started to play with it; it looks very promising, and I would like to thank you for this contribution.

I have a dataset with bacteria coded as strings, and one string is obviously misspelled: "P. aeroginosa" instead of "P. aeruginosa". The spelling mistake is made in 2/7 (~20%) of cases.

When I use as.mo(), "P. aeroginosa" is interpreted as Pasteurella aerogenes, accompanied by a helpful message about the uncertainty. For "P. aeruginosa", as.mo() correctly returns Pseudomonas aeruginosa:

> as.mo("P. aeroginosa")
ℹ Function `as.mo()` is uncertain about "P. aeroginosa" (assuming Pasteurella
  aerogenes). Run `mo_uncertainties()` to review this.
Class <mo>
[1] B_PSTRL_AERG
> as.mo("P. aeruginosa")
Class <mo>
[1] B_PSDMN_AERG

However, it is my understanding that the organism with the highest matching score should be returned. When I check, both the misspelled and the correctly spelled string return Pseudomonas aeruginosa as the highest matching score:

> tibble::tibble(
...     p = AMR::mo_matching_score("P. aeroginosa", microorganisms$fullname),
...     name = microorganisms$fullname
... ) |>
...     dplyr::arrange(desc(p))
# A tibble: 70,760 × 2
       p name                    
   <dbl> <chr>                   
 1 0.75  Pseudomonas aeruginosa  
 2 0.719 Vibrio aerogenes        
 3 0.714 Paraferrimonas          
 4 0.714 Pseudaeromonas          
 5 0.690 Pasteurella aerogenes   
 6 0.688 Alteromonas alba        
 7 0.688 Psychrobacter glacincola
 8 0.682 Alteromonas             
 9 0.682 Lysobacter spongiicola  
10 0.679 Panacagrimonas          
# … with 70,750 more rows
> tibble::tibble(
...     p = AMR::mo_matching_score("P. aeruginosa", microorganisms$fullname),
...     name = microorganisms$fullname
... ) |>
...     dplyr::arrange(desc(p))
# A tibble: 70,760 × 2
       p name                     
   <dbl> <chr>                    
 1 0.773 Pseudomonas aeruginosa   
 2 0.714 Paraferrimonas           
 3 0.714 Salsuginimonas           
 4 0.706 Paraperlucidibaca        
 5 0.688 Psychrobacter glacincola 
 6 0.688 Vibrio aerogenes         
 7 0.68  Pseudomonas pertucinogena
 8 0.679 Panacagrimonas           
 9 0.679 Paraglaciecola           
10 0.679 Pseudaeromonas           
# … with 70,750 more rows

This means that I would expect as.mo() (and mo_fullname()) to return Pseudomonas aeruginosa in both cases. However, they do not – how come?

Thanks.

msberends commented 2 years ago

Hey, thanks for using the package!

You are absolutely right! I’ll look into this issue. It should definitely return Pseudomonas.

msberends commented 2 years ago

The algorithm behind as.mo() does some pre-matching first, so not all 70,000 microorganisms need to have a matching score calculated for each input value. I'll think of a solution to this problem.
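
To illustrate why pre-matching can hide the best score, here is a minimal sketch. It is not the package's actual algorithm: the pre-filter rule below is a made-up assumption, and the three scores are simply copied from the mo_matching_score() output in the first comment of this thread.

```r
# Matching scores for "P. aeroginosa", copied from the output earlier in this thread.
scores <- c(
  "Pseudomonas aeruginosa" = 0.75,
  "Vibrio aerogenes"       = 0.719,
  "Pasteurella aerogenes"  = 0.690
)

# Scoring every candidate: the global maximum wins.
best_overall <- names(scores)[which.max(scores)]

# HYPOTHETICAL pre-filter (an assumption, not the AMR internals):
# keep only names whose genus starts with "P" and whose species
# epithet starts with "aero", mirroring the literal input.
survives <- grepl("^P[a-z]+ aero", names(scores))

# Only the survivors are compared, so the true best match can be lost.
best_after_prefilter <- names(scores)[survives][which.max(scores[survives])]

print(best_overall)
print(best_after_prefilter)
```

With this toy filter, "Pseudomonas aeruginosa" is dropped before any score is compared (its epithet starts with "aeru", not "aero"), so the lower-scoring "Pasteurella aerogenes" is returned. Any pre-filter that is stricter than the scoring function can produce this kind of mismatch.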

For now, here's a workaround. You can use the reference_df argument in as.mo() and any mo_*() function by passing on a data set with your 'errors':

# this is wrong indeed:
mo_name(c("P. aeruginosa", "P. aeroginosa"))
#> i Function `as.mo()` is uncertain about "P. aeroginosa" (assuming Pasteurella aerogenes). Run `mo_uncertainties()` to review this.
#> [1] "Pseudomonas aeruginosa" "Pasteurella aerogenes" 

# with the 'reference_df' argument, we can fix this for now - let's look up the right ID of this Pseudomonas:
as.mo("Pseudomonas aeruginosa")
#> Class <mo>
#> [1] B_PSDMN_AERG

# use this as info for 'reference_df' (which accepts a data frame):
mo_name(c("P. aeruginosa", "P. aeroginosa"),
        reference_df = data.frame(old = "P. aeroginosa",
                                  mo = "B_PSDMN_AERG"))
#> [1] "Pseudomonas aeruginosa" "Pseudomonas aeruginosa"

# yeej!

# even easier: use as.mo() in reference_df itself, if you have a 100% certain name:
mo_name(c("P. aeruginosa", "P. aeroginosa"),
        reference_df = data.frame(old = "P. aeroginosa",
                                  mo = as.mo("Pseudomonas aeruginosa")))
#> [1] "Pseudomonas aeruginosa" "Pseudomonas aeruginosa"

This process can be automated by using an mo source for the package. In the online manual, you can find that reference_df runs get_mo_source() by default. Using this method, you only need to define the errors once in a text or Excel file, and the mo functions of the package will pick them up! So I would suggest for now to read about the mo source functions and try that out.
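
As a rough sketch of the "define the errors once in a file" idea: the snippet below writes the known misspelling to a CSV file and reads it back as a data frame for reference_df. The file name and the use of tempdir() are assumptions for illustration; the column names old and mo follow the workaround above. The AMR call is guarded so the snippet also runs without the package installed.

```r
# One row per known misspelling; 'mo' holds the code returned by as.mo().
corrections <- data.frame(
  old = "P. aeroginosa",
  mo  = "B_PSDMN_AERG",
  stringsAsFactors = FALSE
)

# Hypothetical file location (an assumption); use a permanent path in practice.
path <- file.path(tempdir(), "mo_corrections.csv")
write.csv(corrections, path, row.names = FALSE)

# Read the file back and check it has the expected columns.
reference <- read.csv(path, stringsAsFactors = FALSE)
stopifnot(identical(names(reference), c("old", "mo")))

# Pass the file contents as reference_df, as in the workaround above.
if (requireNamespace("AMR", quietly = TRUE)) {
  print(AMR::mo_name(c("P. aeruginosa", "P. aeroginosa"),
                     reference_df = reference))
}
```

To have the package pick the file up automatically, it should be possible to register it once with set_mo_source(path); please check ?set_mo_source for the exact file layout it expects before relying on this.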

vrognas commented 2 years ago

Thank you for the quick response and elegant workaround! 👍🏼

msberends commented 1 year ago

Fixed in #71, which implements a completely new MO interpretation algorithm. You can test it with the following command, but please be aware that it's a beta version:

install.packages("remotes") # if you haven't already
remotes::install_github("msberends/AMR")

If you want to revert to the latest release (1.8.2), you can just do:

install.packages("AMR")