msberends / AMR

Functions to simplify and standardise antimicrobial resistance (AMR) data analysis and to work with microbial and antimicrobial properties by using evidence-based methods, as described in https://doi.org/10.18637/jss.v104.i03.
https://msberends.github.io/AMR/
Other
83 stars 12 forks source link

Group isolates #147

Open andrewjmc opened 4 months ago

andrewjmc commented 4 months ago

Hello,

Thank you for this tremendously useful package. It does so much out of the box and is already proving useful to me.

I love the flexibility afforded by the first_isolate function. However, for some analyses, it would be useful to know the "secondary" isolates which match with a first isolate. Given the way in which is_first_isolate is coded, it doesn't seem immediately obvious if a simple modification could allow a "group_isolates" function which returns a group number (1:n). This way it would be easy to count the number of times a given isolate (however defined) has been identified, and when.

Thanks for your thoughts.

Andrew

msberends commented 4 months ago

Hi Andrew,

Wow, many thanks for your very nice words! Great that it can be useful to you, and thanks for letting us explicitly know.

You are right - the outcome of first_isolate() is strictly dichotomous, returning TRUE or FALSE. But what you’re looking for is possible with the package. This function internally uses the function get_episode() (see manual here), which does return a sequence number of a patient’s isolates. The first isolate function and is_new_episode() essentially return TRUE for each new episode start, and FALSE otherwise.

In the documentation of the link I sent, this explains how the two function compare:

  x <- df %>%
    filter_first_isolate(
      include_unknown = TRUE,
      method = "episode-based"
    )

  y <- df %>%
    group_by(patient, mo) %>%
    filter(is_new_episode(date, 365)) %>%
    ungroup()

Does this help a bit? Perhaps get_episode() is what you’re after?

andrewjmc commented 4 months ago

Thanks. I had noted those functions, but they don't take the same range of arguments as the first_isolate function, e.g. phenotype based, etc. Is it possible to get groups with a similar degree of flexibility?

msberends commented 4 months ago

So given an episode length of 365 days, is what you're looking for the get_episode() outcome but then with all arguments available in first_isolate(), or are you looking for the last column that returns the sequence number of isolates within an episode?

patient date mo. get_episode() first_isolate() requested?
A 2019-01-01 E. coli 1 TRUE 1
A 2019-03-01 E. coli 1 FALSE 2
A 2021-01-01 E. coli 2 TRUE 1
B 2008-01-01 E. coli 1 TRUE 1
B 2008-01-01 E. coli 1 FALSE 2
C 2020-01-01 E. coli 1 TRUE 1
andrewjmc commented 4 months ago

The first if possible: get_episode() outcome but then with all arguments available in first_isolate() and I'm not averse to doing the work myself (just I didn't quickly grasp how the first isolate selection worked and I didn't spot the call to get_episode).

andrewjmc commented 4 months ago

Having further reviewed the code, and the CLSI M39 guideline, I now understand better that the first_isolate code is applying the various algorithms in a consecutive manner only. This actually means it should be straightforward to put isolates in groups based on the first_isolate result. For my own purpose, I am more interested in a non-consecutive approach for the purpose of summarising an individuals isolates (especially surveillance) to allow the easiest digestible summary of historical results. This is something I will work up myself, with distance based clustering of isolates. I am happy to close this issue if you don't otherwise feel get_episode should support the same options as first_isolate

msberends commented 4 months ago

I’d still like to learn more about this! Maybe it is something that should go in the package.

If you have something developed, would you share the code please? Perhaps then I understand better which angle you’re searching at.

andrewjmc commented 4 months ago

This is what I have implemented with the proxy library to give a custom distance function (S-R = 2, S-H and H-R = 1) cutting a tree at a distance of 2. Notably I think this means that connected isolates (all joined by distance <=2) would be grouped even though the distance between some pairs within would be >2.

Custom distance function for two paired vectors of antibiotics

ab_distance <- function(ab1, ab2){ mapply(function(a1, a2){ case_when( a1==a2 & a1 %in% c("S", "I", "H", "R") ~ 0, (a1=="S" & a2=="R") | (a1=="R" & a2=="S") ~ 2, a1%in%c("S", "R") & a2%in%c("I","H") ~ 1, .default=NA) }, ab1, ab2, SIMPLIFY=TRUE) %>% sum(na.rm=TRUE) }

Make the groups based on a data.frame of antibiotics for a given set of isolates

ab_groups <- function(d){ ab_cols <- lapply(d, function(x){"sir" %in% class(x)}) %>% unlist() if(nrow(d)>1){ distmat <- dist(d[,ab_cols], method=ab_distance, auto_convert_data_frames=FALSE) d$group<- hclust(distmat)%>% cutree(h=2) }else{ d$group<-1 } d }