nf-osi / nfportalutils

Utilities for NF Portal project and data management
https://nf-osi.github.io/nfportalutils/
MIT License

Add function to flag similar strings #75

Open allaway opened 1 year ago

allaway commented 1 year ago

We currently do not standardize PI or institution names. It would be helpful to do this on a semi-regular basis.

It would be great to have a function that flags similar strings in the Studies table, and to run it as, say, a weekly or quarterly job. Actually fixing the data would probably still require manual intervention.

allaway commented 1 year ago

Quick and dirty example:

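library(stringdist)
library(dplyr)
library(tibble)
library(synapser)
synLogin()

# Distinct PI names from the Studies table
foo <- synTableQuery('select distinct unnest(studyLeads) as pi from syn16787123')$asDataFrame()

# Pairwise string distances (0 = identical, 1 = nothing in common).
# Note: method = "jw" with the default p = 0 is the plain Jaro distance;
# pass p > 0 for the Winkler prefix adjustment.
dist <- stringdist::stringdistmatrix(foo$pi, method = "jw") %>%
  as.matrix() %>%
  as_tibble()

# Visual overview of the full distance matrix
pheatmap::pheatmap(dist)

colnames(dist) <- foo$pi
dist["pi_1"] <- foo$pi

# Long format, closest pairs first; dist == 0 drops each name paired with itself.
# The matrix is symmetric, so every pair shows up twice (A-B and B-A).
tidy_names <- tidyr::gather(dist, !contains("pi_1"), key = "pi_2", value = "dist") %>%
  filter(dist != 0) %>%
  arrange(dist)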

Which yields:

[Screenshot: PI name pairs ordered by distance]

Interestingly, one of the more prevalent issues appears to be trailing/leading whitespace, probably from older manual copy-pasting...
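
One way to take the whitespace issue out of the comparison is to normalize the strings first. This is a minimal base R sketch; `normalize_ws` is a hypothetical helper name and the PI names are made up for illustration.

```r
# Hypothetical helper: normalize whitespace before comparing names.
normalize_ws <- function(x) {
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace to one space
  trimws(x)                          # drop leading/trailing whitespace
}

pis <- c("Jane Doe", " Jane Doe", "Jane  Doe ", "John Smith")  # made-up names
cleaned <- normalize_ws(pis)

# Values that differ from their cleaned form are candidates for correction
needs_fix <- pis[pis != cleaned]
unique(cleaned)
```

Running the distance calculation on the cleaned values should leave only the variants that differ in more than whitespace.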

Pairs with a j-w distance above 0.2 seem to be truly distinct, whereas pairs below 0.2 seem to deserve closer inspection.
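
The 0.2 cutoff could be wrapped into a small flagging function along these lines. This is only a sketch: `flag_similar` and its arguments are hypothetical (nothing like it exists in nfportalutils yet), the default threshold is just the value observed above, and the example names are made up.

```r
library(stringdist)

# Hypothetical helper (not part of nfportalutils): return pairs of distinct
# strings whose distance is below `threshold`, closest pairs first.
flag_similar <- function(x, threshold = 0.2, method = "jw") {
  x <- unique(x[!is.na(x)])
  d <- stringdist::stringdistmatrix(x, x, method = method)
  # Upper triangle only, so each pair is reported once
  idx <- which(upper.tri(d) & d < threshold, arr.ind = TRUE)
  out <- data.frame(
    string_1 = x[idx[, "row"]],
    string_2 = x[idx[, "col"]],
    dist = d[idx],
    stringsAsFactors = FALSE
  )
  out[order(out$dist), , drop = FALSE]
}

# Made-up names for illustration
flagged <- flag_similar(c("Jane Doe", "Jane Doe ", "Jane Do", "John Smith"))
flagged
```

A scheduled job could call something like this on the `studyLeads` values and post the flagged pairs for manual review.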

allaway commented 1 year ago

Similar for institutions:

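# Distinct institution names from the Studies table
# (assumes the libraries and synLogin() from the previous comment)
foo <- synTableQuery('select distinct unnest(institutions) as inst from syn16787123')$asDataFrame()

# Pairwise string distances (0 = identical, 1 = nothing in common)
dist <- stringdist::stringdistmatrix(foo$inst, method = "jw") %>%
  as.matrix() %>%
  as_tibble()

# Visual overview of the full distance matrix
pheatmap::pheatmap(dist)

colnames(dist) <- foo$inst
dist["inst_1"] <- foo$inst

# Long format, closest pairs first; dist == 0 drops each name paired with itself
tidy_names <- tidyr::gather(dist, !contains("inst_1"), key = "inst_2", value = "dist") %>%
  filter(dist != 0) %>%
  arrange(dist)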

yields:

[Screenshot: institution name pairs ordered by distance]

However, this isn't as easy to scan manually, because all of the high-similarity University of ... matches hide the true matches, i.e. the values that actually need correction. Can you spot them here? ;)

[Screenshot: institution matches]

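One possible way to reduce the University of ... noise, offered as an untested idea rather than something tried on the Studies table, is to drop very common institution words before computing distances. `strip_common` and its stopword list are hypothetical, and the institution names are only illustrative.

```r
library(stringdist)

# Hypothetical helper: remove very common words so that names are not scored
# as similar merely because they share "University of".
strip_common <- function(x, stopwords = c("university", "of", "the")) {
  pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", tolower(x), perl = TRUE)
  trimws(gsub("[[:space:]]+", " ", x))  # tidy up leftover whitespace
}

inst <- c("University of Utah", "University of Iowa", "The University of Utah")
key <- strip_common(inst)

# Distance on the raw names vs. on the stripped keys
raw_dist <- stringdist::stringdist(inst[1], inst[2], method = "jw")
key_dist <- stringdist::stringdist(key[1], key[2], method = "jw")
```

With these inputs the two different universities fall under the 0.2 cutoff on the raw names but not on the stripped keys, while the two spellings of the same university collapse to the same key. The stopword list would need tuning against the real data.
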
allaway commented 1 year ago

PI names in the screenshot above have been standardized. I picked whichever one was more recent as the "standard."