Open allaway opened 1 year ago
Quick and dirty example:
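# Load packages (pheatmap and tidyr are called via `::` below)
library(stringdist)
library(dplyr)
library(tibble)
library(synapser)
synLogin()
# One row per distinct PI name in the queried table
foo <- synTableQuery('select distinct unnest(studyLeads) as pi from syn16787123')$asDataFrame()
# Pairwise distances; note that method = "jw" with the default p = 0 is the plain Jaro distance
dist <- stringdist::stringdistmatrix(foo$pi, method = "jw") %>%
  as.matrix() %>%
  as_tibble()
pheatmap::pheatmap(dist)
# Label columns and rows with the names, then reshape to one row per pair
colnames(dist) <- foo$pi
dist["pi_1"] <- foo$pi
tidy_names <- tidyr::gather(dist, !contains("pi_1"), key = "pi_2", value = "dist") %>%
  # Drop zero distances (self-pairs); each remaining pair still appears twice, as (a, b) and (b, a)
  filter(dist != 0) %>%
  arrange(dist)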
Which yields:
Interestingly, one of the more prevalent issues appears to be trailing/leading whitespace, probably from older manual copy-pasting...
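Since leading/trailing whitespace accounts for many of the near-duplicates, one option is to normalize whitespace before computing any distances. A minimal base-R sketch; `normalize_ws` and the sample names are hypothetical placeholders, not values from the table:

```r
# Hypothetical helper: collapse runs of whitespace to a single space,
# then strip leading and trailing whitespace.
normalize_ws <- function(x) {
  trimws(gsub("[[:space:]]+", " ", x))
}

# Made-up example values with stray whitespace
raw <- c("Jane Doe", " Jane Doe", "Jane  Doe ", "John Smith")
cleaned <- unique(normalize_ws(raw))
print(cleaned)
```

Variants that differ only in whitespace collapse to a single value, so they no longer show up as low-distance pairs.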
Pairs with a "jw" distance above 0.2 seem to be truly distinct, whereas pairs below 0.2 deserve closer inspection.
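The 0.2 cutoff could be turned into a small filter that returns only the pairs worth a manual look. This is a sketch, not a tested pipeline: `flag_similar` and its `threshold` argument are hypothetical names, the sample strings are made up, and 0.2 is just the rough value observed above. As in the snippets in this thread, `method = "jw"` is used with the stringdist default `p = 0`, which is the plain Jaro distance.

```r
library(stringdist)

# Hypothetical helper: return each unordered pair of distinct strings whose
# "jw" distance is below `threshold`, sorted from most to least similar.
flag_similar <- function(x, threshold = 0.2) {
  x <- unique(x)
  # Full symmetric distance matrix (length(x) by length(x))
  m <- stringdist::stringdistmatrix(x, x, method = "jw")
  # Upper triangle only, so each pair is reported once and self-pairs are skipped
  idx <- which(upper.tri(m) & m < threshold, arr.ind = TRUE)
  out <- data.frame(
    name_1 = x[idx[, "row"]],
    name_2 = x[idx[, "col"]],
    dist = m[idx],
    stringsAsFactors = FALSE
  )
  out[order(out$dist), , drop = FALSE]
}

# Made-up example: the trailing-space variant is flagged, the unrelated name is not
flagged <- flag_similar(c("Jane Doe", "Jane Doe ", "John Smith"))
print(flagged)
```

Restricting to the upper triangle also avoids listing every pair twice, which the `gather()`-based reshaping in this thread does.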
Similarly for institutions:
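# One row per distinct institution name in the queried table
foo <- synTableQuery('select distinct unnest(institutions) as inst from syn16787123')$asDataFrame()
# Pairwise distances between institution names
dist <- stringdist::stringdistmatrix(foo$inst, method = "jw") %>%
  as.matrix() %>%
  as_tibble()
pheatmap::pheatmap(dist)
# Label columns and rows with the names, then reshape to one row per pair
colnames(dist) <- foo$inst
dist["inst_1"] <- foo$inst
tidy_names <- tidyr::gather(dist, !contains("inst_1"), key = "inst_2", value = "dist") %>%
  # Drop zero distances (self-pairs), then sort so the most similar pairs come first
  filter(dist != 0) %>%
  arrange(dist)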
Which yields:
However, this is harder to scan manually: the many high-similarity "University of ..." pairs hide the true matches that need correction. Can you spot them here? ;)
PI names in the screenshot above have been standardized. I picked whichever variant was more recent as the "standard."
We currently do not standardize PI or institution names. It would be helpful to do this on a semi-regular basis.
It would be great to have a function that flags similar strings in the Studies table and to run it as, say, a weekly or quarterly job. Actually fixing the data would probably still require manual intervention.