mjwestgate / revtools

Tools to support research synthesis in R
https://revtools.net

Consider manual screening interface (i.e. text only) #12

Open mjwestgate opened 6 years ago

mjwestgate commented 6 years ago

revtools provides tools for visualising topic model information, but some users may wish (or be required) to sort articles based on titles or abstracts without including any visual information. A user interface for this would be simple to build, and would provide support for a wider range of users.

aornugent commented 5 years ago

Great talk yesterday Martin! This use-case was exactly what sprang to mind.

I wrote a quick function to aggregate and rank documents by similarity.

doc_rank <- function(lda, dtm, select = c(1), method = "term"){

  # Combine selected documents
  ngroup <- length(select)
  if(ngroup > 1){
    group <- colSums(dtm[select, ])
    dtm[select, ] <- rep(group, each = ngroup)
  }

  # Back-transform LDA coefs.
  beta <- exp(lda@beta)

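  # Weight docs by topic or by term x topic
  if(method == "topic"){
    x <- dtm %*% t(beta)
  }
  else{
    # Scale each term's column of beta (topics x terms) by the doc's count for that term
    w <- apply(dtm, 1, function(x) sweep(beta, 2, x, "*"))
    x <- t(w)
  }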

  # Calculate cosine dissimilarity
  c_dis <- 1 - x %*% t(x) / (sqrt(rowSums(x^2) %*% t(rowSums(x^2))))

  # Normalise across docs for symmetrical ranking (?desirable)
  d <- as.matrix(dist(c_dis))

  # Use first selected doc as reference point
  ref <- select[1]

  # Rank documents
  doc_list <- data.frame(doc_id = 1:nrow(dtm), rank = rank(d[ref, ]))

  return(doc_list[order(doc_list$rank), ])
}
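
As a quick check of the cosine step, here is a minimal base-R sketch on a made-up 3 x 2 weight matrix (the matrix `x` and its values are assumptions for illustration, not output from `run_LDA`). It suggests that the cosine dissimilarity matrix is already symmetric with a zero diagonal, so the later `dist()` call is an extra, second-order step rather than something needed for symmetry.

```r
# Hypothetical document weights: 3 documents x 2 features
x <- matrix(c(1, 0,
              1, 1,
              0, 2), nrow = 3, byrow = TRUE)

# Row norms, then cosine dissimilarity between every pair of rows
norms <- sqrt(rowSums(x^2))
c_dis <- 1 - (x %*% t(x)) / (norms %o% norms)

# Euclidean distance between rows of the dissimilarity matrix,
# as in the "Normalise across docs" step of doc_rank()
d <- as.matrix(dist(c_dis))

# Rankings relative to document 1 under each measure
rank_cosine <- rank(c_dis[1, ])
rank_dist <- rank(d[1, ])
```

In this toy case both rankings put the documents in the same order; that need not hold for real data.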

With a little tweaking to refine the action loop, a typical workflow might be:

Screen title, authors -> Read abstract -> Mark if relevant -> Sort document list.

which should hopefully bubble the relevant papers to the top.

library(revtools)

file_location <- system.file("extdata",
  "avian_ecology_bibliography.ris",
  package="revtools")

x <- read_bibliography(file_location)

d <- make_DTM(x)
l <- run_LDA(d)

# Doc 6 is the most similar to 1, Doc 16 the least.
doc_rank(l, d, c(1))

# But if I like Doc 16, I should read Doc 9 next.
doc_rank(l, d, c(16))

mjwestgate commented 5 years ago

Thanks Andrew, I'm glad you liked the talk! This is a great idea; my only caveats are how to:

  1. update this as the user selects more and more articles, and
  2. avoid biasing the user away from relevant research that uses different keywords

At the moment, my plan is to add a neural-network-based method for prioritising articles in screen_titles or screen_abstracts, probably based on the approach of Roll et al. 2017 (https://onlinelibrary.wiley.com/doi/abs/10.1111/cobi.13044). But that won't be in v0.3.0 as I don't have time to test it right now!

Thanks heaps for the code too - this is a really good start that will help me out a lot.

aornugent commented 5 years ago

No problem, I was mostly just playing:

  1. The first block (# Combine selected documents) treats all selected documents as a single reference point. So you'd just update after every selection, or have a button to re-sort.

  2. This is harder. Back-transforming the weights means that documents aren't strongly penalised for having a term that isn't associated with a topic (beta ~ 0, instead of log_beta ~ -9; you could switch this if you wanted different behaviour). Pooling documents should capture a more diverse vocabulary as you progress, and the overall similarity would tend towards the words that different documents have in common. Serendipity is difficult to code.
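
To make the "update after every selection" idea concrete, here is a minimal sketch in base R. It assumes a small made-up document-term matrix and ranks on raw term counts rather than LDA weights; `cosine_to_ref()` and `rerank()` are hypothetical helpers, not revtools functions.

```r
# Hypothetical document-term matrix: 5 documents x 4 terms
dtm <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 3, 1,
    0, 1, 1, 2,
    1, 0, 0, 3),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5), c("bird", "nest", "soil", "fungi"))
)

# Cosine similarity between a reference vector and every row of a matrix
cosine_to_ref <- function(m, ref) {
  as.vector(m %*% ref) / (sqrt(rowSums(m^2)) * sqrt(sum(ref^2)))
}

# Pool the selected documents into one reference point, then order the
# remaining documents from most to least similar
rerank <- function(dtm, selected) {
  pooled <- colSums(dtm[selected, , drop = FALSE])
  sim <- cosine_to_ref(dtm, pooled)
  candidates <- setdiff(seq_len(nrow(dtm)), selected)
  candidates[order(sim[candidates], decreasing = TRUE)]
}

selected <- 1
first_pass <- rerank(dtm, selected)   # ranking after marking doc 1

selected <- c(selected, 5)            # user also marks doc 5 as relevant
second_pass <- rerank(dtm, selected)  # re-sort with the pooled reference
```

With these toy counts, doc 2 leads the first pass, but once doc 5 is pooled in, doc 4 moves ahead of doc 2 because the reference vocabulary has broadened to include "fungi".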

But this is far from tested! Be interesting to think about how you'd validate it.

edit: I wonder if you could subtract irrelevant documents from the reference group? Not sure what that'd look like, but it might help narrow the search in a more granular manner.
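
One way this might look, as a rough sketch only: subtract the pooled term counts of rejected documents from those of accepted ones before ranking. This is similar in spirit to Rocchio relevance feedback from information retrieval, not something tested in this thread. The toy matrix, the helper `rank_with_feedback()` and its `penalty` argument are all assumptions for illustration.

```r
# Hypothetical document-term matrix: 5 documents x 4 terms
dtm <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 3, 1,
    0, 1, 1, 2,
    1, 0, 0, 3),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5), c("bird", "nest", "soil", "fungi"))
)

# Reference = pooled relevant docs minus (penalty x pooled irrelevant docs).
# Negative entries are kept, so terms typical of rejected docs count against
# a candidate. Note: an all-zero reference would give NaN similarities.
rank_with_feedback <- function(dtm, relevant, irrelevant = integer(0),
                               penalty = 1) {
  ref <- colSums(dtm[relevant, , drop = FALSE]) -
    penalty * colSums(dtm[irrelevant, , drop = FALSE])
  sim <- as.vector(dtm %*% ref) / (sqrt(rowSums(dtm^2)) * sqrt(sum(ref^2)))
  candidates <- setdiff(seq_len(nrow(dtm)), c(relevant, irrelevant))
  candidates[order(sim[candidates], decreasing = TRUE)]
}

baseline <- rank_with_feedback(dtm, relevant = 1)
penalised <- rank_with_feedback(dtm, relevant = 1, irrelevant = 5)
```

With these toy counts, doc 4 sits ahead of doc 3 in the baseline, but drops below it once doc 5 is rejected, because doc 4 shares "fungi" with the rejected document.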

befriendabacterium commented 5 years ago

FYI metagear's abstract screener does this already, albeit in a bit of a fiddly and inflexible way. But just to avoid duplicating that function.