improvements - Githubissues

jwijffels commented 3 years ago

I think it would be beneficial to have the possibility to have a predict function as well. This would basically mean that you decompose the function in a

data preparation function (text cleaning)
the output of the function can contain the doc2vec model, umap model and hdbscan
use predict.doc2vec and predict.umap (not sure if there is a predict.hdbscan)

jwijffels commented 3 years ago

I wouldn't mind working on this as I need to have such a model running on some production systems. Would you mind adding me as a developer and provide me rights? Note that I tend to follow a non-tidyverse logic, so that might have an impact on your current setup.

jwijffels commented 3 years ago

thx

michalovadek commented 3 years ago

happy to do that! my ability to work on this fluctuates a lot but I would like to make this into a CRAN submission somewhere down the line

I use a lot of tidyverse functions but I have no issue with using base or other packages. It might in any case be necessary to accommodate data types from key dependencies, so the code in the end will be a mix of all sorts of approaches, which is fine

jwijffels commented 3 years ago

I've completely rewritten this as follows: moving out the doc2vec step which seems to me more something that does not need to be run each time. I think being able to work around the limitation of dbscan by allowing to cluster only a set of documents would be beneficial. Also having the option of another clustering algorithm like this one: https://cran.r-project.org/web/packages/opticskxi/vignettes/opticskxi.pdf I'm thinking about adding this to the doc2vec package instead.

#' @title Distributed Representations of Topics
#' @description Perform text clustering by using semantic embeddings of documents and words
#' to find topics in texts which are semantically similar.
#' @param x either an object returned by \code{\link[doc2vec]{paragraph2vec}} or a data.frame 
#' with columns `doc_id` and `text` storing document ids and texts as character vectors.
#' @param control.umap a list of arguments to pass on to \code{\link[uwot]{umap}} for reducing the dimensionality of the embedding space
#' @param control.dbscan a list of arguments to pass on to \code{\link[dbscan]{hdbscan}} for clustering the reduced embedding space
#' @param control.doc2vec optionally, a list of arguments to pass on to \code{\link[doc2vec]{paragraph2vec}} in case \code{x} is a data.frame
#' instead of a doc2vec model trained by \code{\link[doc2vec]{paragraph2vec}}
#' @param umap function to apply UMAP. Defaults to \code{\link[uwot]{umap}}, an as well be \code{\link[uwot]{tumap}}
#' @param trace logical indicating to print evolution of the algorithm
#' @param ... further arguments not used yet
#' @export
#' @references \url{https://arxiv.org/abs/2008.09470}
#' @seealso \code{\link[doc2vec]{paragraph2vec}}
#' @return an object of class \code{top2vec} which is a list with elements
#' \itemize{
#' \item{embedding: a list of matrices with word and document embeddings}
#' \item{doc2vec: a doc2vec model}
#' \item{umap: a matrix of representations of the documents of \code{x}}
#' \item{dbscan: the result of the hdbscan clustering}
#' }
#' @examples 
#' library(word2vec)
#' library(doc2vec)
#' library(uwot)
#' data(be_parliament_nl, package = "top2vec")
#' x      <- data.frame(doc_id = be_parliament_nl$doc_id,
#'                      text   = be_parliament_nl$text,
#'                      stringsAsFactors = FALSE)
#' x$text <- txt_clean_word2vec(x$text)
#' x      <- head(x, n = 10000)
#' d2v    <- paragraph2vec(x, type = "PV-DBOW", dim = 50, 
#'                         lr = 0.05, iter = 10,
#'                         window = 15, hs = TRUE, negative = 0,
#'                         sample = 0.00001, min_count = 5, 
#'                         threads = 1)
#' # write.paragraph2vec(d2v, "dev/d2v.bin")
#' # d2v    <- read.paragraph2vec("dev/d2v.bin")
#' model  <- top2vec(d2v, 
#'                   control.dbscan = list(minPts = 50), 
#'                   control.umap = list(n_neighbors = 15L, n_components = 2), trace = TRUE)
#' model  <- top2vec(d2v, 
#'                   control.dbscan = list(minPts = 50), 
#'                   control.umap = list(n_neighbors = 15L, n_components = 3), umap = tumap, 
#'                   trace = TRUE)
#'                                   
#' info   <- summary(model, top_n = 5)
#' info$topwords
#' 
#' ## Change the model: reduce doc2vec model to 2D
#' model  <- update(model, type = "umap", 
#'                  n_neighbors = 100, n_components = 2, metric = "cosine", umap = tumap, 
#'                  trace = TRUE)
#' ## Change the model: have minimum 200 points for the core elements in the hdbscan density
#' model  <- update(model, type = "hdbscan", 
#'                  minPts = 200, trace = TRUE)
top2vec <- function(x, 
                    control.umap = list(n_neighbors = 15L, n_components = 5L, metric = "cosine"), 
                    control.dbscan = list(minPts = 100L), 
                    control.doc2vec = list(), 
                    umap = uwot::umap,
                    trace = FALSE, ...){
  stopifnot(inherits(x, c("data.frame", "paragraph2vec", "paragraph2vec_trained")))
  if(inherits(x, "data.frame")){
    stopifnot(all(c("doc_id", "text") %in% colnames(x)) && is.character(x$text))
    control.doc2vec$x <- x
    if(trace){
      cat(sprintf("%s building doc2vec model", Sys.time()), sep = "\n")
    }
    model <- do.call(doc2vec::paragraph2vec, control.doc2vec)
  }else{
    stopifnot(inherits(x, c("paragraph2vec", "paragraph2vec_trained")))
    model <- x
  }
  if(trace){
    cat(sprintf("%s extracting doc2vec embeddings", Sys.time()), sep = "\n")
  }
  embedding_docs  <- as.matrix(model, which = "docs")
  embedding_words <- as.matrix(model, which = "words")
  ## UMAP
  if(trace){
    cat(sprintf("%s performing UMAP dimensionality reduction on the doc2vec embedding space", Sys.time()), sep = "\n")
  }
  control.umap$X   <- embedding_docs
  embedding_umap   <- do.call(umap, control.umap)
  ## HDBSCAN
  if(trace){
    cat(sprintf("%s performing HDBSCAN density based clustering", Sys.time()), sep = "\n")
  }
  control.dbscan$x <- embedding_umap
  clusters         <- do.call(dbscan::hdbscan, control.dbscan)
  out <- structure(list(embedding = list(words = embedding_words, docs = embedding_docs),
                        doc2vec = model, 
                        umap = embedding_umap,
                        umap_FUN = umap,
                        dbscan = clusters,
                        size = table(clusters$cluster),
                        k = length(unique(model$dbscan$cluster)),
                        control = list(doc2vec = control.doc2vec, umap = control.umap, dbscan = control.dbscan)), 
                   class = "top2vec")
  out
}

#' @title Update a Top2vec model
#' @description Update a Top2vec model by updating the 
#' @param object an object of class \code{top2vec} as returned by \code{\link{top2vec}}
#' @param type a character string indicating what to udpate. Either 'umap' or 'hdbscan' where the former (type = 'umap') indicates to 
#' update the umap as well as the hdbscan procedure and the latter (type = 'hdbscan') indicates to update only the hdbscan step.
#' @param umap see \code{umap} argument in \code{\link{top2vec}}
#' @param trace logical indicating to print evolution of the algorithm
#' @param ... further arguments either passed on to \code{\link[dbscan]{hdbscan}} in case type is 'hdbscan' or to \code{\link[uwot]{umap}}
#' in case type is 'umap'
#' @return an updated top2vec object
#' @export
update.top2vec <- function(object, type = c("umap", "hdbscan"), umap = object$umap_FUN, trace = FALSE, ...){
  type <- match.arg(type)
  if(type == "umap"){
    t2vec <- top2vec(object$doc2vec, control.umap = list(...), control.dbscan = object$control$dbscan, trace = trace, umap = umap)
  }else if(type == "hdbscan"){
    if(trace){
      cat(sprintf("%s performing HDBSCAN density based clustering", Sys.time()), sep = "\n")
    }
    t2vec <- object
    t2vec$dbscan <- dbscan::hdbscan(object$umap, ...)
  }
  t2vec
}

#' @title Basic text clearning
#' @description Basic text clearning
#' @param x a character vector to clean
#' @param type either 'default' or 'stringr'
#' @export
#' @return the character vector, leading/trailing spaces removed as well as punctuations and digits
#' @examples 
#' x <- c("some . tests   abc 90210", "++was  yes tokenisation  engine123")
#' top2vec_standardise(x, type = "default")
#' top2vec_standardise(x, type = "stringr")
top2vec_standardise <- function(x, type = c("default", "stringr")){
  type <- match.arg(type)
  if(requireNamespace("stringr") && !type %in% "default"){
    x <- stringr::str_squish(x)
    x <- stringr::str_trim(x)
    x <- stringr::str_remove_all(x, "[[:punct:][:digit:]]+")
    x <- stringr::str_to_lower(x)  
  }else{
    x <- trimws(x)
    x <- gsub("[[:space:]]+", " ", x)
    x <- gsub("[[:punct:][:digit:]]+", " ", x)
    x <- trimws(x)
    x <- gsub("[[:space:]]+", " ", x)
    x <- tolower(x)
  }
  x
}

#' @title Get summary information of a top2vec model
#' @description Get summary information of a top2vec model
#' @export
summary.top2vec <- function(object, type = "", top_n = 10, ...){
  topic_idx       <- split(x = seq_along(object$dbscan$cluster), f = object$dbscan$cluster)
  topic_centroids <- lapply(topic_idx, FUN = function(i) colMeans(object$embedding$docs[i, , drop = FALSE]))
  topic_medoids   <- lapply(topic_idx, FUN = function(i) apply(object$embedding$docs[i, , drop = FALSE], MARGIN = 2, FUN = median))
  #topic_medoids <- do.call(rbind, topic_medoids)
  topwords <- lapply(topic_centroids, FUN = function(topic){
    doc2vec::paragraph2vec_similarity(y = object$embedding$words, x = topic, top_n = top_n)
  })
  list(topwords = topwords, centroids = topic_centroids, medoids = topic_medoids)
}

michalovadek commented 3 years ago

go ahead, I had to go back to Python for my applications due to 1) having more than 50 000 docs; 2) having docs longer than 1000 words; 3) and needing to apply additional hierarchical clustering, so I think it might be a while before I could get an R version that would do the same. There is also the more long-term problem of supporting pre-trained transformer models, which I suspect will become even more important with time

jwijffels commented 3 years ago

I've added the implementation to the R doc2vec package: https://github.com/bnosac/doc2vec

michalovadek / top2vecr

improvements #1