ropensci / textreuse

Detect text reuse and document similarity
https://docs.ropensci.org/textreuse

Parallelize lsh_compare() #69

Open lmullen opened 9 years ago

retrography commented 4 years ago

Would it be possible to parallelize lsh_compare()? On large corpora the number of comparisons can quickly become very large.

I managed to do this using %dopar% from foreach and bind_rows() from dplyr, but I assume there are other ways to do it as well:

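library(foreach)  # provides foreach() and %dopar%
library(dplyr)    # provides bind_rows()

# Assumes a parallel backend is already registered,
# e.g. with doParallel::registerDoParallel().
lshc <- function(candidates, corpus, f) {
  num_rows <- nrow(candidates)
  # Score each candidate pair in parallel, then stack the results.
  bind_rows(
    foreach(i = seq_len(num_rows)) %dopar% {
      a <- candidates$a[i]
      b <- candidates$b[i]
      score <- f(corpus[[a]], corpus[[b]])
      list(a = a, b = b, score = score)
    }
  )
}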

Then I noticed that you have already used mclapply in TextReuseCorpus. So, maybe the same can be done for lsh_compare? Let me know if I can help with that.
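
For illustration, here is a rough sketch of what an mclapply-based version might look like. It is not the package's implementation: lsh_compare_mc is a hypothetical name, and the toy corpus (a plain named list), the jaccard helper, and the candidates data frame are stand-ins made up so the example runs with only the base parallel package. It assumes the candidates have columns a and b holding document ids, as in the foreach version above.

```r
library(parallel)

# Hypothetical helper, not part of textreuse: score each candidate pair
# with f, spreading the pairs across forked worker processes.
lsh_compare_mc <- function(candidates, corpus, f, mc.cores = 2L) {
  # mclapply relies on forking, which is unavailable on Windows.
  if (.Platform$OS.type == "windows") mc.cores <- 1L
  scores <- mclapply(seq_len(nrow(candidates)), function(i) {
    f(corpus[[candidates$a[i]]], corpus[[candidates$b[i]]])
  }, mc.cores = mc.cores)
  candidates$score <- as.numeric(unlist(scores))
  candidates
}

# Toy stand-ins for a corpus and a comparison function (assumptions).
toy_corpus <- list(
  d1 = c("a", "b", "c"),
  d2 = c("b", "c", "d"),
  d3 = c("x", "y", "z")
)
jaccard <- function(x, y) length(intersect(x, y)) / length(union(x, y))
toy_candidates <- data.frame(
  a = c("d1", "d1"), b = c("d2", "d3"), stringsAsFactors = FALSE
)

result <- lsh_compare_mc(toy_candidates, toy_corpus, jaccard)
```

Since each pair is scored independently, the work splits cleanly across cores. With a real TextReuseCorpus, the forked workers would read the corpus without copying it up front, but whether that pays off depends on how expensive f is per pair.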