markvanderloo / stringdist

String distance functions for R

Better distance for sentences using tokenization + better perf in deduplication of > 10K sentences dataset #24

Closed: pommedeterresautee closed this issue 9 years ago

pommedeterresautee commented 9 years ago

In the Python world there is a very well-known library called fuzzywuzzy. A nice description of its features is available there.

I think stringdist already implements the basics needed to build similar features, which would be very useful in two cases:

From my understanding, it only needs a small layer on top of what already exists in stringdist to provide, for instance, a distance between sentences that is not sensitive to word order. Do you think it would be easy to implement?
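For example, something in the spirit of fuzzywuzzy's token_sort_ratio looks like only a few lines on top of stringdist. A rough sketch of what I have in mind (the helper name and the choice of "jw" are mine, just to illustrate the idea):

library(stringdist)

# sketch: sort the tokens of each sentence before measuring the distance,
# so that word order no longer matters
token_sort_dist <- function(a, b, method = "jw") {
  normalize <- function(s) {
    tokens <- strsplit(tolower(s), "[[:space:]]+")
    vapply(tokens, function(t) paste(sort(t), collapse = " "), character(1))
  }
  stringdist(normalize(a), normalize(b), method = method)
}

token_sort_dist("mise en securite eclairage", "eclairage mise en securite")  # 0: same tokens, different order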

Kind regards, Michael

pommedeterresautee commented 9 years ago

I just found 2 papers which may be of interest:

I can't guarantee their quality as I have not fully read them yet, but I will probably do so this weekend.

markvanderloo commented 9 years ago

Hi Michaël, thanks for your comments. I'm traveling these weeks, so I'll come back to you after that.

Cheers, Mark

pommedeterresautee commented 9 years ago

Have a good trip!

I will post my findings here so you can read them when you come back.

I read both papers. I hadn't noticed at first, but they are by the same authors; the first (2014) is an improvement on the second (2012).

An implementation of the second one is available on Berkeley servers.

The purpose is both to compare sentences at the word level and to limit the complexity of the computation when comparing many sentences against each other.

The way they do things is pretty smart: they use n-grams to filter out strings which have nothing in common. This replaces the blocking rules of other methods I have read about or tried. The papers are full of mathematical proofs that I don't really care about, but I'm happy to know they did them.
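Just to illustrate the filtering idea with what stringdist already offers (this is not the paper's algorithm, only a toy version of the "no shared n-gram, no comparison" principle):

library(stringdist)

s <- c("mise en securite eclairage",
       "mise en place eclairage",
       "pompe a chaleur")

# a q-gram Jaccard distance of exactly 1 means the two strings share no 2-grams at all,
# so those pairs can be dropped before any finer (and more expensive) comparison
d <- stringdistmatrix(s, s, method = "jaccard", q = 2)
candidates <- which(d < 1 & upper.tri(d), arr.ind = TRUE)
candidates  # index pairs worth a closer, more expensive look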

The first paper (the more recent one) adds a layer on top of the earlier one, using TF-IDF to increase the weight of rare tokens in the distance measure, which makes the fuzzy matching smarter. I am very happy with that because it was my next step anyway (I was thinking of the way the LDA algorithm works). So the papers are impressive, and this team seems to have real experience in fuzzy matching.
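To make the weighting idea concrete for myself, here is a toy version in R (nothing like the paper's implementation, just plain IDF weights on whole tokens and a weighted overlap):

# IDF weight per token, computed over the whole corpus of sentences:
# rare tokens such as "eclairage" get a higher weight than ubiquitous ones such as "en"
idf_weights <- function(corpus) {
  tokens <- strsplit(tolower(corpus), "[[:space:]]+")
  df <- table(unlist(lapply(tokens, unique)))   # number of sentences containing each token
  setNames(log(length(corpus) / as.numeric(df)), names(df))
}

# weighted Jaccard-style similarity between two sentences, given the IDF weights
weighted_token_sim <- function(a, b, w) {
  ta <- unique(strsplit(tolower(a), "[[:space:]]+")[[1]])
  tb <- unique(strsplit(tolower(b), "[[:space:]]+")[[1]])
  sum(w[intersect(ta, tb)]) / sum(w[union(ta, tb)])
}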

I will try the available source code tomorrow.

Kind regards, Michaël

pommedeterresautee commented 9 years ago

Time to share some news.

I have tried the code posted for Fast Join. First, there is no source code, just binaries for Windows and Linux, so I have contacted the author to ask whether he can share the C++ source code.

So I tried the exe and looked at the results. Comparing 30K strings (most of them short sentences of 2-3 words) took from 1 to 10 minutes depending on the parameters. I used 0.6 for both parameters, and the Jaccard distance (the meaning of the parameters is explained in the paper). These parameters are pretty low, which is what I wanted.

The result is a text file that looks like this:

0.6 8735 2013
mise en securite eclairage
mise en securite retourneuse

0.6 8735 1205
mise en securite eclairage
mise en place eclairage

0.6 8735 6870
mise en securite eclairage
mise en securite passerelle

0.6875 8736 8081
remise en conformite de la
mise en conformite de

0.709845 8736 6305
remise en conformite de la
mise en conformit de la ligne

The first number is the distance; the other two are IDs.

I have written some code to clean the file and turn it into an easy-to-read data.frame:

library(data.table)
library(magrittr)
library(igraph)

# the output file is made of blocks of 4 lines: coefficient, first string, second string, blank
data <- readLines("./input/result.txt") %>% data.table(data = .)

# label each line by its role inside the block, then drop the blank separators
labels <- rep(c("Coeff", "First", "Second", "empty"), times = nrow(data)/4)

data <- data[, label := labels][label != "empty"]

# reshape to one row per pair of similar strings
data2 <- data.table(First = data[label == "First", data], Second = data[label == "Second", data], Coeff = data[label == "Coeff", data])

Then I realized that this is a graph: A looks like B, B looks like E, and so on.

So I wrote some code to cluster the sentences:

# build an undirected graph where each pair of similar sentences is an edge
g <- graph.data.frame(data2, directed = FALSE, vertices = NULL)
# connected components: sentences linked directly or indirectly fall into the same cluster
l <- clusters(g)

b <- data.table(n = unique(data2[, c(First, Second)]), c = l$membership)
# for each sentence: its cluster id (c) and the size of its cluster (N)
c <- b[, .(n, .N), by = c]

In the c data.table you have the size and the ID of the cluster for each sentence. The line to get b works, but I am fully aware it is not a clean way to do it.
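A cleaner way is probably to take the vertex names straight from the graph, which are in the same order as the membership vector, something like:

# V(g)$name follows the vertex order used by clusters(), so no manual matching is needed
b <- data.table(n = V(g)$name, c = l$membership)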

It's really dirty code, but it's just so you can have a look at your own data and see what kind of results you get.

IMHO it would be a nice addition to your library. The algorithm is powerful, the paper is full of math (guaranteeing that it works in theory) but not hard to understand (I am not a math guy), it works well without human input or pre-trained data, and there are only two parameters to guess (0.8 almost always gives good results). Let me know if you are interested. If not, don't worry, I will stop annoying you :-)

Just curious: I looked at your source code; is there a reason you are not using Rcpp for the C++ part?

Kind regards, Michael

pommedeterresautee commented 9 years ago

I forgot to post an example of a cluster obtained with the graph strategy:

> c[N ==115]
      c                               n   N
  1: 51    reseau air comprime materiel 115
  2: 51    reseau air comprime fabricat 115
  3: 51  racord air comprime en reseaux 115
  4: 51     reseau air comprime ligne b 115
  5: 51    alimentation petrin grany en 115
 ---                                       
111: 51                      air compri 115
112: 51          devidoirs air comprime 115
113: 51            secheur air comprime 115
114: 51         collecteur air comprime 115
115: 51 air comprimé montage aménagemen 115

You can see that most of them share the string "air comprime" (which is exactly what I wanted). It makes sense because the implementation uses IDF to highlight rare n-grams shared among tokens. The 2014 paper goes deeper in this direction (that is its main addition over the earlier one), but the available exe implements the 2012 paper. I imagine the 2014 results would be very impressive...

pommedeterresautee commented 9 years ago

To follow up on my previous messages: I have made some interesting progress.

Regarding the way the Fast Join parameters work: delta has a high impact on performance (which makes sense, since this parameter defines how many pairs of strings are compared). I have found that you don't need to go below 0.8. When you switch to 0.7, runtime increases by more than 1000% while you only gain about 10% new pairs (measured with tau = 0.4, which is very low).

Second, regarding how to interpret the network of strings: I found some sub-networks containing very different strings (mainly the 2 or 3 biggest groups). To limit that, I used the distance between pairs of strings as the weight of the edge between them and applied a community detection algorithm. The idea is that within a sub-network there can be several communities where connections are denser. Using that strategy, even my biggest groups are very homogeneous.

In igraph several strategies are possible; the one I chose works best for my dataset: it is fast and gives good-quality results. From my understanding of the paper, it is a kind of Gibbs sampling applied to the network.

To make all these ideas concrete, below is my new script (things are cleaner than before):

library(data.table)
library(magrittr)
library(igraph)
library(stringr)

# same parsing as before: blocks of 4 lines (coefficient, first string, second string, blank)
data <- readLines("./FixedAssetsOptimization/input/output.txt") %>% data.table(data = .)

labels <- rep(c("Coeff", "First", "Second", "empty"), times = nrow(data)/4)

data <- data[, label := labels][label != "empty"]

# keep only the numeric coefficient (first field of the "Coeff" line) and coerce it to numeric
data2 <- data.table(First = data[label == "First", data], Second = data[label == "Second", data], Coeff = as.numeric(data[label == "Coeff", data] %>% str_split_fixed(pattern = " ", n = 3) %>% .[, 1]))

graph <- graph.data.frame(data2, directed = FALSE, vertices = NULL)

# http://www.sixhat.net/finding-communities-in-networks-with-r-and-igraph.html
# http://bommaritollc.com/2012/06/summary-community-detection-algorithms-igraph-0-6/?utm_source=rss&utm_medium=rss&utm_campaign=summary-community-detection-algorithms-igraph-0-6
# the steps parameter impacts the size of the communities found;
# the similarity coefficient is used as the edge weight
communities <- walktrap.community(graph, weights = data2[, Coeff], steps = 3)

clusters <- membership(communities)

b <- data.table(n = names(clusters), comm = as.numeric(clusters))
c <- b[, .(n, .N), by = comm]

# a few diagnostics: number of communities, variance and mean of their sizes, the 10 biggest
unique(b[, comm]) %>% length
b[, .N, by = comm][, var(N)]
b[, .N, by = comm][, mean(N)]
b[, .N, by = comm][order(N, decreasing = TRUE)][1:10]

# weakest links
# data2[order(Coeff)][1:10]

Nothing crazy, but it works well. I'll let you check it when you are back.

Kind regards, Michael

markvanderloo commented 9 years ago

OK, there are a lot of questions and remarks here, so let me answer them one by one.

Regarding the overhead in stringdistmatrix (raised in #23): this has been bothering me for some time, and actually it's not only the diagonal that causes overhead: we only need to compute the lower triangle when arguments a and b are the same. It has been in my head for a while and I've posted it as issue #25. I will probably first add it as a feature and then adapt the underlying C code, pending updates on the C interface I'm working on in the interface branch.
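The idea, sketched at the R level just to be explicit (not the eventual implementation, which will live in the C code):

library(stringdist)

# compute only the pairs (i, j) with j < i and mirror the result,
# roughly halving the work when comparing a vector against itself
lower_tri_dist <- function(a, method = "osa") {
  n <- length(a)
  idx <- which(lower.tri(matrix(0, n, n)), arr.ind = TRUE)   # all pairs with j < i
  d <- stringdist(a[idx[, 1]], a[idx[, 2]], method = method)
  m <- matrix(0, n, n, dimnames = list(a, a))
  m[idx] <- d
  m[cbind(idx[, 2], idx[, 1])] <- d                          # fill the upper triangle by symmetry
  m
}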

Regarding not using Rcpp (raised in #23): I am actually not using C++ at all, but sticking strictly to the C99 standard (I often have a copy of the latest standard document open while I'm working). Let me elaborate on why I'm using neither C++ nor Rcpp -- after all, I could use Rcpp with C99 code as well.

The reason I'm not using C++ is that (1) I do not need many of the OO-like features that C++ offers; I'm implementing fairly simple and basic algorithms, and I've only implemented a few basic structs. (2) Debugging memory leaks with e.g. valgrind is harder with C++ than with C99 (I think because of the name mangling that happens during C++ compilation). It's partly a matter of taste and familiarity (I'm more familiar with C than with C++, although I have used C++ in the past), but also to a high degree a YAGNI issue. Moreover, C is faster than C++, although that shouldn't really matter in this case.

The reason I'm not using Rcpp is also a YAGNI issue. I think the (truly awesome) Rcpp package serves two main use cases:

1: one-off projects where you quickly want to speed up a few loops using C/C++ code, or 2: cases where you want to handle complex R objects from C.

Neither of these applies to stringdist. stringdist is not a one-off project (so I need to carefully consider what [not] to use), and the arguments I'm passing to the underlying C code are very simple: character vectors and a couple of numbers. In line with wanting to keep the package as lightweight as possible, I therefore chose not to use Rcpp. Note also that the stringi package, which interfaces a large C++ library (ICU), doesn't depend on Rcpp either, I assume for more or less similar reasons.

I find the overhead of writing a few lines calling some macros from R not a problem at all. Having said that, the current code has accumulated a lot of R-interface code, simply because the number of distance functions has grown over time. As said above, I'm cleaning up the interface so that adding extensions at the C level will become easier. I find this necessary also because it's my ambition to at some point release the C code as a library separate from R.

markvanderloo commented 9 years ago

Regarding record linkage and deduplication, I fully agree. However, as (probabilistic) record matching is an extensive topic in itself, I feel it deserves a separate package that might depend on stringdist. The RecordLinkage package, which recently re-appeared on CRAN, is dedicated to this, but at the moment it implements only a single string distance (Levenshtein) and does not handle non-ASCII encodings very well. By the way, Jan van der Laan wrote a nice blog post on probabilistic matching with an example here.

As I am a strong supporter of the Unix ideal of 'one tool for one job, doing it really well', stringdist aims to do just that. However, I feel there is room for fuzzy-join functionality in R on top of stringdist: think of an amerge, aunique, or aduplicated. It should be carefully designed, however, taking into account for example the increasing popularity of data.table objects (thanks, in part, to their extensive promotion in RStudio's / Hadley's packages).
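Just to sketch the sort of interface I have in mind (hypothetical names and semantics, nothing of this exists yet):

library(stringdist)

# hypothetical aduplicated: flag elements whose distance to an earlier element is at most maxDist
aduplicated <- function(x, maxDist = 1, method = "osa", ...) {
  d <- stringdistmatrix(x, x, method = method, ...)
  dup <- logical(length(x))
  for (i in seq_along(x)[-1]) {
    dup[i] <- any(d[i, seq_len(i - 1)] <= maxDist)
  }
  dup
}

# hypothetical aunique built on top of it
aunique <- function(x, maxDist = 1, ...) x[!aduplicated(x, maxDist = maxDist, ...)]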

markvanderloo commented 9 years ago

Regarding tokenization: I have not read the papers yet, so I have to get back to you on that (I am technically still enjoying a very nice holiday in Brazil at the moment :-) ). Most of the stuff I saw here from the fuzzywuzzy package should indeed be fairly easy to implement, based on R's strsplit (or better, stringi's comparable functionality) combined with stringdist. I don't yet have enough of an overview to see whether it should be part of stringdist or of a package that depends on it.

pommedeterresautee commented 9 years ago

First, thanks a lot for taking the time to answer during your holidays. Of course there is absolutely no urgency to reply quickly; I just post my ideas here so you can follow my progress if you want to.

Regarding your points:

To finish, the things I have posted using graph theory are just a way to leverage string distances to perform deduplication. I fully agree it should live in a separate package, but it is so basic right now that it might even be better presented as a use case in a dedicated vignette of stringdist. I did that for another package dealing with machine learning, and people seem to like that vignette.

The link you posted about probabilistic matching is unfortunately not working.

To conclude, enjoy your holidays!

Kind regards, Michael

markvanderloo commented 9 years ago

The deduplication is now built into the development version. I'm diverting support for integer vectors to #31.