quanteda / quanteda.textstats

Textual statistics for quanteda
GNU General Public License v3.0

Introduce edit distance based measures of text similarity? #17

Open jiongweilua opened 5 years ago

jiongweilua commented 5 years ago

Requested feature

@stefan-mueller and I were wondering whether we should introduce edit distance based measures of text similarity to quanteda.

Stefan shared with me a paper by James P. Cross and Henrik Hermansson (2017), in which they document the Levenshtein distance and the DocuToads distance.

I was wondering whether such measures would be a meaningful addition to quanteda, and how widely they are used. @kbenoit

The Levenshtein distance is very well documented, with lots of existing implementations (e.g. stringdist). We can make it quanteda-friendly, and I should not have too much trouble implementing it.
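
For illustration, a minimal sketch of the string-level measure using the stringdist package (the call below is just an example, not a proposed quanteda API):

library(stringdist)
# Levenshtein ("lv") distance: the minimum number of single-character
# insertions, deletions, and substitutions needed to turn one string
# into the other
stringdist("blowin", "blowing", method = "lv")
#> [1] 1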

The DocuToads distance is a bit more challenging, but the online appendix to the Cross and Hermansson paper is very detailed; with enough time, I should be able to work through it.

EDIT: Another paper of interest is Wilkerson et al. (2015).

kbenoit commented 5 years ago

The idea of edit distances for token vectors is an interesting one and could have some substantive applications, such as in the EUP paper. I read the paper, the supplemental information, and the replication file, and unfortunately there is no code showing how they compute the dataset - in other words, how they actually implement "DocuToads". There is https://github.com/ajhhermansson/DocuToads, but it's basically a pair of Python scripts.

The idea that edit distances for individual words mean anything substantive is just flat wrong. The problem in English is just the opposite: there are too many words that are homographs with totally different meanings.

There is already the textreuse package, which unfortunately reimplements the NLP foundations that quanteda already handles so well, but since it only needs to do one thing, that's probably ok.

In short: I'm open-minded but a bit skeptical.

stefan-mueller commented 5 years ago

I have received several emails and requests in the last couple of weeks asking whether these algorithms are included in quanteda or other R packages. So there seems to be demand for this (also judging by the citations of the Wilkerson et al. (2015) paper).

textreuse does include the Smith-Waterman algorithm, but align_local() does not give the score for the entire document pair; it only reports the score of the longest local alignment. Since we have added measures at the token level, it might make sense to add at least the Smith-Waterman algorithm. Adding something like that to quanteda could be useful (and this implementation is not based on edit distances for individual words). What's your opinion?

library(textreuse)
align_local("The answer is blowin' in the wind.",
            "As the Bob Dylan song says, the answer was blowing in the wind.")
#> TextReuse alignment
#> Alignment score: 8 
#> Document A:
#> The answer is ### blowin ####### in the wind
#> 
#> Document B:
#> the answer ## was ###### blowing in the wind

kbenoit commented 5 years ago

OK: what would the function be called, what inputs and arguments would it need, and what would the output look like? I assume it would be a textstat_*() function.
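
Just to make the question concrete, one purely hypothetical shape (none of these names or arguments exist yet; this is only a sketch to discuss):

# Hypothetical sketch only: the actual name, arguments, and return
# type are exactly what needs to be decided in this issue.
textstat_editdist <- function(x, y = NULL,
                              method = c("levenshtein", "hamming",
                                         "smith_waterman"),
                              ...) {
  # would return pairwise document scores, analogous in shape to
  # the output of textstat_simil() / textstat_dist()
}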

koheiw commented 5 years ago

I was also interested in positional distance/similarity measures such as the Hamming distance (which we removed from textstat_dist) and the Levenshtein distance. These measures are computationally intensive, so we will need to write them using RcppParallel, but we can start prototyping in R.
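
For example, here is a rough pure-R prototype of token-level Levenshtein distance that we could start from (the function name and interface are illustrative only):

tokens_levenshtein <- function(a, b) {
  # classic dynamic-programming edit distance, with tokens rather
  # than characters as the elements being compared
  m <- length(a); n <- length(b)
  d <- matrix(0L, m + 1, n + 1)
  d[, 1] <- 0:m  # cost of deleting all tokens of a
  d[1, ] <- 0:n  # cost of inserting all tokens of b
  for (i in seq_len(m)) {
    for (j in seq_len(n)) {
      cost <- if (a[i] == b[j]) 0L else 1L
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1L,  # deletion
                             d[i + 1, j] + 1L,  # insertion
                             d[i, j] + cost)    # substitution
    }
  }
  d[m + 1, n + 1]
}

tokens_levenshtein(c("the", "answer", "is", "blowin", "in", "the", "wind"),
                   c("the", "answer", "was", "blowing", "in", "the", "wind"))
#> [1] 2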

kbenoit commented 5 years ago

This might be interesting: https://stackoverflow.com/questions/55352498/how-to-output-in-r-all-possible-deviations-of-a-word-for-a-fixed-distance-value

utils::adist() is very fast. But string distances for tokens are semantically meaningless and generally useless unless the task is something very practical such as matching proper names that might have alternative spellings. Matching edit distances for texts with tokens as the elements should be our focus.
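
As a sketch of that token-level focus, one quick way to reuse the speed of utils::adist() is to map each unique token to a single character first (a toy illustration, valid only for tiny vocabularies):

tokens_a <- c("the", "answer", "is", "blowin", "in", "the", "wind")
tokens_b <- c("the", "answer", "was", "blowing", "in", "the", "wind")
vocab <- unique(c(tokens_a, tokens_b))
stopifnot(length(vocab) <= length(letters))  # toy assumption: at most 26 types
encode <- function(x) paste(letters[match(x, vocab)], collapse = "")
# now an edit distance over token sequences, not characters
utils::adist(encode(tokens_a), encode(tokens_b))
#>      [,1]
#> [1,]    2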