Please cite the R Journal article:
@article{RJ-2014-011,
author = {Mark P.J. van der Loo},
title = {{The stringdist Package for Approximate String Matching}},
year = {2014},
journal = {{The R Journal}},
doi = {10.32614/RJ-2014-011},
url = {https://doi.org/10.32614/RJ-2014-011},
pages = {111--122},
volume = {6},
number = {1}
}
The package offers the following main functions:
stringdist computes pairwise distances between two input character vectors (the shorter one is recycled).
stringdistmatrix computes the distance matrix for one or two vectors.
stringsim computes a string similarity between 0 and 1, based on stringdist.
amatch is a fuzzy matching equivalent of R's native match function.
ain is a fuzzy matching equivalent of R's native %in% operator.
afind finds the location of fuzzy matches of a short string in a long string.
seq_dist, seq_distmatrix, seq_amatch and seq_ain compute distances between, and perform matching of, integer sequences (see also the hashr package).
These functions are built upon C code that re-implements some common (weighted) string distance functions. Distance functions include:
Also, there are some utility functions:
qgrams() tabulates the q-grams in one or more character vectors.
seq_qgrams() tabulates the q-grams (sometimes called n-grams) in one or more integer vectors.
phonetic() computes phonetic codes of strings (currently only soundex).
printable_ascii() is a utility function that detects non-printable ASCII or non-ASCII characters.
As of version 0.9.5.0 you can call a number of stringdist functions directly from the C code of your R package. The description of the API can be found by typing ?stringdist_api in the R console, or by opening the package help index, choosing "User guides, package vignettes and other documentation" and clicking on doc/stringdist_api.pdf. The location of that file in your installation is given by system.file("doc/stringdist_api.pdf", package="stringdist").
Examples of packages that link to stringdist can be found here and here.
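To make the function overview above more concrete, here is a minimal sketch in R, assuming the package is installed. The input strings are made up for illustration, and the argument names used (method, maxDist, q) and the method name "lv" (Levenshtein) follow the package documentation rather than this text. Only a selection of the functions is shown.

```r
library(stringdist)

# Pairwise Levenshtein distances; the shorter argument ("kitten") is recycled.
d <- stringdist("kitten", c("sitting", "mitten"), method = "lv")

# Distance matrix between two character vectors
# (rows follow the first argument, columns the second).
m <- stringdistmatrix(c("foo", "bar"), c("foo", "baz", "boo"), method = "lv")

# Similarity between 0 and 1; identical strings score 1.
s <- stringsim("hello", c("hello", "hallo"), method = "lv")

# Fuzzy counterparts of match() and %in%: accept candidates
# whose distance to the query is at most maxDist.
idx   <- amatch("leia", c("uhura", "leela"), maxDist = 2)
found <- ain("leia", c("uhura", "leela"), maxDist = 2)

# Utility functions: tabulate 2-grams and compute soundex codes.
qg    <- qgrams("hello world", q = 2)
codes <- phonetic(c("Robert", "Rupert"))

print(d)
print(m)
print(s)
```

Here amatch returns the position of the closest acceptable entry in the lookup table (or NA when nothing is within maxDist), mirroring how match behaves for exact matches.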
To install the latest release from CRAN, open an R terminal and type
install.packages('stringdist')
To obtain the package from the very latest source code, open a bash terminal (or git bash if you work under Windows with msysgit) and type
git clone https://github.com/markvanderloo/stringdist.git
cd stringdist
bash ./build.bash
R CMD INSTALL output/stringdist_*.tar.gz
Warning: the GitHub version can change at any time and may not even build properly. As most of the code is written in C, the development version may crash your R session.
The following arguments have been obsolete since 2015 and were removed in the 0.9.5.0 release (spring 2018):
cluster for function stringdistmatrix.
maxDist for functions stringdist and stringdistmatrix (not amatch).
ncores for function stringdistmatrix.
Parallelization used to be based on R's parallel package, which works by spawning several R sessions in the background. As of version 0.9.0, stringdist uses the more efficient OpenMP protocol to parallelize everything under the hood.
The following arguments became obsolete and were originally scheduled for removal in 2016:
cluster for function stringdistmatrix.
maxDist for functions stringdist and stringdistmatrix (not amatch).
ncores for function stringdistmatrix.
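As a rough sketch of what the OpenMP-based parallelization described above looks like from the user's side: in current releases the number of threads is set through an nthread argument, whose default is read from the sd_num_thread option, rather than through cluster or ncores. Both of these names come from the package documentation and not from this text, and the input strings are made up for illustration.

```r
library(stringdist)

x <- c("apple", "apply", "ample", "maple")

# With a single argument, stringdistmatrix returns a 'dist' object
# holding all pairwise distances within x.
# nthread sets the number of OpenMP threads used for the computation;
# the result should not depend on it.
d1 <- stringdistmatrix(x, method = "osa", nthread = 1)
d2 <- stringdistmatrix(x, method = "osa", nthread = 2)

# The default thread count is stored in an option set when the package loads.
default_threads <- getOption("sd_num_thread")

print(as.matrix(d1))
print(default_threads)
```

Because threading happens inside the C code, no extra R sessions are started and no cluster object has to be created or shut down.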