markvanderloo / stringdist

String distance functions for R

stringdist

Citing

Please cite the R Journal article:

@article{RJ-2014-011,
  author = {Mark P.J. van der Loo},
  title = {{The stringdist Package for Approximate String Matching}},
  year = {2014},
  journal = {{The R Journal}},
  doi = {10.32614/RJ-2014-011},
  url = {https://doi.org/10.32614/RJ-2014-011},
  pages = {111--122},
  volume = {6},
  number = {1}
}

Functionality

The package offers the following main functions:

These functions are built upon C-code that re-implements some common (weighted) string distance functions. Distance functions include:

Also, there are some utility functions:
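As a brief illustration of the kind of calls the package supports, here is a minimal sketch. It assumes the exported functions `stringdist`, `stringdistmatrix` and `amatch`, and the method name `"lv"` for Levenshtein distance; the input strings are made up for the example.

```r
library(stringdist)

# Distance between two strings, using Levenshtein distance ("lv")
d <- stringdist("kitten", "sitting", method = "lv")

# Matrix of distances between all pairs drawn from two character vectors
m <- stringdistmatrix(c("foo", "bar"), c("baz", "boo"), method = "lv")

# Approximate matching: position of the closest entry in the lookup table,
# or NA when no entry lies within maxDist
idx <- amatch("leia", c("uhura", "leela"), maxDist = 2)

print(d)
print(m)
print(idx)
```

Here `d` is a numeric vector of length one, `m` has one row per element of the first argument and one column per element of the second, and `idx` is an integer index into the lookup table.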

C API

As of version 0.9.5.0 you can call a number of stringdist functions directly from the C code of your R package. The description of the API is installed with the package; the following call returns the path to it:

system.file("doc/stringdist_api.pdf", package="stringdist")

Examples of packages that link to stringdist can be found here and here.

Installation

To install the latest release from CRAN, open an R terminal and type:

install.packages('stringdist')

To obtain the package from the very latest source code, open a bash terminal (or Git Bash if you work under Windows with msysgit) and type:

git clone https://github.com/markvanderloo/stringdist.git
cd stringdist
bash ./build.bash
R CMD INSTALL output/stringdist_*.tar.gz

Warning: the GitHub version can change at any time and may not even build properly. As most of the code is written in C, the development version may crash your R session.

Resources

Note to users: deprecated arguments removed as of version 0.9.5.0

The following arguments have been obsolete since 2015 and were removed in the 0.9.5.0 release (spring 2018):

Note to users: deprecated arguments as of >= 0.9.0, >= 0.9.2

Parallelization used to be based on R's parallel package, which works by spawning several R sessions in the background. As of version 0.9.0, stringdist uses the more efficient OpenMP API to parallelize everything under the hood.
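To give an idea of how the thread count can be controlled from R, here is a minimal sketch. It assumes the `nthread` argument of `stringdist`; the input vectors are made up, and on a build without OpenMP support the computation simply runs on a single thread.

```r
library(stringdist)

a <- c("foo", "bar", "baz")
b <- c("boo", "bat", "bas")

# Same computation on one thread and on two threads
d1 <- stringdist(a, b, method = "lv", nthread = 1)
d2 <- stringdist(a, b, method = "lv", nthread = 2)

# The number of threads affects speed only, not the result
print(identical(d1, d2))
```

For vectors this small the extra threads bring no benefit; the argument matters mainly for long input vectors or large distance matrices.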

The following arguments have become obsolete and will be removed sometime in 2016: