Entity Embed allows you to transform entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
Using Entity Embed, you can train a deep learning model to transform records into vectors in an N-dimensional embedding space. Thanks to a contrastive loss, those vectors are organized to keep similar records close and dissimilar records far apart in this embedding space. Embedding records enables scalable ANN search, which means finding thousands of candidate duplicate pairs of records per second per CPU.
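To make the idea of searching an embedding space concrete, here is a minimal sketch that does not use the Entity Embed API. It assumes the records have already been turned into vectors (the `toy_vectors` values are made up for illustration) and uses a brute-force cosine-similarity search as a stand-in for a real ANN index. The names `cosine_similarity` and `candidate_pairs`, and the `k` and `threshold` parameters, are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidate_pairs(vectors, k=1, threshold=0.9):
    """Return candidate duplicate pairs as a set of (id_a, id_b), id_a < id_b.

    For each record, keep its k most similar neighbors whose similarity is
    at least `threshold`. This exhaustive loop is O(n^2); an ANN index
    would replace it to scale to large datasets.
    """
    pairs = set()
    for id_a, vec_a in vectors.items():
        scored = sorted(
            (
                (cosine_similarity(vec_a, vec_b), id_b)
                for id_b, vec_b in vectors.items()
                if id_b != id_a
            ),
            reverse=True,
        )
        for similarity, id_b in scored[:k]:
            if similarity >= threshold:
                pairs.add(tuple(sorted((id_a, id_b))))
    return pairs


# Hypothetical 3-dimensional embeddings: records 1 and 2 are near each other,
# records 3 and 4 are near each other, record 5 is far from all of them.
toy_vectors = {
    1: [0.9, 0.1, 0.0],
    2: [0.88, 0.12, 0.01],
    3: [0.0, 0.2, 0.95],
    4: [0.01, 0.18, 0.97],
    5: [0.5, 0.5, 0.5],
}

found = candidate_pairs(toy_vectors, k=1, threshold=0.95)
print(sorted(found))  # [(1, 2), (3, 4)]
```

In a trained model the contrastive loss is what makes nearby vectors correspond to similar records; here the closeness is simply hard-coded in the toy data.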
Entity Embed achieves a recall of ~0.99 with a Pair-Entity ratio below 100 on a variety of datasets. It aims for high recall at the expense of precision, so it is suited to the Blocking/Indexing stage of an Entity Resolution pipeline. A scalable and noise-tolerant Blocking procedure is often the main bottleneck for performance and quality in Entity Resolution pipelines, and this library aims to solve that. Note that the ANN search on embedded records returns several candidate pairs that must then be filtered to find the best matching pairs, possibly with a pairwise classifier (an example for that is available).
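The two blocking metrics mentioned above can be sketched in a few lines. This is an illustration, not the library's own evaluation code. It assumes recall is the fraction of true duplicate pairs that appear among the candidates, and that the Pair-Entity ratio is the number of candidate pairs divided by the number of entities. The function name `blocking_metrics` and the sample pairs are hypothetical.

```python
def blocking_metrics(found_pairs, true_pairs, entity_count):
    """Compute recall, precision and pair-entity ratio for a blocking step.

    Pairs are treated as unordered: (1, 2) and (2, 1) are the same pair.
    """
    found = {tuple(sorted(pair)) for pair in found_pairs}
    truth = {tuple(sorted(pair)) for pair in true_pairs}
    true_positives = len(found & truth)
    return {
        "recall": true_positives / len(truth) if truth else 0.0,
        "precision": true_positives / len(found) if found else 0.0,
        # Assumed definition: candidate pairs per entity in the dataset.
        "pair_entity_ratio": len(found) / entity_count,
    }


# Hypothetical example: 5 entities, 2 true duplicate pairs,
# and a blocking step that returned 4 candidate pairs.
metrics = blocking_metrics(
    found_pairs=[(1, 2), (3, 4), (1, 5), (5, 2)],
    true_pairs=[(1, 2), (3, 4)],
    entity_count=5,
)
print(metrics)
```

In this toy case every true pair is found, so recall is 1.0, while only half of the candidates are true matches. That is the trade-off described above: a later filtering step is expected to remove the false candidates.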
Entity Embed is based on and is a special case of the AutoBlock model described by Amazon.
⚠️ Warning: this project is under heavy development.
https://entity-embed.readthedocs.io
For the full list of dependencies, see requirements.txt.
pip install entity-embed
If you're using Conda, you must install PyTorch beforehand to have proper CUDA support. Inside the Conda environment, run the following command before installing Entity Embed with pip:
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge
Run:
pip install -r requirements-examples.txt
Then check the example Jupyter Notebooks:
Please check notebooks/google-colab/.
See CHANGELOG.md.
This project is maintained by open-source contributors and Vinta Software.
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
Vinta Software is always looking for exciting work, so if you need any commercial support, feel free to get in touch: contact@vinta.com.br
If you use Entity Embed in your research, please consider citing it.
BibTeX entry:
@software{entity-embed,
title = {{Entity Embed}: Scalable Entity Resolution using Approximate Nearest Neighbors.},
author = {Juvenal, Flávio and Vieira, Renato},
url = {https://github.com/vintasoftware/entity-embed},
version = {0.0.6},
date = {2021-07-16},
year = {2021}
}