move-coop / parsons

A python library of connectors for the progressive community.
https://www.parsonsproject.org/

Add IDRT connector for duplicate contact detection #934

Open Jason94 opened 7 months ago

Jason94 commented 7 months ago

This PR adds a Parsons connector for the IDRT (Identity Resolution Transformer) library. IDRT is an open-source library that I wrote (https://github.com/Jason94/identity-resolution); it uses neural networks to match duplicate contacts in a database.

If any reviewers want to test the code, they can use these model files to run the example scripts in the documentation: models.zip

The connector provides simple function calls into the two steps of the main algorithm that the library exposes. This algorithm is designed to run directly against a large dataset of contacts stored in a database (Redshift, BigQuery, etc.). It uses the database during several intermediate steps to reduce execution time.

The connector does not provide an easy way to quickly match against a Parsons Table containing contact data. The library is built to do matching at scale, so that is where I have focused so far. At some point I'd like to add a simpler function that takes a Parsons Table of contact data plus some basic configuration and searches for matches among the rows of the table. If reviewers think that is likely to be a common use case, I can add it to the PR before merging.

The connector also does not provide any way to train the neural networks. Training is a much more advanced task than using an existing model, and I didn't see how adding anything to Parsons would make it easier.

Notes for reviewers:

  1. The IDRT library pulls in some pretty hefty deep learning libraries as dependencies. I really did not want to add those as default dependencies to Parsons. I noticed that anything listed as a Parsons "extra" still gets installed by default unless you have the limited dependencies option turned on. I modified the Parsons dependency mechanism in the setup.py file to allow truly optional dependencies that must be installed explicitly, in this case by running pip install parsons[idr].
  2. The code is pretty lightweight, all things considered. It's mostly wrapping the calls to the library function in our standard environment variable conventions and providing documentation. The library uses PETL, so it's easy to convert to and from a Parsons Table.
  3. The connector includes an adapter to use any Parsons DatabaseConnector with the algorithm. It checks that the database object defines an upsert function, since upsert currently isn't a standard part of the DatabaseConnector interface. It should currently work for Redshift and BigQuery.
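
To make note 1 concrete for reviewers, here is a minimal sketch of how "truly optional" extras could be separated from extras that are folded into the default install. It is not the code in this PR: the function name `build_requirements`, the `OPT_IN_ONLY` set, and the package names in `EXTRAS` are all placeholders I made up for illustration, and the real setup.py may be organized differently.

```python
# Hypothetical extras table; the package names are placeholders only.
EXTRAS = {
    "redshift": ["psycopg2-binary"],
    "idr": ["idrt"],
}

# Extras that are never installed by default and must be requested
# explicitly, e.g. with `pip install parsons[idr]`. (Assumed name.)
OPT_IN_ONLY = {"idr"}


def build_requirements(extras, opt_in_only, limited):
    """Return (install_requires, extras_require) for a setup() call.

    `limited` stands in for the limited dependencies option: when it is
    on, no extra is installed by default.
    """
    # Every extra stays individually installable in both modes.
    extras_require = {name: list(pkgs) for name, pkgs in extras.items()}

    if limited:
        return [], extras_require

    # Default mode: fold every extra into the base install,
    # except the ones marked as opt-in only.
    install_requires = sorted(
        {
            pkg
            for name, pkgs in extras.items()
            if name not in opt_in_only
            for pkg in pkgs
        }
    )
    return install_requires, extras_require
```

The two return values would be passed to `setuptools.setup()` as `install_requires` and `extras_require`. The point of the sketch is only that the opt-in extras are filtered out of the default install while remaining available by name.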
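
For note 3, the upsert check can be pictured roughly as below. This is a sketch under assumptions, not the adapter in this PR: the class name `DatabaseAdapter`, the error type `UpsertNotSupportedError`, and the pass-through `upsert` method are hypothetical, and the real adapter does more than forward one call.

```python
class UpsertNotSupportedError(TypeError):
    """Raised when the wrapped database object has no usable upsert()."""


class DatabaseAdapter:
    """Hypothetical wrapper around a Parsons database connector.

    It fails fast at construction time if the connector cannot upsert,
    because upsert is not guaranteed by the DatabaseConnector interface.
    """

    def __init__(self, db):
        upsert = getattr(db, "upsert", None)
        if not callable(upsert):
            raise UpsertNotSupportedError(
                f"{type(db).__name__} does not define an upsert() method"
            )
        self._db = db

    def upsert(self, *args, **kwargs):
        # Forward to the connector unchanged; argument names differ
        # between connectors, so none are assumed here.
        return self._db.upsert(*args, **kwargs)
```

Checking at construction time means an unsupported connector is rejected before any matching work starts, rather than partway through a run.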