Identity matching involves comparing different sets of data to determine whether they refer to the same entity or individual. Common approaches range from deterministic rule-based matching to probabilistic and ML-based matching, depending on the specific requirements of the application.
The choice of algorithm depends on the nature of the data, the required level of accuracy, and the specific use case. Often, a combination of these algorithms or a hybrid approach is used to achieve better results in real-world scenarios.
Here's the AWS product designed to help with this type of matching: https://aws.amazon.com/entity-resolution/
Way too expensive. Sadly, this one is out.
I'm going to set up a process that starts with simplistic deterministic matching, BUT where the matching logic can be expanded in the future to add probabilistic or, better yet, ML-based matching; Zingg is likely the right fit there.
A secondary requirement is that the system should be usable on both sides of the ocean: we will want to add matching on ingest as well as on customer demand.
The crux of this feature is a single HEM <> DeID table.
| de_id | hem | first_seen | last_seen |
|---|---|---|---|
| xyz | abc | 00-00-00T00:00:00Z | 00-00-00T00:00:00Z |
It's important to note and support that:
a) multiple HEMs may be matched to a single de_id
b) HEM matches become weaker over time, since people change their email addresses
c) multiple de_ids may be matched to a single HEM
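A minimal DDL sketch of that mapping table, assuming a Snowflake/Spark-style warehouse; the table name `hem_deid_map` and the column types are assumptions:

```sql
-- Sketch only: the HEM <> DeID mapping table (names from the example above, types assumed).
CREATE TABLE IF NOT EXISTS hem_deid_map (
    de_id      STRING    NOT NULL,  -- TIKI de-identified ID
    hem        STRING    NOT NULL,  -- hashed email (HEM)
    first_seen TIMESTAMP NOT NULL,  -- when this (de_id, hem) pairing was first observed
    last_seen  TIMESTAMP NOT NULL   -- when this pairing was most recently observed
    -- no uniqueness on de_id or hem alone: the relationship is many-to-many (a and c above)
);
```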
When adding records to the mapping table, we need to use MERGE INTO queries so that matched records have their last_seen date updated and new records are inserted.
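A sketch of that upsert, assuming the `hem_deid_map` table above plus a hypothetical `new_matches` staging table holding the (de_id, hem, observed_at) pairs produced by the matching step:

```sql
-- Sketch only: upsert a batch of newly observed (de_id, hem) pairs.
-- `new_matches` and its `observed_at` column are hypothetical staging names.
MERGE INTO hem_deid_map AS t
USING new_matches AS s
    ON  t.de_id = s.de_id
    AND t.hem   = s.hem
WHEN MATCHED THEN
    UPDATE SET last_seen = s.observed_at        -- existing pairing: bump last_seen
WHEN NOT MATCHED THEN
    INSERT (de_id, hem, first_seen, last_seen)  -- new pairing: first_seen = last_seen = observed_at
    VALUES (s.de_id, s.hem, s.observed_at, s.observed_at);
```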
For customers to upload their matching table, we have 2 options:
1) we treat this the same as our data provider flow (preferred?)
2) we use pre-signed URLs to allow direct uploads to cleanrooms
Assuming 1), we execute a matching workflow to deliver into the customer's cleanroom a table that contains ALL of their raw data + any de_id matches found.
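A sketch of that delivery step, assuming the customer's upload lands as a `customer_upload` table with a `hem` column (both names are assumptions about the upload schema):

```sql
-- Sketch only: the customer's raw rows plus any de_id matches found.
-- `customer_upload` and `customer_delivery` are hypothetical table names.
CREATE TABLE customer_delivery AS
SELECT
    c.*,         -- ALL of the customer's raw data
    m.de_id      -- NULL where no match was found
FROM customer_upload AS c
LEFT JOIN hem_deid_map AS m
    ON c.hem = m.hem;
```

Because a single HEM can map to multiple de_ids (point c above), this join can fan out rows; whether to deduplicate (e.g. keep the match with the most recent last_seen) still needs to be decided.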
shipped to customer.
Description
We need to develop a system that allows providers and consumers to supply sensitive fields as the basis for generating a TIKI identifier. This will make users in our datasets joinable across providers and data sources without us having to ingest sensitive data.
To be discussed more and scoped.
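For illustration only, provider-side hashing could look like the sketch below, so that raw emails never reach us; the normalization and the SHA-256 choice are assumptions rather than an agreed TIKI spec, and `provider_input` is a hypothetical source table:

```sql
-- Illustration only: derive a HEM from a sensitive field on the provider side.
-- Normalization + SHA-256 is an assumed scheme, not the agreed TIKI spec.
SELECT
    SHA2(LOWER(TRIM(email)), 256) AS hem
FROM provider_input;
```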
Issues
Research to-do to create corresponding stories
To Implement