tiki-deprecated / core

The core data platform infrastructure. Not for direct use.

TIKI DeID #13

Closed: timoguin closed this issue 8 months ago

timoguin commented 10 months ago

Description

We need to develop a system that allows providers and consumers to supply sensitive fields as the basis for generating a TIKI identifier. This will make users in our datasets joinable across providers and data sources without us having to ingest sensitive data.
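A minimal sketch of how such an identifier could be derived without raw PII ever reaching us, assuming SHA-256 over a normalized email; the normalization rules and hash choice are illustrative, not the agreed design:

```python
# Illustrative only: derive a stable identifier from a sensitive field on the
# provider/consumer side, so that only the hash is ever ingested. The
# normalization rules and SHA-256 choice are assumptions, not the spec.
import hashlib


def hashed_identifier(raw_email: str) -> str:
    # Normalize so the same address always produces the same hash,
    # regardless of which provider supplies it.
    normalized = raw_email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two providers supplying the same user independently derive the same value.
assert hashed_identifier("Jane.Doe@Example.com ") == hashed_identifier("jane.doe@example.com")
```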

To be discussed more and scoped.

Issues

Research to-dos, to create corresponding stories

To Implement

mike-audi commented 9 months ago

Potential libs to use for matching:

Common Algorithms for Identity Matching (from ChatGPT)

Identity matching involves comparing different sets of data to determine whether they refer to the same entity or individual. Both probabilistic and deterministic algorithms are commonly used for identity matching, depending on the specific requirements of the application. Here are some common algorithms for identity matching:

Deterministic Matching: exact, rule-based comparison of normalized identifiers (e.g. match records when hashed emails are identical). Fast and precise, but brittle when fields contain typos or are missing.

Probabilistic Matching: scores the similarity of fields (string distances such as Levenshtein or Jaro-Winkler, often combined with Fellegi-Sunter style weights) and declares a match above a threshold. Tolerates noisy data at the cost of tuning (a toy contrast of these first two categories appears after this list).

Machine Learning-Based Matching: trains a model on labeled match/non-match pairs to classify candidate pairs; Zingg is an example of a tool in this category.

Blocking Techniques: not matchers themselves, but a way to partition records into candidate blocks (e.g. by a hash prefix or postal code) so that only records within a block are compared, keeping the pairwise comparison count tractable.

Hybrid Approaches: chain the above, e.g. deterministic matching first, with probabilistic or ML-based matching applied to the remainder.

The choice of algorithm depends on the nature of the data, the required level of accuracy, and the specific use case. Often, a combination of these algorithms or a hybrid approach is used to achieve better results in real-world scenarios.
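As a toy illustration of the first two categories (the threshold and use of difflib are mine, not from the thread): deterministic matching is an exact comparison after normalization, while probabilistic matching scores similarity and applies a cutoff.

```python
# Toy contrast of deterministic vs. probabilistic matching on email fields.
# The 0.9 threshold and difflib scorer are assumptions for this sketch; a real
# system would use per-field weights (Fellegi-Sunter) or a trained model.
from difflib import SequenceMatcher


def deterministic_match(a: str, b: str) -> bool:
    # Exact match after normalization: precise, but brittle to typos.
    return a.strip().lower() == b.strip().lower()


def probabilistic_match(a: str, b: str, threshold: float = 0.9) -> bool:
    # Similarity score in [0, 1]; accept the pair above the threshold.
    score = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return score >= threshold


print(deterministic_match("jane@example.com", "Jane@Example.com"))         # True
print(probabilistic_match("jane.doe@example.com", "janedoe@example.com"))  # True (score ~0.97)
```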

First Steps

mike-audi commented 9 months ago

Here's the AWS product designed to help create this type of matching: https://aws.amazon.com/entity-resolution/

Way too expensive, sadly. This one is out.

mike-audi commented 9 months ago

I'm going to set up a process that starts with simplistic deterministic matching, BUT where the matching logic can be expanded in the future to add probabilistic or, better yet, ML-based matching. Zingg is likely the right fit here.
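One way to keep the matching logic swappable (a sketch under my own naming, not the shipped design) is to hide the strategy behind a small interface that a probabilistic or Zingg-backed implementation could satisfy later:

```python
# Sketch of a pluggable matcher so deterministic matching can later be
# replaced by probabilistic or ML-based (e.g. Zingg-backed) implementations.
# All names here are illustrative.
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class MatchResult:
    de_id: str
    confidence: float  # 1.0 for deterministic matches


class Matcher(Protocol):
    def match(self, hem: str) -> Optional[MatchResult]: ...


class DeterministicMatcher:
    """Exact lookup of a HEM in the existing mapping table."""

    def __init__(self, mapping: dict[str, str]):
        self.mapping = mapping  # hem -> de_id

    def match(self, hem: str) -> Optional[MatchResult]:
        de_id = self.mapping.get(hem)
        return MatchResult(de_id, 1.0) if de_id is not None else None
```

A probabilistic or ML-based matcher would return a confidence below 1.0 but plug into the same call sites.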

A secondary requirement is that the system be usable on both sides of the ocean: we will want to add matching on ingest as well as on customer demand.

The crux of this feature is a single HEM (hashed email) <> DeID table.

| de_id | hem | first_seen | last_seen |
| --- | --- | --- | --- |
| xyz | abc | 00-00-00T00:00:00Z | 00-00-00T00:00:00Z |

It's important to note and support that:

a) multiple HEMs may be matched to a single de_id
b) HEMs become less strictly matched over time (people change their email addresses)
c) multiple de_ids may be matched to a single HEM

When adding records to the mapping table, we need to use MERGE INTO queries: matched records update the last_seen date, and new records are inserted.
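A sketch of that upsert, assuming the table lives in a Spark/Delta Lake environment and is named deid_mapping (both assumptions; the thread only specifies the MERGE INTO semantics):

```python
# Upsert newly observed (de_id, hem) pairs into the mapping table.
# Assumes a Spark session with Delta Lake; table/column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("deid-merge").getOrCreate()

# `updates` holds pairs observed in the current batch, with an observation time.
updates = spark.createDataFrame(
    [("xyz", "abc", "2024-01-01T00:00:00Z")],
    ["de_id", "hem", "seen_at"],
)
updates.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO deid_mapping AS t
    USING updates AS s
    ON t.de_id = s.de_id AND t.hem = s.hem
    WHEN MATCHED THEN
      UPDATE SET t.last_seen = s.seen_at
    WHEN NOT MATCHED THEN
      INSERT (de_id, hem, first_seen, last_seen)
      VALUES (s.de_id, s.hem, s.seen_at, s.seen_at)
""")
```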

mike-audi commented 9 months ago

For customers to upload their matching table, we have 2 options:

1) we treat this the same as our data provider flow (preferred?)
2) we use pre-signed URLs to allow direct uploads to cleanrooms

Assuming 1), we then execute a matching workflow to deliver into the customer's cleanroom a table that contains ALL of their raw data + any de_id matches found.
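A sketch of that delivery step, with table names assumed (the thread doesn't name them): a LEFT join keeps every raw customer row and attaches de_id where the matching workflow found one.

```python
# Deliver the customer's raw upload plus any de_id matches into their cleanroom.
# Table names are illustrative; assumes the upload carries a `hem` column.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("deid-delivery").getOrCreate()

customer = spark.table("customer_upload")
mapping = spark.table("deid_mapping").select("hem", "de_id")

# LEFT join: all customer rows survive; de_id is null where no match exists.
delivery = customer.join(mapping, on="hem", how="left")
delivery.write.mode("overwrite").saveAsTable("customer_cleanroom.delivery")
```

Note that because multiple de_ids can map to a single HEM (point c above), this join can fan out rows; how to collapse those is a separate decision.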

mike-audi commented 8 months ago

Shipped to customer.