tiki-deprecated / core

The core data platform infrastructure. Not for direct use.

TIKI DeID #13

Closed: timoguin closed this issue 8 months ago

timoguin commented 10 months ago

Description

We need to develop a system that allows providers and consumers to supply sensitive fields as the basis for generating a TIKI identifier. This will make users in our datasets joinable across providers and data sources without us having to ingest sensitive data.
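A minimal sketch of how such an identifier could be derived without raw PII ever reaching us, assuming SHA-256 over a normalized email; the normalization rules and hash choice are illustrative, not the agreed design:

```python
# Illustrative only: derive a stable identifier from a sensitive field on the
# provider/consumer side, so that only the hash is ever ingested. The
# normalization rules and SHA-256 choice are assumptions, not the spec.
import hashlib


def hashed_identifier(raw_email: str) -> str:
    # Normalize so the same address always produces the same hash,
    # regardless of which provider supplies it.
    normalized = raw_email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two providers supplying the same user independently derive the same value.
assert hashed_identifier("Jane.Doe@Example.com ") == hashed_identifier("jane.doe@example.com")
```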

To be discussed more and scoped.

Issues

Research to-dos, to create corresponding stories

To Implement

mike-audi commented 9 months ago

Potential libs to use for matching:

Common Algorithms for Identity Matching (from ChatGPT)

Identity matching involves comparing different sets of data to determine whether they refer to the same entity or individual. Both probabilistic and deterministic algorithms are commonly used for identity matching, depending on the specific requirements of the application. Here are some common algorithms for identity matching:

Deterministic Matching: exact, rule-based comparison of normalized identifiers (e.g. match records when hashed emails are identical). Fast and precise, but brittle when fields contain typos or are missing.

Probabilistic Matching: scores the similarity of fields (string distances such as Levenshtein or Jaro-Winkler, often combined with Fellegi-Sunter style weights) and declares a match above a threshold. Tolerates noisy data at the cost of tuning (a toy contrast of these first two categories appears after this list).

Machine Learning-Based Matching: trains a model on labeled match/non-match pairs to classify candidate pairs; Zingg is an example of a tool in this category.

Blocking Techniques: not matchers themselves, but a way to partition records into candidate blocks (e.g. by a hash prefix or postal code) so that only records within a block are compared, keeping the pairwise comparison count tractable.

Hybrid Approaches: chain the above, e.g. deterministic matching first, with probabilistic or ML-based matching applied to the remainder.

The choice of algorithm depends on the nature of the data, the required level of accuracy, and the specific use case. Often, a combination of these algorithms or a hybrid approach is used to achieve better results in real-world scenarios.
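As a toy illustration of the first two categories (the threshold and use of difflib are mine, not from the thread): deterministic matching is an exact comparison after normalization, while probabilistic matching scores similarity and applies a cutoff.

```python
# Toy contrast of deterministic vs. probabilistic matching on email fields.
# The 0.9 threshold and difflib scorer are assumptions for this sketch; a real
# system would use per-field weights (Fellegi-Sunter) or a trained model.
from difflib import SequenceMatcher


def deterministic_match(a: str, b: str) -> bool:
    # Exact match after normalization: precise, but brittle to typos.
    return a.strip().lower() == b.strip().lower()


def probabilistic_match(a: str, b: str, threshold: float = 0.9) -> bool:
    # Similarity score in [0, 1]; accept the pair above the threshold.
    score = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return score >= threshold


print(deterministic_match("jane@example.com", "Jane@Example.com"))         # True
print(probabilistic_match("jane.doe@example.com", "janedoe@example.com"))  # True (score ~0.97)
```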

First Steps

mike-audi commented 9 months ago

Here's the AWS product designed to help create this type of matching: https://aws.amazon.com/entity-resolution/

Way too expensive, sadly. This one is out.

mike-audi commented 9 months ago

I'm going to set up a process that starts with simplistic deterministic matching, BUT where the matching logic can be expanded in the future to add probabilistic or, better yet, ML-based matching. Zingg is likely the right fit here.
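One way to keep the matching logic swappable (a sketch under my own naming, not the shipped design) is to hide the strategy behind a small interface that a probabilistic or Zingg-backed implementation could satisfy later:

```python
# Sketch of a pluggable matcher so deterministic matching can later be
# replaced by probabilistic or ML-based (e.g. Zingg-backed) implementations.
# All names here are illustrative.
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class MatchResult:
    de_id: str
    confidence: float  # 1.0 for deterministic matches


class Matcher(Protocol):
    def match(self, hem: str) -> Optional[MatchResult]: ...


class DeterministicMatcher:
    """Exact lookup of a HEM in the existing mapping table."""

    def __init__(self, mapping: dict[str, str]):
        self.mapping = mapping  # hem -> de_id

    def match(self, hem: str) -> Optional[MatchResult]:
        de_id = self.mapping.get(hem)
        return MatchResult(de_id, 1.0) if de_id is not None else None
```

A probabilistic or ML-based matcher would return a confidence below 1.0 but plug into the same call sites.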

A secondary requirement is that the system be usable on both sides of the ocean: we will want to add matching on ingest as well as on customer demand.

The crux of this feature is a single HEM (hashed email) <> DeID table.

| de_id | hem | first_seen | last_seen |
| --- | --- | --- | --- |
| xyz | abc | 00-00-00T00:00:00Z | 00-00-00T00:00:00Z |

It's important to note and support that:

a) multiple HEMs may be matched to a single de_id
b) HEMs become less strictly matched over time (people change their email addresses)
c) multiple de_ids may be matched to a single HEM

When adding records to the mapping table, we need to use MERGE INTO queries: matched records update the last_seen date, and new records are inserted.
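A sketch of that upsert, assuming the table lives in a Spark/Delta Lake environment and is named deid_mapping (both assumptions; the thread only specifies the MERGE INTO semantics):

```python
# Upsert newly observed (de_id, hem) pairs into the mapping table.
# Assumes a Spark session with Delta Lake; table/column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("deid-merge").getOrCreate()

# `updates` holds pairs observed in the current batch, with an observation time.
updates = spark.createDataFrame(
    [("xyz", "abc", "2024-01-01T00:00:00Z")],
    ["de_id", "hem", "seen_at"],
)
updates.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO deid_mapping AS t
    USING updates AS s
    ON t.de_id = s.de_id AND t.hem = s.hem
    WHEN MATCHED THEN
      UPDATE SET t.last_seen = s.seen_at
    WHEN NOT MATCHED THEN
      INSERT (de_id, hem, first_seen, last_seen)
      VALUES (s.de_id, s.hem, s.seen_at, s.seen_at)
""")
```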

mike-audi commented 9 months ago

For customers to upload their matching table, we have 2 options:

1) we treat this the same as our data provider flow (preferred?)
2) we use pre-signed URLs to allow direct uploads to cleanrooms

Assuming 1), we then execute a matching workflow to deliver into the customer's cleanroom a table that contains ALL of their raw data + any de_id matches found.
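A sketch of that delivery step, with table names assumed (the thread doesn't name them): a LEFT join keeps every raw customer row and attaches de_id where the matching workflow found one.

```python
# Deliver the customer's raw upload plus any de_id matches into their cleanroom.
# Table names are illustrative; assumes the upload carries a `hem` column.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("deid-delivery").getOrCreate()

customer = spark.table("customer_upload")
mapping = spark.table("deid_mapping").select("hem", "de_id")

# LEFT join: all customer rows survive; de_id is null where no match exists.
delivery = customer.join(mapping, on="hem", how="left")
delivery.write.mode("overwrite").saveAsTable("customer_cleanroom.delivery")
```

Note that because multiple de_ids can map to a single HEM (point c above), this join can fan out rows; how to collapse those is a separate decision.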

mike-audi commented 8 months ago

Shipped to customer.