pckhoi / datamatch

Utilities for data matching

Matching! #1

Open ahuds001 opened 3 years ago

ahuds001 commented 3 years ago

Carrying this over from this conversation: https://github.com/ppact/processing/issues/1

The overview of the problem is as follows. We are looking to match two datasets containing time-series data that require exclusivity. Below are a few examples (a sketch of the resulting rule follows the list):

  1. A record in dataset 1 with a date of 2016 cannot match to a record in dataset 2 with a date of 2017 if the record in dataset 1 also has dates in 2017 and 2018. The record in dataset 1 is considered exclusive to that dataset from 2016 through 2018.
  2. A record in dataset 1 with a date of 2016 can only match to a record in dataset 2 if the record in dataset 2 has no known history before 2016. We say the record in dataset 1 transferred to dataset 2 in 2016 or later.
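
A minimal sketch of how both rules could collapse into one predicate, assuming each record carries the set of years it appears in (the function name and inputs are hypothetical, not part of datamatch):

```python
def can_match(years_a: set, years_b: set) -> bool:
    """Return True if a record seen in years_a (dataset 1) may have
    transferred to a record seen in years_b (dataset 2).

    Rule 1: the dataset 1 record is exclusive to dataset 1 for every
    year it appears there, so the dataset 2 record cannot overlap that
    span.  Rule 2: the transfer is forward in time, so the dataset 2
    record must have no known history before the dataset 1 record ends.
    Both reduce to: every dataset 2 year comes after the last dataset 1
    year.
    """
    if not years_a or not years_b:
        return False
    return max(years_a) < min(years_b)


# Example: the dataset 1 record spans 2016-2018, so a 2017 record in
# dataset 2 is ruled out (rule 1) while a 2019 record is allowed.
assert can_match({2016, 2017, 2018}, {2017}) is False
assert can_match({2016, 2017, 2018}, {2019}) is True
```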

We have outlined two possible approaches. They are listed in order of simplicity/priority (a sketch of the filtering approach follows the list):

  1. We add a filtering step between indexing and matching. The filter step removes items that cannot possibly match before the matching itself takes place.
  2. We extend the indexing to create a multi-index that includes some form of date-matching logic, as described in the examples above.
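
For the first approach, a rough sketch of what that filter step could look like between the index output and the matcher (the DataFrame shapes and names are assumptions, not datamatch's actual API):

```python
import pandas as pd


def filter_pairs(pairs, events_a, events_b):
    """Drop candidate pairs that violate the temporal rule before any
    similarity scoring happens.

    pairs:    iterable of (id_a, id_b) tuples produced by indexing
    events_a: DataFrame with columns ['record_id', 'date'] for dataset 1
    events_b: same shape for dataset 2
    """
    last_seen = events_a.groupby('record_id')['date'].max()
    first_seen = events_b.groupby('record_id')['date'].min()
    return [
        (a, b)
        for a, b in pairs
        if a in last_seen.index
        and b in first_seen.index
        and last_seen[a] < first_seen[b]
    ]
```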

@pckhoi, one question. There is a directionality to all of this based on time, and I think we should make this uni-directional, correct? That is to say, in our case a possible match means that a record in dataset 1 transferred to dataset 2 (going forwards), but we shouldn't also get a possible match from a record in dataset 2 having come from dataset 1 (going backwards).

I think that's everything from our convo, let me know if I missed anything.

pckhoi commented 3 years ago

Agreed that it should be uni-directional.

Now that I have more time to think about it, the main difference between the 2 approaches isn't about filtering vs indexing. The date range to match on that you proposed is just different from mine. Mine only includes explicitly known hire and left dates, whereas yours also takes the absence of events in a time period into consideration. Is that correct? So the question becomes which date-range algorithm we should use, and I think we should use your proposal since it is more exhaustive. Then, whether we decide to use filtering or indexing, it should still produce the same number of pairs for matching.
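
To make the difference concrete, a toy pandas example (the column names and event kinds are made up): the narrow range uses only explicit hire/left events, while the exhaustive one treats every recorded event as evidence of presence, so records with no hire/left event still get a usable range.

```python
import pandas as pd

events = pd.DataFrame({
    'record_id': [1, 1, 1, 2, 2],
    'kind': ['hire', 'complaint', 'left', 'complaint', 'pay'],
    'date': pd.to_datetime([
        '2015-03-01', '2016-07-15', '2018-01-31',  # record 1
        '2016-05-02', '2017-11-20',                # record 2: no hire/left
    ]),
})

# Narrow range: only explicitly known hire and left dates.
# Record 2 has neither, so it gets no range at all.
narrow = (events[events['kind'].isin(['hire', 'left'])]
          .groupby('record_id')['date'].agg(['min', 'max']))

# Exhaustive range: any event counts, so the absence of events outside
# [min, max] is itself information.  Record 2 now has a 2016-2017 range.
exhaustive = events.groupby('record_id')['date'].agg(['min', 'max'])
```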

Indeed, a correct implementation should produce the same number of pairs whether we use filtering or indexing, so the speed gain of either approach is probably minimal. It therefore comes down to taste and to which building blocks we should create that are useful not just for solving this problem but can also be combined to solve other problems. For the sake of simplicity, I prefer to just use an indexing class for now and try to achieve the most we can before adding new concepts.

We don't necessarily need to achieve simplicity and reusable building blocks right away either. We just need to add a new index class that does what we want and improve/iterate from there. I do think that a new MultiIndex class would be a good building block to add to our arsenal.
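
One possible shape for it, assuming a minimal interface where each index derives a bucket key per record (a sketch of the idea only, not the library's actual interface):

```python
class ColumnIndex:
    """Toy child index: bucket records (plain dicts here) by one field."""

    def __init__(self, column):
        self.column = column

    def key(self, record):
        return record[self.column]


class MultiIndex:
    """Hypothetical composite index: two records land in the same bucket
    only when every child index produces the same key, so each added
    child narrows the set of candidate pairs."""

    def __init__(self, *indices):
        self.indices = indices

    def key(self, record):
        return tuple(idx.key(record) for idx in self.indices)


idx = MultiIndex(ColumnIndex('last_name'), ColumnIndex('birth_year'))
idx.key({'last_name': 'Smith', 'birth_year': 1980})  # ('Smith', 1980)
```

One caveat: key equality alone can't express the strict before/after rule (max date in dataset 1 before min date in dataset 2), so the date logic would still need coarse date buckets or a filtering pass on top.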

ahuds001 commented 3 years ago

Wrote a simple test using the event data to see how impactful this could be. Here are the steps (a rough pandas sketch follows the list):

  1. Split the event table into one agency and all other agencies
  2. Created a time-censored dataset, where a time-censored match is one where the max date from the first dataset is less than the min date from the second
  3. Compared the total number of possible matches vs the number of time-censored matches
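
The sketch (column names are assumptions about the event table's shape):

```python
import pandas as pd


def censoring_reduction(events: pd.DataFrame, agency: str) -> float:
    """Fraction of possible matches removed by time censoring for one
    agency vs all others.  Assumes one row per officer-event with
    columns ['officer_id', 'agency', 'date'].
    """
    # 1. split the event table into one agency and all other agencies
    ours = events[events['agency'] == agency]
    others = events[events['agency'] != agency]

    last = ours.groupby('officer_id')['date'].max().rename('max_date')
    first = others.groupby('officer_id')['date'].min().rename('min_date')

    # every cross pair is a possible match before censoring
    pairs = last.reset_index().merge(first.reset_index(), how='cross',
                                     suffixes=('_a', '_b'))

    # 2. a time-censored match: max date from the first dataset is
    #    strictly less than the min date from the second
    censored = pairs[pairs['max_date'] < pairs['min_date']]

    # 3. compare total possible matches vs time-censored matches
    return 1 - len(censored) / len(pairs)
```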

The time censoring reduced the number of possible matches by over 80% on average, which makes me think that we should definitely see some benefit from this even if the median officer only has 3 events to their name.

pckhoi commented 3 years ago

Do you want to make the first stab at writing a new index for this?

Either way, we should wait: I'm making some breaking changes to add deduplication functionality (#2). Index should be significantly simplified after that.