Open ahuds001 opened 3 years ago
Agreed that it should be uni-directional.
Now that I have had more time to think about it, the main difference between the two approaches isn't filtering vs. indexing. The date range you proposed to match on is simply different from mine. Mine only includes explicitly known hire and left dates, whereas yours also takes the absence of events in a time period into consideration. Is that correct? If so, the question becomes which date range algorithm we should use, and I think we should use your proposal since it is more exhaustive. Whether we then use filtering or indexing, it should still produce the same number of pairs for matching.
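To make the difference between the two date range definitions concrete, here is a rough sketch. It reflects my reading of the two proposals, not code from the repo: the function names and the `hire_date`, `left_date`, and `event_dates` parameters are all made up for illustration.

```python
from datetime import date
from typing import Optional, Sequence, Tuple

DateRange = Tuple[Optional[date], Optional[date]]


def explicit_range(hire_date: Optional[date], left_date: Optional[date]) -> DateRange:
    """Use only the explicitly recorded hire and left dates (hypothetical helper)."""
    return (hire_date, left_date)


def observed_range(
    hire_date: Optional[date],
    left_date: Optional[date],
    event_dates: Sequence[date],
) -> DateRange:
    """Fall back to the first/last observed event when a boundary is unknown.

    Assumption: no events before the first or after the last observed event
    means the officer was not active outside that window.
    """
    known = [d for d in (hire_date, left_date, *event_dates) if d is not None]
    if not known:
        return (None, None)
    start = hire_date if hire_date is not None else min(known)
    end = left_date if left_date is not None else max(known)
    return (start, end)


events = [date(2016, 7, 15), date(2015, 3, 1), date(2015, 11, 20)]
print(explicit_range(None, None))          # no usable range at all
print(observed_range(None, None, events))  # range inferred from the events
```

With no hire or left date on file, the explicit version gives us nothing to compare on, while the event-based version still yields a usable window, which is why it should be the more exhaustive of the two.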
Indeed, a correct implementation should produce the same number of pairs whether we use filtering or indexing, so the speed gain of either approach is probably minimal. It therefore comes down to taste, and to which building block we should build that is useful not just for this problem but can also be combined to solve other problems. For the sake of simplicity, I prefer to just use an indexing class for now and achieve as much as we can before adding new concepts.
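As a sanity check on the "same pairs either way" claim, here is a small standalone sketch in plain Python. It does not use any of our existing index classes; the record layout `(id, start, end)`, the sample dates, and both function names are assumptions for illustration only.

```python
from bisect import bisect_right
from datetime import date
from itertools import product

# Hypothetical records: (record id, first active date, last active date)
left = [
    ("a1", date(2010, 1, 1), date(2014, 6, 30)),
    ("a2", date(2012, 1, 1), date(2018, 1, 1)),
]
right = [
    ("b1", date(2015, 1, 1), date(2019, 1, 1)),
    ("b2", date(2013, 1, 1), date(2016, 1, 1)),
    ("b3", date(2018, 6, 1), date(2020, 1, 1)),
]


def pairs_by_filtering(left, right):
    """Enumerate every pair, then drop those that violate the time constraint."""
    return {(l[0], r[0]) for l, r in product(left, right) if l[2] < r[1]}


def pairs_by_indexing(left, right):
    """Sort the right side by start date and look up valid candidates directly."""
    ordered = sorted(right, key=lambda r: r[1])
    starts = [r[1] for r in ordered]
    pairs = set()
    for l in left:
        # Position of the first right record that starts strictly after l ends.
        first = bisect_right(starts, l[2])
        pairs.update((l[0], r[0]) for r in ordered[first:])
    return pairs


print(pairs_by_filtering(left, right) == pairs_by_indexing(left, right))
```

Both functions apply the same rule (the left record must end before the right record starts), so they return the same set. The indexing version just avoids generating the pairs that would be thrown away.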
We don't necessarily need to achieve simplicity and a reusable building block right away either. We just need to add a new index class that does what we want and improve/iterate from there. I do think that a new MultiIndex class would be a good building block to add to our arsenal.
Wrote a simple test using the event data to see how impactful this could be. Here are the steps:
The time censoring reduced the number of possible matches by over 80% on average, which makes me think that we should definitely see some benefit from this even if the median officer only has 3 events to their name.
Do you want to take the first stab at writing a new index for this?
Either way, we should wait: I'm making some breaking changes to add deduplication functionality (#2). The index should be significantly simpler after this.
Carrying this over from this conversation: https://github.com/ppact/processing/issues/1
The overview of the problem is as follows. We are looking to match two datasets that have time-series data that require exclusivity. Below are a few examples:
We have outlined two possible approaches. They are listed in order of simplicity/priority:
@pckhoi, one question. There is a directionality to all of this based on time, and I think we should make this uni-directional, correct? That is to say, in our case a possible match means that a record in dataset 1 transferred to dataset 2 (going forwards), but we shouldn't also get a possible match from a record in dataset 2 having come from dataset 1 (going backwards).
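To illustrate what I mean by uni-directional, here is a minimal sketch. `Stint`, `is_possible_transfer`, and `candidate_pairs` are hypothetical names, and the strict "ends before the other starts" rule is an assumption (same-day transfers might call for `<=` instead).

```python
from datetime import date
from typing import List, NamedTuple, Tuple


class Stint(NamedTuple):
    """Hypothetical record: one officer's active window in one dataset."""
    record_id: str
    start: date
    end: date


def is_possible_transfer(source: Stint, target: Stint) -> bool:
    # Forward only: the source stint must end before the target stint begins.
    return source.end < target.start


def candidate_pairs(sources: List[Stint], targets: List[Stint]) -> List[Tuple[str, str]]:
    return [
        (s.record_id, t.record_id)
        for s in sources
        for t in targets
        if is_possible_transfer(s, t)
    ]


dataset_1 = [Stint("d1-7", date(2010, 1, 1), date(2014, 12, 31))]
dataset_2 = [Stint("d2-3", date(2015, 2, 1), date(2019, 5, 1))]

print(candidate_pairs(dataset_1, dataset_2))  # forwards: dataset 1 -> dataset 2
print(candidate_pairs(dataset_2, dataset_1))  # backwards: should yield nothing
```

Because the check is asymmetric, the argument order decides the direction, and the backwards call returns no candidates for this pair.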
I think that's everything from our convo, let me know if I missed anything.