kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage
Question: window blocking conditional on another variable #44

Open froukehe0 opened 4 years ago

froukehe0 commented 4 years ago

Background: We would like to use fastLink to link data on road crashes from police and hospital records. In the blocking phase, we would like to set a window for the time between the crash as registered by the police and the time the patient arrived in the hospital. However, this window is likely to depend on the severity of the injuries: patients with milder injuries can take more time to arrive in the hospital than patients with severe injuries.

Question: Would it be possible to include dependencies between variables while blocking using fastLink? Can the window size be dependent on the value of a second variable? (in our case, we would like to have a smaller window size for more severe injuries)

Many thanks!

tedenamorado commented 4 years ago

Hi,

I hope all is well. Sadly, our blocking function is not flexible enough to allow window sizes that depend on another blocking variable. This is a good suggestion and something that we will definitely implement in the next release of fastLink.

That said, we are super excited to hear that you guys are planning to use fastLink and would be more than happy to help in any way we can.

One suggestion would be to block in two stages. Stage 1: separate those injuries that are considered mild from those that are not. If you have one variable from which you can infer the severity, you can use the following code:

block_severity <- blockData(police_data, hospital_data, varnames = "severity")

This assumes that the variable "severity" is present in both datasets and the possible values it can take are the same (e.g., mild injuries, severe injuries, etc).
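Before blocking, it may be worth checking that assumption. The following is a minimal base-R sketch, using small made-up data frames in place of police_data and hospital_data (the values are hypothetical), that lists severity levels present in only one of the two datasets:

```r
# Hypothetical toy data standing in for police_data and hospital_data.
police_data   <- data.frame(severity = c("mild", "severe", "Severe"))
hospital_data <- data.frame(severity = c("mild", "severe"))

# Compare the distinct severity levels as plain character vectors.
police_levels   <- unique(as.character(police_data$severity))
hospital_levels <- unique(as.character(hospital_data$severity))

# Levels that appear in one dataset but not in the other.
only_in_police   <- setdiff(police_levels, hospital_levels)
only_in_hospital <- setdiff(hospital_levels, police_levels)

print(only_in_police)
print(only_in_hospital)
```

Any level reported here (in the toy data, the differently capitalised "Severe") would need to be recoded first, since exact blocking on severity can only pair records whose values agree.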

Once you have created the blocks (for simplicity let's assume severity only takes the two values I mentioned above), we would need to save the new data:

## Subset police data into blocks
police_block_1 <- police_data[block_severity$block.1$dfA.inds, ]
police_block_2 <- police_data[block_severity$block.2$dfA.inds, ]

## Subset hospital into blocks
hospital_block_1 <- hospital_data[block_severity$block.1$dfB.inds, ]
hospital_block_2 <- hospital_data[block_severity$block.2$dfB.inds, ]

Then comes Stage 2. Here we block on time separately within each pair of severity subsets. Something like this could work:

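## Window blocking on time within the first severity block
blocks_time_severity_1 <- blockData(police_block_1, hospital_block_1, varnames = c("time"),
                           window.block = "time", window.size = 75)

## Window blocking on time within the second severity block; window.size
## can be set to a different value here (e.g., smaller for severe injuries)
blocks_time_severity_2 <- blockData(police_block_2, hospital_block_2, varnames = c("time"),
                           window.block = "time", window.size = 75)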

Note that in the code above, I am assuming that there is a variable named time that contains the date and time of the accident in the police data and the date and time of arrival in the hospital data. I am also assuming that you are OK with grouping observations that are 75 minutes apart. This approach might result in the creation of many blocks (as there are 24 hours in a day).

After this step, you can use the indexes stored in blocks_time_severity_1 and blocks_time_severity_2 to conduct your merges.
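To make the idea of a severity-dependent window concrete, here is a minimal base-R sketch that does not use fastLink at all. It assumes time is stored as a number of minutes since a common origin; the toy data, the window sizes, and the helper candidate_pairs() are all hypothetical and only illustrate which pairs would survive such a rule.

```r
# Hypothetical toy data: time is in minutes since a common origin.
police <- data.frame(
  id       = c("p1", "p2", "p3"),
  severity = c("severe", "mild", "mild"),
  time     = c(100, 200, 900)
)
hospital <- data.frame(
  id       = c("h1", "h2", "h3", "h4"),
  severity = c("severe", "severe", "mild", "mild"),
  time     = c(125, 190, 310, 2000)
)

# Assumed window sizes in minutes, one per severity level.
window_by_severity <- c(severe = 30, mild = 120)

# Keep pairs that share a severity level and whose absolute time
# difference is within the window assigned to that level.
candidate_pairs <- function(police, hospital, windows) {
  pairs <- merge(police, hospital, by = "severity",
                 suffixes = c(".police", ".hospital"))
  gap   <- abs(pairs$time.hospital - pairs$time.police)
  limit <- windows[as.character(pairs$severity)]
  pairs[gap <= limit, ]
}

pairs <- candidate_pairs(police, hospital, window_by_severity)
print(pairs[, c("severity", "id.police", "id.hospital")])
```

The merge builds every within-severity pair, so this is only practical for small data; with blockData the same effect would come from passing a different window.size for each severity block.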

Please, if anything, do not hesitate to reach out.

Stay well,

Ted

froukehe0 commented 4 years ago

Dear Ted,

Many thanks for this extensive answer, which is of great help. There is one thing that is not entirely clear to us, which is probably easy to clarify. After we have created the blocks, it seems that we then need to run fastLink on each block separately (two blocks, or however many there are). Wouldn't this create separate estimates of the m- and u-probabilities for each block? Would it be possible to estimate these probabilities on the entire set of blocked pairs instead of for each block separately?

Many thanks again, Frouke

tedenamorado commented 4 years ago

Hi Frouke,

Indeed, if you proceed as I suggested, you will get a different set of m- and u-probabilities for each block. The wrapper does not allow you to combine them directly, but it can be done indirectly. To do the aggregation there is a solution:

The good thing about this approach is that you can compare the results where you aggregate the m- and u-probabilities with those where you do not. The caveat is that you will be running things twice.

If something is unclear, just let us know.

All my best,

Ted

froukehe0 commented 4 years ago

Dear Ted,

We are still looking at combining the data sources from my original question, but in the meantime I was able to apply fastLink when combining open ambulance data with open accident data (only to find that the information in the open data was insufficient to establish reliable one-to-one mappings). The report is in Dutch, but you may be able to see your paper cited: https://www.swov.nl/publicatie/koppelmogelijkheden-van-ambulancedata-met-andere-bronnen

Thanks again for making fastLink available and for your help with our questions.

tedenamorado commented 3 years ago

Hi Frouke,

Thanks a lot for posting this! We are really happy to learn that fastLink was of help to you in writing the report and we hope it continues to be a valuable tool for your research. Do not hesitate to reach out if you feel we can be of any assistance.

All my best,

Ted