kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Running time #60

Open MAranzazuRU89 opened 2 years ago

MAranzazuRU89 commented 2 years ago

I have a question about the expected running time and the computing capacity I should plan for when using fastLink. I am trying to run it on a database of 1.7M observations, matching on only two variables. However, after 12 hours the code has still not gotten past the first task of calculating matches for each variable. Is this to be expected, in which case I should move to a cluster, or does it sound unusual and suggest I am doing something wrong? Thank you!

aalexandersson commented 2 years ago

Disclaimer: I am a regular fastLink user, not a fastLink developer.

It depends. Details matter. Please show the fastLink code that you used. Do you use blocking?

tedenamorado commented 2 years ago

Hi @MAranzazuRU89,

As @aalexandersson mentions, a bit more context would help here. If your data allows for blocking (creating subsets of observations that are similar in at least one dimension), then I have no doubt the task you have in mind can be scaled and perhaps finished in less than 12 hours. If blocking is not an option, then more computing power could be a solution.
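A minimal sketch of why blocking helps, assuming both files share a blocking column (here a hypothetical `city` column) and that records in different blocks never need to be compared. The helper names `make_blocks` and `link_blocks` and the toy data are illustrative and not part of fastLink; the package also ships a `blockData()` helper, and its documentation is the authority on the built-in approach.

```r
# Split two data frames into blocks that share the same value of `key`.
# Only key values present in both files yield a block.
make_blocks <- function(dfA, dfB, key) {
  shared <- intersect(unique(dfA[[key]]), unique(dfB[[key]]))
  shared <- shared[!is.na(shared)]
  blocks <- lapply(shared, function(v) {
    list(
      dfA = dfA[dfA[[key]] %in% v, , drop = FALSE],
      dfB = dfB[dfB[[key]] %in% v, , drop = FALSE]
    )
  })
  names(blocks) <- as.character(shared)
  blocks
}

# Hypothetical toy data, only to show how the comparison count shrinks.
dfA <- data.frame(city = c("Lima", "Lima", "Quito"),
                  name = c("Acme", "Bolt", "Cora"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(city = c("Lima", "Quito", "Quito", "Cusco"),
                  name = c("Acme", "Cora", "Dune", "Echo"),
                  stringsAsFactors = FALSE)

blocks <- make_blocks(dfA, dfB, "city")

# Number of record pairs to compare without and with blocking.
pairs_full <- as.numeric(nrow(dfA)) * nrow(dfB)
pairs_blocked <- sum(vapply(
  blocks,
  function(b) as.numeric(nrow(b$dfA)) * nrow(b$dfB),
  numeric(1)
))

# Link each block separately. Defined but not called here: it needs the
# fastLink package and realistically sized data.
link_blocks <- function(blocks, varnames) {
  lapply(blocks, function(b) {
    fastLink::fastLink(dfA = b$dfA, dfB = b$dfB, varnames = varnames)
  })
}
```

Because the number of candidate pairs is the product of the two file sizes, splitting 1.7M records into many smaller blocks can cut the work substantially, provided the blocking variable is recorded reliably in both files.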

Keep us posted!

Ted

MAranzazuRU89 commented 2 years ago

Hi! I thought I couldn't block, but now I think I can. I will try that first, since it seems like the smarter move, and if it doesn't work I'll move to a cluster. Thank you!

ishanaratan commented 2 years ago

Hi! I have a question directly related to run time reduction. I am trying to run fastLink on a cluster computer (matching a few million firms), and was wondering if I needed to specify the number of nodes available (and perhaps structure the code differently)?

I didn't see a mention of how to do this in the documentation, but perhaps I missed it. Thanks in advance!

tedenamorado commented 2 years ago

Hi @ishanaratan,

If you are using a cluster computer, I would do the following:

  1. Block the data. For example, if you match firms from different cities, one idea is to subset your data by city name.
  2. Run fastLink matching one subset per node (or group of nodes). Within the node, fastLink will allocate the number of clusters so that you do not run into memory issues.

fastLink runs in parallel within a node, but not across nodes. If the nodes have multiple threads, fastLink will make use of all of them if the size of the data is significant. If it is small, then it will use the minimum number of threads needed.
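The two steps above could be organized roughly as follows, with one scheduler job per node and each job linking only its own share of the blocks. This is a sketch under stated assumptions, not the maintainers' recommended setup: `blocks_for_job` and `link_my_blocks` are hypothetical helpers, the round-robin assignment is one arbitrary choice, and reading the job index from `SLURM_ARRAY_TASK_ID` assumes a SLURM job array (other schedulers expose a different variable). The `n.cores` argument of `fastLink()` is left at `NULL` by default so the package chooses the number of cores within the node.

```r
# Assign block keys to jobs round-robin; job_id runs from 1 to n_jobs.
blocks_for_job <- function(block_keys, job_id, n_jobs) {
  owner <- (seq_along(block_keys) - 1) %% n_jobs + 1
  block_keys[owner == job_id]
}

# Link only the blocks that belong to this job. Requires the fastLink
# package; parallelism happens within the node through n.cores.
link_my_blocks <- function(dfA, dfB, key, varnames,
                           job_id, n_jobs, n_cores = NULL) {
  # sort() drops NA keys and gives every job the same block ordering.
  keys <- sort(intersect(dfA[[key]], dfB[[key]]))
  mine <- blocks_for_job(keys, job_id, n_jobs)
  out <- lapply(mine, function(v) {
    fastLink::fastLink(
      dfA = dfA[dfA[[key]] %in% v, , drop = FALSE],
      dfB = dfB[dfB[[key]] %in% v, , drop = FALSE],
      varnames = varnames,
      n.cores = n_cores
    )
  })
  names(out) <- as.character(mine)
  out
}

# Assumption: a SLURM job array sets this variable; default to job 1.
job_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))

# Example of the assignment alone, with hypothetical city names.
cities <- c("Cusco", "Lima", "Piura", "Quito", "Tacna")
job1 <- blocks_for_job(cities, job_id = 1, n_jobs = 2)
job2 <- blocks_for_job(cities, job_id = 2, n_jobs = 2)
```

Each job would then save its own results (for example with `saveRDS()`), and the per-block outputs can be combined afterwards, since fastLink itself does not coordinate work across nodes.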

If anything comes up, please let us know.

All my best,

Ted