Open MAranzazuRU89 opened 2 years ago
Disclaimer: I am a regular fastLink user, not a fastLink developer.
It depends. Details matter. Please show the fastLink code that you used. Do you use blocking?
Hi @MAranzazuRU89,
As @aalexandersson mentions, a bit more context would help here. If your data allow for blocking (splitting the observations into subsets that agree on at least one dimension, so that only records within the same subset are compared), then I have no doubt the task you have in mind can be scaled down and perhaps finished in less than 12 hours. If blocking is not an option, then more computing power could be a solution.
Keep us posted!
Ted
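To make the effect of blocking concrete, here is a minimal sketch in Python. It is not fastLink code and does not call the package. It only counts how many record pairs would be compared with and without blocking. The block key (a two-letter state code), the dataset sizes, and the helper name `comparison_count` are all made up for illustration.

```python
from collections import Counter


def comparison_count(keys_a, keys_b):
    """Count record pairs when only records sharing a block key are paired.

    keys_a and keys_b hold one block key per record in dataset A and B.
    """
    counts_a = Counter(keys_a)
    counts_b = Counter(keys_b)
    # A block contributes (records in A) x (records in B) candidate pairs.
    # Counter returns 0 for keys that are missing from dataset B.
    return sum(n_a * counts_b[key] for key, n_a in counts_a.items())


# Hypothetical toy data: 1,000 records per dataset, spread over three blocks.
keys_a = ["CA"] * 500 + ["NY"] * 300 + ["TX"] * 200
keys_b = ["CA"] * 400 + ["NY"] * 400 + ["TX"] * 200

full = len(keys_a) * len(keys_b)            # every record against every record
blocked = comparison_count(keys_a, keys_b)  # only pairs within the same block

print(f"without blocking: {full:,} pairs")
print(f"with blocking:    {blocked:,} pairs")
```

In this toy case blocking cuts the work from 1,000,000 pairs to 360,000. The saving grows as the number of blocks increases and as the blocks become more even in size, which is why blocking tends to matter most on datasets with millions of rows.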
Hi! I thought I couldn't block, but now I think I can. I will try that first, and if it doesn't work, I think I'll move to a cluster. Blocking seems like the smarter move, though. Thank you!
Hi! I have a question directly related to run time reduction. I am trying to run fastLink on a cluster computer (matching a few million firms), and was wondering if I needed to specify the number of nodes available (and perhaps structure the code differently)?
I didn't see a mention of how to do this in the documentation, but perhaps I missed it. Thanks in advance!
Hi @ishanaratan,
If you are using a cluster computer, I would do the following:
fastLink runs in parallel within a node, but not across nodes. If the node has multiple threads, fastLink will make use of all of them when the data are large enough. If the data are small, it will use only the minimum number of threads needed.
Please let us know if anything comes up.
All my best,
Ted
I have a question about the expected running time and the computing capacity I need to plan for in order to use fastLink. I am trying to run it on a database of 1.7M observations, matching on only two variables. However, so far (the code has been running for 12 hours) I have not been able to get past the first task of calculating matches for each variable. So I was wondering whether this is to be expected and I should move to a cluster, or whether this sounds odd and I am doing something wrong. Thank you!