kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Running fastLink on several cores/threads #46

Closed: felixhaass closed this issue 3 years ago

felixhaass commented 3 years ago

Hi,

first off, thanks for the terrific package, it has helped me a lot in my research.

A quick question out of curiosity: I'm trying to run fastLink() on an ARM cluster computer with 64 cores. However, no matter how many cores I specify through n.cores, the system always tells me:

Parallelizing calculation using OpenMP. 1 threads out of 64 are used.

So it's using only 1 of the 64 available threads, which is very inefficient.

Digging through the source code of the package, I found that some of the functions check whether the OS is macOS by including the line if(Sys.info()[['sysname']] == 'Darwin') (e.g. here). I was wondering whether that's what's causing the functions to be limited to only one core? And, if so, whether that OS restriction is necessary because of how the parallel package works?
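For readers unfamiliar with the check quoted above, here is a minimal sketch of what it tests. Sys.info() returns a named character vector whose sysname entry is "Darwin" on macOS, "Linux" on Linux and "Windows" on Windows. The helper cluster_setup_for() and its return labels are hypothetical and only illustrate that a branch of this kind can select a platform-specific code path without capping the core count; this is not fastLink's actual code.

```r
# Query the operating system name of the current machine.
os_name <- Sys.info()[["sysname"]]

# Hypothetical helper (not part of fastLink): choose a setup label per OS.
# The labels are illustrative assumptions only.
cluster_setup_for <- function(sysname) {
  if (sysname == "Darwin") {
    "sequential"  # macOS-specific branch
  } else {
    "parallel"    # every other OS, e.g. Linux on an ARM cluster
  }
}

setup <- cluster_setup_for(os_name)
```

On a Linux cluster the first branch is simply never taken, so a check like this would not by itself explain a limit of one thread.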

In any case, thanks again for your terrific software.

Felix

tedenamorado commented 3 years ago

Hi Felix,

Thanks for raising this issue. How large are the datasets you are passing to fastLink for the merge?

Thanks!

Ted

felixhaass commented 3 years ago

Hi Ted,

thanks for the quick reply! I've varied the size from 5k to 400k observations. When using larger datasets the parallel processing actually kicks in. So, I'm assuming that is expected behavior, since with smaller datasets the overhead of distributing the computation to various cores is probably too expensive... if that's the case, apologies for my question and you can close the issue.

Cheers, Felix

tedenamorado commented 3 years ago

Hi Felix.

That is indeed the case! If some of the datasets are small (fewer than 4500 observations each), 1 core is more than enough. Thanks for raising the issue!
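The rule of thumb above can be sketched as a small helper. This is an illustration under stated assumptions, not fastLink's internal logic: choose_cores() and its arguments are hypothetical names, and the sketch assumes a single core is used whenever either dataset has fewer than 4500 rows.

```r
# Hypothetical helper, not part of fastLink: decide how many cores to use.
choose_cores <- function(n_a, n_b, requested,
                         available = parallel::detectCores(),
                         threshold = 4500) {
  # detectCores() can return NA on some platforms; fall back to one core.
  if (is.na(available)) available <- 1L
  # Assumption: one core suffices when either dataset is below the threshold.
  if (min(n_a, n_b) < threshold) return(1L)
  # Otherwise honour the request, capped by the cores actually available.
  as.integer(max(1L, min(requested, available)))
}

choose_cores(3000, 400000, requested = 16, available = 64)    # small input: 1 core
choose_cores(400000, 400000, requested = 16, available = 64)  # large inputs: 16 cores
```

Under this sketch, the "1 threads out of 64 are used" message is what a small input would produce regardless of the n.cores value requested.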

All my best,

Ted