kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Use of gc() creates a constant overhead to calling fastLink() #73

Open zmbc opened 1 year ago

zmbc commented 1 year ago

I ran into an issue running fastLink with extremely restrictive blocking. Because of fastLink's approach, each block requires a separate call to the fastLink function. However, I found that no matter how small I made the blocks, that function would take about a second to run (with an em.obj provided). I don't think this has anything to do with my data in particular, but if you are not able to immediately reproduce this by linking two 1-row dataframes, I can provide a reprex.

I ran the function with profvis and found that the vast majority of the time was being spent in garbage collection, kicked off by the manual use of the gc() function. I believe calling this function when it is not needed (the data is so small it is taking very little memory) creates a constant runtime overhead.
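A minimal base R sketch of the effect described above, for anyone who wants to see the overhead without installing anything. It does not use fastLink itself: `compare_block()` is a hypothetical stand-in for one small-block call, and the `force_gc` argument is invented for illustration. Absolute timings will vary by machine and by how much is already on the R heap.

```r
# Hypothetical stand-in for a single small-block linkage call.
# This is NOT fastLink code; it only mimics "tiny work + manual gc()".
compare_block <- function(a, b, force_gc = FALSE) {
  agree <- outer(a, b, "==")   # tiny pairwise comparison matrix
  if (force_gc) gc()           # manual collection, as reported in this issue
  sum(agree)                   # number of exact agreements
}

a <- c("ann", "bob")
b <- c("bob", "cat")
n_calls <- 50                  # pretend we have 50 very small blocks

# Time the same loop with and without the manual gc() call.
t_plain <- system.time(
  for (i in seq_len(n_calls)) compare_block(a, b)
)[["elapsed"]]

t_gc <- system.time(
  for (i in seq_len(n_calls)) compare_block(a, b, force_gc = TRUE)
)[["elapsed"]]

cat("without gc():", t_plain, "seconds\n")
cat("with gc():   ", t_gc, "seconds\n")
```

Because `gc()` triggers a collection of the whole heap regardless of how small the inputs are, its cost is paid once per call and does not shrink with the block size.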

I confirmed that this was causing the performance issue by forking this repository and creating a branch that removed all uses of gc() on the codepath I was using. After installing from this branch, my runtime with very small blocks decreased by about 20x.

The "Advanced R" book by Hadley Wickham says: "Despite what you might have read elsewhere, there’s never any need to call gc() yourself. R will automatically run garbage collection whenever it needs more space..." I'm wondering if you did any profiling that justified the use of gc(); if so, perhaps it should at least be gated behind something having to do with data size.
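To make the "gate it behind data size" suggestion concrete, here is a rough sketch. `maybe_gc()` and its `min_cells` threshold are hypothetical names and values made up for this example; they are not part of fastLink, and the threshold would need profiling to choose sensibly.

```r
# Hypothetical helper: only force a collection when the comparison
# space (rows in A times rows in B) is large. Not part of fastLink.
maybe_gc <- function(n_a, n_b, min_cells = 1e6) {
  # as.numeric() avoids integer overflow for large row counts
  if (as.numeric(n_a) * as.numeric(n_b) >= min_cells) {
    gc()                       # manual collection for large inputs only
    return(invisible(TRUE))
  }
  invisible(FALSE)             # small inputs: leave it to R's automatic GC
}

maybe_gc(1, 1)         # two 1-row data frames: gc() is skipped
maybe_gc(5000, 5000)   # 25 million candidate pairs: gc() runs
```

With a gate like this, linking two 1-row data frames would skip the manual collection entirely, while large unblocked runs would keep whatever benefit the original calls were meant to provide.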

zmbc commented 1 year ago

As a more general note, even after removing this bottleneck, using restrictive blocking is not feasible with fastLink. I think because the operations are vectorized, it is still much faster to run large blocks than to run fastLink separately for each small block. If blocking could be built into the fastLink method, I expect it would be possible to get speedups with more restrictive blocking.

jw2249a commented 9 months ago

I'm not the original author, but I noticed the same thing. I've seen benefits from explicit dereferencing (i.e. setting var <- NULL), but never from gc().
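For readers unfamiliar with the pattern, explicit dereferencing looks roughly like the sketch below. `summarise_big()` is a hypothetical function written for this example, not fastLink code.

```r
# Hypothetical example of explicit dereferencing inside a function.
summarise_big <- function(n = 1e6) {
  big <- rep(1, n)     # large temporary vector
  total <- sum(big)    # the only thing we need from it
  big <- NULL          # drop the reference; no gc() call needed
  # ... any later work here runs without `big` still being reachable ...
  total
}

summarise_big(10)
```

Once the last reference is gone, the memory is eligible to be reclaimed the next time R decides to collect on its own, so no manual `gc()` call is required.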

tedenamorado commented 9 months ago

Thanks for raising this issue! The calls to gc() come from the original fastLink code written in 2015/2016. My sense is that R's automatic garbage collection did not work as well back then as it does today. We will work on reducing any overhead that comes from the gc() calls.

We plan to release a new version of fastLink that will keep the same structure but will be faster and, more importantly, will achieve that without sacrificing accuracy.

Thanks so much for all your support!