kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Improving gamma.R funs #77

Open jw2249a opened 6 months ago

jw2249a commented 6 months ago

This imports collapse::qF for quick creation of factors, then refactors the gamma funs to use those factors and references, which decreases the number of parallel calls and the memory pressure. This improves performance quite a bit.
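To illustrate the general idea (a minimal sketch, not the code in this PR): encoding a field as a factor means the comparison only has to run over the unique levels, and the result can be expanded back to record pairs through the integer codes. The inputs below are made up, base `factor()` stands in for `collapse::qF`, and exact equality stands in for the string-distance step; all of these are assumptions for brevity.

```r
# Hypothetical inputs: first names from two files (not the benchmark data).
names_a <- c("anna", "john", "anna", "maria", "john")
names_b <- c("john", "ana", "maria", "maria")

# Encode each field once. collapse::qF(x) could be swapped in for factor(x).
f_a <- factor(names_a)
f_b <- factor(names_b)

# Compare only the unique levels. Exact equality is a stand-in for the
# string-distance comparison (an assumption made to keep the sketch short).
level_match <- outer(levels(f_a), levels(f_b), FUN = "==")

# Expand back to record pairs by indexing with the integer codes.
pair_match <- level_match[as.integer(f_a), as.integer(f_b), drop = FALSE]

# Level comparisons actually computed versus record pairs covered.
n_compared <- c(levels = length(level_match), pairs = length(pair_match))
```

The saving comes from the gap between the two counts in `n_compared`: the more duplicated values a field has, the fewer comparisons are needed relative to the number of record pairs.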

jw2249a commented 6 months ago

My benchmark used first names from two datasets (one voter file and another list). I tested with exponentially larger sets, and the speed and memory improvements were especially noticeable on the expensive gammaCKpar.R files.

One more note: matrices are often used instead of vectors. In my main branch the cpp file doesn't take matrices as input. I don't know if multidimensional vars were considered at one point, but there are a bunch of calls and coercions that build these matrices and could be removed.

tedenamorado commented 6 months ago

Thanks so much for sharing this with us @jw2249a! This is fantastic! I am checking the new functions as we speak. I will report back soon.

Re matrices vs vectors: your intuition is correct. We left the door open for linkage fields that could be compared in more complex ways.