kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

How scalable is fastLink with tables of millions of rows? #48

Ibrokhimsadikov closed 3 years ago

Ibrokhimsadikov commented 3 years ago

I searched for this question but could not find any answers about performance. Could anyone say whether fastLink is scalable enough for tables that exceed a million rows? Thank you.

aalexandersson commented 3 years ago

Currently, fastLink without blocking cannot handle tables with millions of rows. For linkages that large, you would need either splink (an independently developed version of fastLink for Apache Spark) or fastLink with blocking. I am not a fastLink developer, but I routinely use fastLink with approximately 0.1 * 3.5 million rows, which in practice requires at least two blocks.
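
To make the blocking point concrete, here is a rough sketch in R. It assumes the "0.1 * 3.5 million" figure means one table of about 0.1 million rows linked to another of about 3.5 million, and that blocks are equally sized, which real data rarely are. The helper `link_in_blocks` and its arguments `block_var` and `link_vars` are hypothetical names for illustration. `blockData()` and `fastLink()` are the package's exported functions, but the exact arguments you need will depend on your data, so treat this as a starting point rather than a recommended workflow.

```r
# Table sizes assumed from the comment above: ~0.1 million by ~3.5 million rows.
n_a <- 1e5
n_b <- 3.5e6

# Without blocking, every record in A is a candidate match for every record in B.
pairs_unblocked <- n_a * n_b  # 3.5e11 candidate pairs

# Illustrative assumption: a blocking variable splits both tables into k
# equally sized blocks, and records are compared only within the same block.
pairs_blocked <- function(n_a, n_b, k) k * (n_a / k) * (n_b / k)

pairs_blocked(n_a, n_b, 2)  # half as many candidate pairs as without blocking

# Hypothetical helper: run fastLink separately inside each block.
# block_var: column name(s) to block on (must exist in both data frames).
# link_vars: column names to compare within each block.
link_in_blocks <- function(dfA, dfB, block_var, link_vars) {
  # blockData() returns one element per block, each holding row indices
  # into dfA (dfA.inds) and dfB (dfB.inds).
  blocks <- fastLink::blockData(dfA, dfB, varnames = block_var)
  lapply(blocks, function(b) {
    fastLink::fastLink(
      dfA[b$dfA.inds, , drop = FALSE],
      dfB[b$dfB.inds, , drop = FALSE],
      varnames = link_vars
    )
  })
}
```

A call might look like `link_in_blocks(dfA, dfB, block_var = "gender", link_vars = c("firstname", "lastname", "birthyear"))`, where the column names are placeholders. Each element of the result is a separate fastLink fit, so matches then have to be extracted block by block.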

The developers are working on making fastLink faster. Hopefully, they can release a faster version in 2021.

Ibrokhimsadikov commented 3 years ago

Thank you, @aalexandersson, for your answer. I very much appreciate your opinion.