kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Performance (gamma*() functions) #76

Open jw2249a opened 7 months ago

jw2249a commented 7 months ago

So I am going to make a larger pull request on this, but I noticed there were some optimization problems with the gamma*() functions.

**Avoidance of factors.** I noticed you coerce the inputs into character vectors repeatedly, but isn't it better to use factors? They build in the unique values and match the default handling of string values in data frames. Factors allow quick matching of values across vectors because a factor's integer codes index directly into its levels, so equal strings share equal codes. Because factors let the matching happen in a single pass, the multiple parallel calls can be collapsed into one, which also avoids the costly manual gc() call.
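For illustration, here is a minimal sketch of the factor idea (my own toy example, not fastLink's code): encoding both inputs against a shared set of levels turns string comparison into integer comparison, and the unique values come for free.

```r
a <- c("smith", "jones", "smith", "brown")
b <- c("jones", "smith", "davis")

levs <- sort(unique(c(a, b)))    # shared pool of unique values
fa <- factor(a, levels = levs)   # stored internally as integer codes
fb <- factor(b, levels = levs)

# exact agreement across vectors reduces to comparing integer codes
matches <- which(outer(as.integer(fa), as.integer(fb), "=="), arr.ind = TRUE)
matches  # rows index into a, columns into b
```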

**Referencing objects in parallel.** I noticed there was a lot of memory usage relative to the number of processes spawned by the doParallel package, so I used setRefClass to point to the object in memory rather than pass the objects explicitly to the function. This allowed me to increase the number of slices without the penalty created by memory pressure.
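A minimal sketch of that reference pattern, assuming a fork-based doParallel backend on Unix (where workers share the parent's memory copy-on-write); `SharedData` is a hypothetical class name, not anything in fastLink:

```r
library(doParallel)

# Hypothetical container: a RefClass instance is passed by reference, so
# handing `store` to functions does not duplicate the underlying vector.
SharedData <- setRefClass("SharedData", fields = list(values = "character"))
store <- SharedData$new(values = sample(letters, 1e6, replace = TRUE))

registerDoParallel(cores = 4)  # fork backend on Unix: copy-on-write sharing

chunks <- split(seq_along(store$values),
                cut(seq_along(store$values), 4, labels = FALSE))
res <- foreach(idx = chunks, .combine = c) %dopar% {
  # each worker reads through the shared reference rather than receiving
  # its own serialized copy of the full data in the task payload
  sum(store$values[idx] == "a")
}
```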

I benchmarked my gammaCKpar edits on character vectors of sizes 100x1000, 1000x1000, 1000x1000000, and 10000x2000000, and found that the new version ran consistently faster than the original. For anything above 1000x1000, it consistently sat at around one third of the original's run time.
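If anyone wants to reproduce this kind of comparison, a rough sketch of the setup (the strings here are synthetic stand-ins for real name fields, and the exported `gammaCKpar(matAp, matBp, n.cores = ...)` interface is assumed):

```r
library(fastLink)
library(microbenchmark)

# synthetic stand-ins for a name field in each data frame
strA <- sprintf("name%05d", sample(1e4, 1e3, replace = TRUE))
strB <- sprintf("name%05d", sample(1e4, 1e6, replace = TRUE))

microbenchmark(
  gammaCKpar(strA, strB, n.cores = 4),  # fuzzy string comparison (JW)
  times = 5
)
```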

I plan on packaging this in a pull request, but you can see the modified file here. I changed a few other functions that I plan to commit as well; could you let me know whether you are interested in it as is? If you plan to make the modifications separately, I would probably deviate a bit more from the original build for other performance considerations, but, if useful, I can stick to the original structure for easier incorporation of changes.

Thanks, Jack

jw2249a commented 7 months ago

Also, I implemented both of these in Julia, but I am considering adding them to the R package. Running Jaro-Winkler in batches and potentially using this cheaty Jaro-Winkler algorithm could be hugely beneficial to the base program.
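For concreteness, a minimal sketch of what batching could look like on the R side, using the stringdist package as a stand-in comparator (the cheaty variant linked above is not reproduced here):

```r
library(stringdist)

strA <- sprintf("name%04d", 1:500)
strB <- sprintf("name%04d", sample(1e4, 1e5, replace = TRUE))

cut.a <- 0.92  # fastLink's default full-agreement threshold
batch <- 1e4   # bounds the size of any one distance matrix in memory

hits <- lapply(split(seq_along(strB), ceiling(seq_along(strB) / batch)),
               function(idx) {
  d <- stringdistmatrix(strA, strB[idx], method = "jw", p = 0.1)
  # JW similarity is 1 - distance; keep pairs above the agreement cutoff
  # (rows index strA, columns index positions within this batch)
  which(1 - d >= cut.a, arr.ind = TRUE)
})
```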

tedenamorado commented 7 months ago

This looks fantastic! We will take a close look at your fixes and test them over the winter break. We are also working on some additional efficiency gains -- something along the lines of the cheaty Jaro-Winkler algorithm, but more general (as it is not tied to a specific string similarity measure).

Thanks so much, @jw2249a!

jw2249a commented 6 months ago

So I have been adapting your algorithm to Julia, and I have found a huge benefit from the fuzzy Jaro-Winkler algorithm. I am planning to add support for other fuzzy string distances, but using the other JW algorithm I was able to get to the tableCounts step in 19 minutes for 1e10 comparisons (1M x 10K strings; 3 variables; no blocking). I wasn't able to run the original gammaCKpar function on that because of memory issues (I only have 112 GB), but the results for 1e9 comparisons across 3 variables were:

- Fuzzy: 251.354 s (7,522,092,396 allocations: 335.77 GiB)
- Not fuzzy: 516.706 s (16,045,371,802 allocations: 3,232.16 GiB)

I also made a few optimizations. For example, for a low number of comparison variables with 6 bits of complexity or less (e.g., 3 variables with 3 levels plus missing), I just brute-force search the generated bit vectors to get counts, because there is a maximum of 64 possible values.
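To make that concrete, a small R sketch of the counting strategy (my own illustration, not the Julia code): pack each pair's per-variable agreement levels into a 6-bit code, then tabulate over the 64 possible cells.

```r
set.seed(1)
n  <- 1e6
g1 <- sample(0:3, n, replace = TRUE)  # agreement level per variable:
g2 <- sample(0:3, n, replace = TRUE)  # three gamma levels plus missing,
g3 <- sample(0:3, n, replace = TRUE)  # so each fits in 2 bits

code   <- g1 + 4L * g2 + 16L * g3          # three 2-bit fields -> 0..63
counts <- tabulate(code + 1L, nbins = 64)  # one brute-force pass, 64 cells
```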