kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Error in mclapply() on Windows #15

Closed aalexandersson closed 7 years ago

aalexandersson commented 7 years ago

I am using fastLink on confidential data and get an error in mclapply(). I am using fastLink version 0.1.1 on Windows 7 with 4 cores.

This is the problematic R command:

library(fastLink)
fl.out <- fastLink(rpatient7, racs7,
    varnames = c("bstate", "sex", "nysf", "nysl", "ssn", "dob"),
    stringdist.match = c("nysf", "dob"), n.cores = 2)

This is the problematic R output:

==================== 
fastLink(): Fast Probabilistic Record Linkage
==================== 

Calculating matches for each variable.
Error in mclapply(matches.2, function(s) { : 
  'mc.cores' > 1 is not supported on Windows

Immediately after the error, I typed traceback() and this is the result:

> traceback()
4: stop("'mc.cores' > 1 is not supported on Windows")
3: mclapply(matches.2, function(s) {
       ht1 <- which(matrix.1 == s[1])
       ht2 <- which(matrix.2 == s[2])
       list(ht1, ht2)
   }, mc.cores = getOption("mc.cores", no_cores))
2: gammaCK2par(dfA[, varnames[i]], dfB[, varnames[i]], cut.a = cut.a, 
       method = stringdist.method, w = jw.weight, n.cores = n.cores)
1: fastLink(rpatient7, racs7, varnames = c("bstate", "sex", "nysf", 
       "nysl", "ssn", "dob"), stringdist.match = c("nysf", "dob"), 
       n.cores = 2)

If I change the syntax from n.cores = 2 to n.cores = 1 (or if I omit the option) then the R output is fine.

I could not reproduce the error on datasets dfA and dfB. The problem with mclapply() on Windows is discussed further at https://www.r-bloggers.com/implementing-mclapply-on-windows-a-primer-on-embarrassingly-parallel-computation-on-multicore-systems-with-r/

Please advise.
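
For readers hitting the same error in their own code: mclapply() relies on forking, which Windows does not support, so any mc.cores value above 1 stops with the message shown above. A minimal sketch of a portable fallback follows. The helper portable_lapply() and the toy data are hypothetical and are not part of fastLink; they only mimic the pattern visible in the traceback.

```r
library(parallel)

# Hypothetical helper (not part of fastLink): apply FUN to each element of X,
# forking where the OS allows it and using a socket cluster on Windows.
portable_lapply <- function(X, FUN, ..., n.cores = 1L) {
  if (n.cores <= 1L) {
    return(lapply(X, FUN, ...))
  }
  if (.Platform$OS.type == "windows") {
    cl <- makeCluster(n.cores)
    on.exit(stopCluster(cl), add = TRUE)
    # Extra arguments are shipped to the workers explicitly, because
    # socket workers do not see objects in the calling session.
    return(parLapply(cl, X, FUN, ...))
  }
  mclapply(X, FUN, ..., mc.cores = n.cores)
}

# Toy stand-ins for the objects named in the traceback
matrix.1 <- c("a", "b", "a", "c")
matrix.2 <- c("b", "a", "c", "a")
matches.2 <- list(c("a", "a"), c("b", "c"))

# Positions in each vector that equal the first and second value of a pair
find_pair <- function(s, m1, m2) {
  list(which(m1 == s[1]), which(m2 == s[2]))
}

res <- portable_lapply(matches.2, find_pair,
                       m1 = matrix.1, m2 = matrix.2, n.cores = 2L)
```

Passing the data through `...` rather than relying on global variables is what keeps the Windows branch working, since each socket worker starts as an empty R session.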

tedenamorado commented 7 years ago

Hi,

Thanks for pointing that out. To work around the issue, the best option is to install fastLink directly from GitHub:

library(devtools)
install_github("kosukeimai/fastLink", dependencies = TRUE)

Since you are using a Windows computer you will need to install Rtools as well. The latest version of Rtools can be downloaded here:

http://mirror.fcaglp.unlp.edu.ar/CRAN/bin/windows/Rtools/

If the problem persists, please let us know.

Ted

aalexandersson commented 7 years ago

Hi Ted,

The development version of fastLink worked. Thank you.

I installed the development version from RStudio. RStudio conveniently asked whether I wanted to install Rtools, which I did. I have a minor follow-up question and a feature wish.

Follow-up question: Does the option n.cores perhaps refer to threads rather than to cores?

I have a 4-core 8-thread CPU. When I specify "n.cores = 2" I get in part the output "(Using OpenMP to parallelize calculation. 2 threads out of 8 are used.)". When I instead specify "n.cores = 8" I get "[...] 8 threads out of 8 are used.)". This suggests that n.threads would be a more accurate name for the option.
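
The distinction between threads and cores can be checked from R itself. The sketch below uses detectCores() from the base parallel package; the variable names are only illustrative. On a 4-core 8-thread machine like the one described, the two calls would presumably report 8 and 4, but the logical argument is not honoured on every platform and either call can return NA.

```r
library(parallel)

# Logical CPUs (hardware threads) versus physical cores.
# detectCores() may return NA, and logical = FALSE is platform dependent.
n_threads <- detectCores(logical = TRUE)
n_cores   <- detectCores(logical = FALSE)

cat("logical CPUs (threads):", n_threads, "\n")
cat("physical cores:        ", n_cores, "\n")
```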

Feature wish: Classification table, for example as in the package RecordLinkage.

A classification table (a.k.a. confusion table or error matrix) is the traditional summary of linkage results. Beyond match count, match rate, FDR and FNR, it can also be used to calculate other summary measures such as link count and F-measure.
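
To make the request concrete, here is a minimal sketch of how such a table yields the summary measures mentioned. The four counts are invented for illustration, and the formulas are the textbook definitions given a known true match status; they are not necessarily how fastLink estimates FDR and FNR.

```r
# Hypothetical counts of record pairs, for illustration only
tp <- 90    # true matches that were linked
fp <- 10    # non-matches that were linked
fn <- 15    # true matches that were missed
tn <- 885   # non-matches that were not linked

# 2x2 classification table: rows are the linkage decision, columns the truth
confusion <- matrix(c(tp, fn, fp, tn), nrow = 2,
                    dimnames = list(predicted = c("link", "non-link"),
                                    truth = c("match", "non-match")))

link_count <- tp + fp                  # pairs declared links
precision  <- tp / (tp + fp)           # share of links that are true matches
recall     <- tp / (tp + fn)           # share of true matches that were linked
f_measure  <- 2 * precision * recall / (precision + recall)
fdr        <- fp / (tp + fp)           # false discovery rate, 1 - precision
fnr        <- fn / (tp + fn)           # false negative rate, 1 - recall

print(confusion)
print(round(c(precision = precision, recall = recall,
              F = f_measure, FDR = fdr, FNR = fnr), 3))
```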

Anders

tedenamorado commented 7 years ago

Hi Anders,

We are glad your issue is solved now.

Regarding your questions/request:

  1. Yes, you are right, for most systems thread should be a better way to describe that option. We will try to incorporate such a change in a future release.

  2. Currently, we are developing functions that will include detailed confusion tables and other ways to present graphically the results.

Note that if you use the wrapper function fastLink(), it is possible to obtain basic summary stats like match rate, FDR, and FNR. For the step-by-step implementation we do not have such functions yet, but as noted above, we are close to having them finished. We will keep you posted!

Thanks for using fastLink and please keep us posted on how your project goes. Again, any additional feedback would be greatly appreciated.

Ted

aalexandersson commented 7 years ago

Hi Ted,

This is great news, thank you!

A feature related to the confusion table that also would be nice to have is a simple way to add the assumed true match status before doing the linkage. In the R package RecordLinkage this is done as two identity vectors which are harder than variables to work with for R beginners like myself.

I am a statistician at the central Florida Cancer Data System (FCDS), which is affiliated with the University of Miami. At FCDS, I currently use Stata for pre-processing and for clerical review, and the RecordLinkage package for the probabilistic record linkage itself. I am primarily a Stata user, but Stata does not have a good program for probabilistic record linkage. Since you are close to finishing an improved version, I will hold additional feedback until you make it available.

Anders

aalexandersson commented 7 years ago

Hi Ted,

Since you posted version 0.2.0 already, I installed it on Windows and here is more feedback:

Thank you! I can use the new functions inspectEM() and plot() but I get this error when I try to get online help:

Error in fetch(key) : lazy-load database 'C:/Users/aalexandersson/Documents/R/win-library/3.4/fastLink/help/fastLink.rdb' is corrupt

Are inspectEM() and plot() what you meant by "detailed confusion tables"? I am wishing for a traditional 2*2 confusion table illustrating the four outcomes of data matching classification: true positives, false positives, false negatives, true negatives. Is there a simple way already to extract those four summary counts?
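
Until such a function exists, the four counts can be tabulated by hand when an assumed true identifier is available in both files. The sketch below is only an illustration under stated assumptions: the column true_id, the toy data frames, and the index vectors inds.a and inds.b (row numbers of linked pairs) are all hypothetical stand-ins for a real linkage result.

```r
# Toy data; true_id is a hypothetical column holding the assumed true identity
dfA <- data.frame(name = c("ann", "bob", "cat", "dan"), true_id = c(1, 2, 3, 4))
dfB <- data.frame(name = c("anne", "bob", "dan", "eve"), true_id = c(1, 2, 4, 5))

# Hypothetical linkage result: row inds.a[i] of dfA was linked to row inds.b[i] of dfB
inds.a <- c(1, 2, 3)
inds.b <- c(1, 2, 4)

# Truth for every possible pair: TRUE where the identifiers agree
truth <- outer(dfA$true_id, dfB$true_id, "==")

# Linkage decision for every possible pair
pred <- matrix(FALSE, nrow(dfA), nrow(dfB))
pred[cbind(inds.a, inds.b)] <- TRUE

# 2x2 confusion table over all nrow(dfA) * nrow(dfB) pairs
confusion <- table(
  predicted = factor(pred, levels = c(TRUE, FALSE), labels = c("link", "non-link")),
  truth     = factor(truth, levels = c(TRUE, FALSE), labels = c("match", "non-match"))
)
print(confusion)
```

Materializing every pair with outer() is only practical for small files; for large ones the same counts would have to be derived from the linked pairs and the number of true matches without building the full matrix.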

Anders

aalexandersson commented 7 years ago

Email edits: (1) Changed subject title (2) For example, typing insp gives a popup window with "R code execution error", and the command line error message as described below.

kosukeimai commented 7 years ago

Try uninstalling and then reinstalling the package to see if that fixes the problem.
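
A minimal sketch of that suggestion, using only base R functions. The helper name reinstall_package() is hypothetical. It is safest to run it in a fresh R session in which the package has not been loaded, and to restart R afterwards before opening the help pages again.

```r
# Hypothetical helper: remove a package if present, then reinstall it from CRAN.
reinstall_package <- function(pkg, repos = getOption("repos")) {
  if (pkg %in% rownames(installed.packages())) {
    remove.packages(pkg)
  }
  install.packages(pkg, repos = repos)
}

# Usage (not run here, since it modifies the local library):
# reinstall_package("fastLink")
```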

tedenamorado commented 7 years ago

Hi Anders,

I hope all is OK.

Are you still having these issues when using fastLink?

You asked: Are inspectEM() and plot() what you meant by "detailed confusion tables"?

No, these functions are designed to make plots that present the agreement vectors in an easy-to-interpret fashion.

As for the confusion table, we are still working on such a function. I will let you know when we push it.

Thanks a lot for your patience and all your feedback! We hope fastLink helps with the record linkage problem you are dealing with.

Ted

aalexandersson commented 7 years ago

Hi Ted,

Sorry for being late in my reply. Everything is OK. I did not have a chance to reinstall the software at work yet because I am home taking the long weekend off. I do not expect the help file to remain a problem on my Windows 7 with a clean install. I am closing the issue. If the problem remains I will let you know.

Thank you so much for working on adding a confusion table! It would make the results more comparable to the R package RecordLinkage, and to traditional output. A confusion table is the main feature that I and Florida Cancer Data System miss. It would enable me to switch record linkage software at work from RecordLinkage to fastLink.

Best wishes, Anders

aalexandersson commented 7 years ago

A clean re-installation of version 0.2.0 from CRAN fixed the problem with the corrupted help file.

aalexandersson commented 7 years ago

Stata has a user-written command, classtabi, which gives another concrete example of how a confusion matrix can be displayed. Unfortunately, the program has two minor bugs, which are described on Statalist here:

https://www.statalist.org/forums/forum/general-stata-discussion/general/1321572-a-new-command-classtabi-now-available-for-download-from-ssc

Hope this helps, Anders

aalexandersson commented 7 years ago

Ariel Linden has now updated his Stata program classtabi to fix the two bugs. Hopefully you find it useful for developing a similar confusion matrix in fastLink.

https://www.statalist.org/forums/forum/general-stata-discussion/general/1321572-a-new-command-classtabi-now-available-for-download-from-ssc?p=1413865#post1413865

tedenamorado commented 7 years ago

Thanks a lot for sharing this with us! We are close to releasing a new version of the package, and we promise that the new function with a confusion table will be released then.

In addition, we are adding two new functions that will allow the users to compare numeric variables based on the absolute difference between them.
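
As a rough illustration of what comparing numeric variables by absolute difference can look like, here is a minimal sketch. The function compare_numeric() and its thresholds are hypothetical and are not the functions announced above; the three agreement levels simply mirror the agree, partially agree, disagree scheme commonly used for string comparisons in record linkage.

```r
# Hypothetical helper, not the fastLink API: classify every pair of values as
# agree (2), partially agree (1) or disagree (0) by absolute difference.
compare_numeric <- function(a, b, cut.a = 1, cut.p = 2) {
  d <- abs(outer(a, b, "-"))                 # |a_i - b_j| for all pairs
  out <- matrix(0L, nrow = length(a), ncol = length(b))
  out[which(d <= cut.p)] <- 1L               # within the looser threshold
  out[which(d <= cut.a)] <- 2L               # within the tighter threshold
  out[is.na(d)] <- NA_integer_               # missing values stay missing
  out
}

age_a <- c(30, 45, 62)
age_b <- c(31, 43.5, 70, NA)
agreement <- compare_numeric(age_a, age_b)
print(agreement)
```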