Closed aalexandersson closed 7 years ago
Hi,
Thanks for pointing that out. To work around the issue, the best option is to install fastLink directly from GitHub:
library(devtools)
install_github("kosukeimai/fastLink", dependencies = TRUE)
Since you are using a Windows computer, you will also need to install Rtools. The latest version of Rtools can be downloaded here:
http://mirror.fcaglp.unlp.edu.ar/CRAN/bin/windows/Rtools/
If the problem persists, please let us know.
Ted
Hi Ted,
The development version of fastLink worked. Thank you.
I installed the development version from RStudio. RStudio conveniently asked whether I wanted to install Rtools, which I did. I have a minor follow-up question and a feature wish.
Follow-up question: Does the option n.cores perhaps refer to threads rather than to cores?
I have a 4-core 8-thread CPU. When I specify "n.cores = 2" I get in part the output "(Using OpenMP to parallelize calculation. 2 threads out of 8 are used.)". When I instead specify "n.cores = 8" I get "[...] 8 threads out of 8 are used.)". This suggests that n.threads is more accurate.
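The distinction between cores and threads can be checked from R itself. The following is a minimal sketch using the base `parallel` package, not anything from fastLink; the variable names are illustrative only.

```r
library(parallel)

# Physical cores versus logical processors (hardware threads).
# detectCores() can return NA on platforms where the count is unknown.
physical_cores <- detectCores(logical = FALSE)
logical_cpus <- detectCores(logical = TRUE)

cat("Physical cores:    ", physical_cores, "\n")
cat("Logical processors:", logical_cpus, "\n")
```

On a 4-core, 8-thread CPU the two numbers would typically differ, and the "out of 8" in the OpenMP message above corresponds to the logical count.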
Feature wish: Classification table, for example as in the package RecordLinkage.
A classification table (a.k.a. table of confusion or error matrix) is the traditional summary of linkage results. Beyond match count, match rate, FDR, and FNR, it can also be used to calculate other summary measures such as link count and F-measure.
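For readers unfamiliar with the request, here is a minimal base R sketch of such a table and the measures derived from it. It is not fastLink or RecordLinkage output: the two logical vectors are made-up data, with one entry per candidate record pair.

```r
# Hypothetical data: TRUE = match, FALSE = non-match, one entry per record pair.
true_match <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
pred_match <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)

# 2 x 2 confusion table: rows are the linkage decision, columns the truth.
conf <- table(
  predicted = factor(pred_match, levels = c(TRUE, FALSE)),
  truth = factor(true_match, levels = c(TRUE, FALSE))
)
print(conf)

tp <- conf["TRUE", "TRUE"]    # true positives
fp <- conf["TRUE", "FALSE"]   # false positives
fn <- conf["FALSE", "TRUE"]   # false negatives
tn <- conf["FALSE", "FALSE"]  # true negatives

fdr <- fp / (tp + fp)         # false discovery rate
fnr <- fn / (tp + fn)         # false negative rate
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f_measure <- 2 * precision * recall / (precision + recall)
```

All four cell counts are kept, so any other summary measure can be computed from the same table.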
Anders
Hi Anders,
We are glad your issue is solved now.
Regarding your questions/request:
Yes, you are right: for most systems, "threads" would be a better way to describe that option. We will try to incorporate such a change in a future release.
Currently, we are developing functions that will include detailed confusion tables and other ways to present the results graphically.
Note that if you use the wrapper function fastLink(), it is possible to obtain basic summary stats like match rate, FDR, and FNR. For the step-by-step implementation we do not have such functions yet, but as noted above, we are close to having them finished. We will keep you posted!
Thanks for using fastLink and please keep us posted on how your project goes. Again, any additional feedback would be greatly appreciated.
Ted
Hi Ted,
This is great news, thank you!
A feature related to the confusion table that also would be nice to have is a simple way to add the assumed true match status before doing the linkage. In the R package RecordLinkage this is done as two identity vectors which are harder than variables to work with for R beginners like myself.
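To illustrate what "true match status as a variable" could look like, here is a rough base R sketch. Everything in it is hypothetical: the data frames `file_a` and `file_b`, the `true_id` column, and the index vectors `idx_a` and `idx_b`, which stand in for whatever linked row pairs a linkage routine returns. It is not part of either package.

```r
# Hypothetical files that both carry a known identifier column, true_id.
file_a <- data.frame(true_id = c(101, 102, 103, 104),
                     name = c("ann", "bob", "cat", "dan"),
                     stringsAsFactors = FALSE)
file_b <- data.frame(true_id = c(102, 101, 105, 104),
                     name = c("bob", "ann", "eve", "dan"),
                     stringsAsFactors = FALSE)

# Hypothetical linkage result: row idx_a[i] of file_a was linked
# to row idx_b[i] of file_b.
idx_a <- c(1, 2, 3)
idx_b <- c(2, 1, 3)

# A link is correct when both rows carry the same true_id.
is_true_link <- file_a$true_id[idx_a] == file_b$true_id[idx_b]

tp <- sum(is_true_link)                       # correct links
fp <- sum(!is_true_link)                      # incorrect links
n_true <- length(intersect(file_a$true_id, file_b$true_id))
fn <- n_true - tp                             # true matches that were missed
```

This assumes `true_id` is unique within each file; with duplicates, the count of true matches would need a different calculation.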
I am a statistician at the central Florida Cancer Data System (FCDS), which is affiliated with the University of Miami. At FCDS, I currently use Stata for pre-processing and for clerical review, and the RecordLinkage package for the probabilistic record linkage itself. I am primarily a Stata user, but Stata does not have a good program for probabilistic record linkage. Since you are close to finishing an improved version, I will hold off on additional feedback until you make it available.
Anders
Hi Ted,
Since you posted version 0.2.0 already, I installed it on Windows and here is more feedback:
Thank you! I can use the new functions inspectEM() and plot() but I get this error when I try to get online help:
Error in fetch(key) : lazy-load database 'C:/Users/aalexandersson/Documents/R/win-library/3.4/fastLink/help/fastLink.rdb' is corrupt
Are inspectEM() and plot() what you meant by "detailed confusion tables"? I am wishing for a traditional 2*2 confusion table illustrating the four outcomes of data matching classification: true positives, false positives, false negatives, true negatives. Is there a simple way already to extract those four summary counts?
Anders
Email edits: (1) Changed the subject title. (2) For example, typing insp gives a popup window with "R code execution error", and the command-line error message described above.
Try uninstalling and then reinstalling the package to see if that fixes the problem.
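In R, that suggestion could look roughly like the sketch below. The helper name `reinstall_package` is made up for illustration; `remove.packages()`, `installed.packages()` and `install.packages()` are standard functions from the `utils` package. Running it in a fresh R session, with the package not loaded, is an assumption that tends to avoid stale help databases.

```r
# Hypothetical helper: remove a package if present, then install it again.
reinstall_package <- function(pkg) {
  if (pkg %in% rownames(installed.packages())) {
    remove.packages(pkg)
  }
  install.packages(pkg)
}

# Usage (requires network access; run in a fresh R session):
# reinstall_package("fastLink")
```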
Hi Anders,
I hope all is OK.
Are you still having these issues when using fastLink?
You asked: "Are inspectEM() and plot() what you meant by 'detailed confusion tables'?" No, these functions are designed to make plots that present the agreement vectors in an easy-to-interpret fashion.
As for the confusion table, we are still working on such a function. I will let you know when we push it.
Thanks a lot for your patience and all your feedback! We hope fastLink helps with the record linkage problem you are dealing with.
Ted
Hi Ted,
Sorry for being late in my reply. Everything is OK. I did not have a chance to reinstall the software at work yet because I am home taking the long weekend off. I do not expect the help file to remain a problem on my Windows 7 with a clean install. I am closing the issue. If the problem remains I will let you know.
Thank you so much for working on adding a confusion table! It would make the results more comparable to the R package RecordLinkage, and to traditional output. A confusion table is the main feature that I and Florida Cancer Data System miss. It would enable me to switch record linkage software at work from RecordLinkage to fastLink.
Best wishes, Anders
A clean re-installation of version 0.2.0 from CRAN fixed the problem with the corrupted help file.
Stata has a user-written command, classtabi, which gives another concrete example of how the confusion matrix can be displayed. Unfortunately, the program has two minor bugs, which are described on Statalist here:
Hope this helps, Anders
Ariel Linden has now updated his Stata program classtabi to fix the two bugs. Hopefully you find it useful for developing a similar confusion matrix in fastLink.
Thanks a lot for sharing this with us! We are close to releasing a new version of the package, and we promise that the new function with a confusion table will be released then.
In addition, we are adding two new functions that will allow the users to compare numeric variables based on the absolute difference between them.
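As a rough idea of what an absolute-difference comparison can look like, here is a base R sketch. It is not the fastLink implementation: the function name `abs_diff_agreement`, its arguments, and the thresholds are all hypothetical, and missing values are not handled.

```r
# Hypothetical helper: compare two numeric vectors pairwise and code each
# pair as 2 (agree), 1 (partially agree), or 0 (disagree) according to the
# absolute difference.
abs_diff_agreement <- function(x, y, cut_agree = 1, cut_partial = 2.5) {
  d <- abs(outer(x, y, "-"))            # |x[i] - y[j]| for every pair (i, j)
  out <- matrix(0L, nrow = length(x), ncol = length(y))
  out[d <= cut_partial] <- 1L           # close enough for partial agreement
  out[d <= cut_agree] <- 2L             # close enough for full agreement
  out
}

age_a <- c(30, 45, 62)
age_b <- c(31, 47, 70)
agreement <- abs_diff_agreement(age_a, age_b)
print(agreement)
```

Rows index the first vector and columns the second, so `agreement[i, j]` is the agreement level between `age_a[i]` and `age_b[j]`.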
I am using fastLink on confidential data and get an error in mclapply(). I am using fastLink version 0.1.1 on Windows 7 with 4 cores.
This is the problematic R command:
This is the problematic R output:
Immediately after the error, I typed traceback() and this is the result:
If I change the syntax from n.cores = 2 to n.cores = 1 (or if I omit the option) then the R output is fine.
I could not reproduce the error on datasets dfA and dfB. The problem with mclapply() on Windows is discussed further at https://www.r-bloggers.com/implementing-mclapply-on-windows-a-primer-on-embarrassingly-parallel-computation-on-multicore-systems-with-r/
Please advise.
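For context on why n.cores = 1 works: on Windows, `parallel::mclapply()` only supports `mc.cores = 1`, because it relies on forking. A common workaround in general R code is to fall back to a socket cluster with `parLapply()`. The sketch below shows that pattern; the helper name `portable_lapply` and its arguments are made up, and this is not how fastLink handles parallelism internally.

```r
library(parallel)

# Hypothetical helper: use a socket cluster on Windows and forking elsewhere.
portable_lapply <- function(X, FUN, n_workers = 2) {
  if (.Platform$OS.type == "windows" && n_workers > 1) {
    cl <- makeCluster(n_workers)        # socket cluster, works on Windows
    on.exit(stopCluster(cl))            # always shut the workers down
    parLapply(cl, X, FUN)
  } else {
    mclapply(X, FUN, mc.cores = n_workers)
  }
}

squares <- portable_lapply(1:4, function(i) i^2, n_workers = 2)
```

With a socket cluster, the workers start as fresh R sessions, so any packages or objects that `FUN` needs must be loaded or exported to them explicitly.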