aalexandersson commented 6 years ago

Thank you for adding the confusion table in version 0.3.0. However, I think I found a bug due to a logical error for the additional statistics ($addition.info). These additional statistics from https://github.com/kosukeimai/fastLink/blob/master/R/confusion.R do not match up with Stata's classtabi command unless you reverse cells A and D.

Based on prior conversation, I assume the reason is that you use the same formulas but, as noted in the help file of classtabi under Remarks, in classtabi cell A = True Negative (TN) and cell D = True Positive (TP) whereas in confusion.R cell A = TP and cell D = TN. This is best illustrated with a simple reproducible example.

Example of confusion.R in fastLink:

# run fastLink() and get the confusion table
> library(fastLink)
> data(samplematch)
> 
> out <- fastLink(
+   dfA = dfA, dfB = dfB, 
+   varnames = c("firstname", "middlename", "lastname"),
+   stringdist.match = c("firstname", "middlename", "lastname"),
+   return.all=TRUE)
> ct <- confusion(out)

# display summary results
> summary(out)
                  95%     85%     75%   Exact
1 Match Count      50      50      50      43
2  Match Rate 14.225% 14.225% 14.225% 12.286%
3         FDR  0.426%  0.426%  0.426%        
4         FNR  1.378%  1.378%  1.378% 

> ct
$confusion.table
                     'True' Matches 'True' Non-Matches
Declared Matches               49.8                0.2
Declared Non-Matches            0.9              331.3

$addition.info
                                           results
Max Number of Obs to be Matched 382.19999999999999
Sensitivity (%)                  99.70000000000000
Specificity (%)                  99.59999999999999
Positive Predicted Value (%)     99.90000000000001
Negative Predicted Value (%)     98.09999999999999
False Positive Rate (%)           0.40000000000000
False Negative Rate (%)           0.30000000000000
Correctly Clasified (%)          99.70000000000000

Reproduce example using classtabi in Stata:

# Multiply all cells by 10 because classtabi requires integers
. classtabi 498 2 9 3313

           |          col
       row |         0          1 |     Total
-----------+----------------------+----------
         0 |       498          2 |       500 
         1 |         9      3,313 |     3,322 
-----------+----------------------+----------
     Total |       507      3,315 |     3,822 

-------------------------------------------------
Sensitivity                     D/(C+D)   99.73%      
Specificity                     A/(A+B)   99.60%      
Positive predictive value       D/(B+D)   99.94%      
Negative predictive value       A/(A+C)   98.22%      
-------------------------------------------------
False positive rate             B/(A+B)    0.40%      
False negative rate             C/(C+D)    0.27%      
-------------------------------------------------
Correctly classified      A+D/(A+B+C+D)   99.71%      
-------------------------------------------------
Effect strength for sensitivity           99.33%      
-------------------------------------------------
ROC area                                  0.9966      
-------------------------------------------------

To see what I think is a bug caused by a logical error, we can compare with Wikipedia. For example, the first statistic listed, "Sensitivity", is defined on Wikipedia and in Stata as TP / (TP + FN), whereas you seem to define it as TN / (FN + TN). A more academic reference is Methodological Developments in Data Linkage by Harron, Goldstein, and Dibben. There (chapter 4, page 81), the definition of "Sensitivity" is again the same as in Stata and Wikipedia. The Stata example below reverses cells A and D to illustrate the difference and to show what I think are the correct results under standard terminology.

Show in Stata what the $addition.info should be:

 
# Multiply all cells by 10 because classtabi requires integers
# Required syntax:     classtabi #a #b #c #d
# Helpfile states: #a = TN, #b = FP, #c = FN, #d = TP
. classtabi 3313 2 9 498

           |          col
       row |         0          1 |     Total
-----------+----------------------+----------
         0 |     3,313          2 |     3,315 
         1 |         9        498 |       507 
-----------+----------------------+----------
     Total |     3,322        500 |     3,822 

-------------------------------------------------
Sensitivity                     D/(C+D)   98.22%      
Specificity                     A/(A+B)   99.94%      
Positive predictive value       D/(B+D)   99.60%      
Negative predictive value       A/(A+C)   99.73%      
-------------------------------------------------
False positive rate             B/(A+B)    0.06%      
False negative rate             C/(C+D)    1.78%      
-------------------------------------------------
Correctly classified      A+D/(A+B+C+D)   99.71%      
-------------------------------------------------
Effect strength for sensitivity           98.16%      
-------------------------------------------------
ROC area                                  0.9908      
-------------------------------------------------
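
As a cross-check on the two classtabi runs above, the same statistics can be computed directly in R. This is only a minimal sketch, not the fastLink source: the helper name `class_stats` and its argument names are hypothetical, and it assumes the standard definitions cited above (sensitivity = TP / (TP + FN), and so on) applied to the cells of the 0.3.0 confusion table.

```r
# Hypothetical helper (not part of fastLink): standard classification
# statistics, in percent, from the four cells of a 2x2 confusion table.
class_stats <- function(tp, fp, fn, tn) {
  n <- tp + fp + fn + tn
  c(
    sensitivity = 100 * tp / (tp + fn),
    specificity = 100 * tn / (tn + fp),
    ppv         = 100 * tp / (tp + fp),
    npv         = 100 * tn / (tn + fn),
    fpr         = 100 * fp / (fp + tn),
    fnr         = 100 * fn / (fn + tp),
    accuracy    = 100 * (tp + tn) / n
  )
}

# Cells read from the 0.3.0 confusion table (declared status x 'true' status).
correct <- class_stats(tp = 49.8, fp = 0.2, fn = 0.9, tn = 331.3)

# Swapping TP and TN mimics the cell reversal described in this report.
swapped <- class_stats(tp = 331.3, fp = 0.2, fn = 0.9, tn = 49.8)

print(round(rbind(correct, swapped), 2))
```

The `swapped` row should line up with the first classtabi run and the `correct` row with the second, which is the whole point of the report: the formulas are fine, but the cells feeding them are reversed.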

tedenamorado commented 6 years ago

Thanks a lot for raising this point, Anders! You are totally right! That is a typo in the function and we will take care of that right away.

Ted

tedenamorado commented 6 years ago

Anders,

Just to let you know, we have fixed the issue. We will submit the fix to CRAN today.

All the best,

Ted

aalexandersson commented 6 years ago

Perfect, thanks Ted! Now I just need to figure out how to extract the matches after clusterMatch() into a dataset to be returned to the data requestor ...

aalexandersson commented 6 years ago

Thank you for trying to fix the bug in version 0.3.1 but, using the same example,

ct <- confusion(out)

now results in the error

object 'A' not found

For debugging, I used

debug(confusion)
ct <- confusion(out)

and the offending line is

C <- sum(object$posterior * ifelse(object$posterior < threshold, 
        1, 0)) + (min(object$nobs.a, object$nobs.b) - A) * 0.001

I think the reason for the error is that the line calls A which is not yet defined.
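
The failure mode can be reproduced outside fastLink. The sketch below is a hypothetical stand-in, not the package code: the function names and the arithmetic are invented for illustration, and it assumes no object named `A` exists in the calling environment. Which quantity was actually intended in place of `A` is not stated in this thread.

```r
# Hypothetical stand-in for the failing pattern (not the fastLink source).
cells_broken <- function(n_max) {
  C <- (n_max - A) * 0.001   # A is read here ...
  A <- n_max * 0.999         # ... but only assigned on the next line
  c(A = A, C = C)
}

# Same arithmetic with the assignment moved before its first use.
cells_ordered <- function(n_max) {
  A <- n_max * 0.999
  C <- (n_max - A) * 0.001
  c(A = A, C = C)
}

# Capture the error message instead of stopping the script.
msg <- tryCatch(cells_broken(350), error = function(e) conditionMessage(e))
print(msg)
print(cells_ordered(350))
```

R evaluates the body top to bottom, so a local variable must be assigned before any line that reads it, unless an object of the same name happens to exist in an enclosing environment.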

aalexandersson commented 6 years ago

Here is the complete output:

> library(fastLink)
> data(samplematch)
> out <- fastLink(
+   dfA = dfA, dfB = dfB, 
+   varnames = c("firstname", "middlename", "lastname"),
+   stringdist.match = c("firstname", "middlename", "lastname"),
+   return.all = TRUE)

==================== 
fastLink(): Fast Probabilistic Record Linkage
==================== 

Calculating matches for each variable.
Getting counts for zeta parameters.
    Parallelizing calculation using OpenMP. 1 threads out of 8 are used.
Running the EM algorithm.
Getting the indices of estimated matches.
    Parallelizing calculation using OpenMP. 1 threads out of 8 are used.
Deduping the estimated matches.
> ct <- confusion(out)
Error in confusion(out) : object 'A' not found
> debug(confusion)
> ct <- confusion(out)
debugging in: confusion(out)
debug: {
    if (!("confusionTable" %in% class(object))) {
        stop("You can only run 'confusion()' if 'return.all = TRUE' in 'fastLink()'.")
    }
    D <- sum(object$posterior * ifelse(object$posterior >= threshold, 
        1, 0))
    B <- sum((1 - object$posterior) * ifelse(object$posterior >= 
        threshold, 1, 0))
    C <- sum(object$posterior * ifelse(object$posterior < threshold, 
        1, 0)) + (min(object$nobs.a, object$nobs.b) - A) * 0.001
    A <- sum((1 - object$posterior) * ifelse(object$posterior < 
        threshold, 1, 0)) + (min(object$nobs.a, object$nobs.b) - 
        A) * (1 - 0.001)
    t1 <- round(rbind(c(A, B), c(C, D)), 1)
    colnames(t1) <- c("'True' Matches", "'True' Non-Matches")
    rownames(t1) <- c("Declared Matches", "Declared Non-Matches")
    N = A + B + C + D
    sens = 100 * D/(C + D)
    spec = 100 * A/(A + B)
    ppv = 100 * D/(B + D)
    npv = 100 * A/(A + C)
    fpr = 100 * B/(A + B)
    fnr = 100 * C/(C + D)
    acc = 100 * (A + D)/N
    t2 <- round(as.matrix(c(N, sens, spec, ppv, npv, fpr, fnr, 
        acc)), digits = 2)
    rownames(t2) <- c("Max Number of Obs to be Matched", "Sensitivity (%)", 
        "Specificity (%)", "Positive Predicted Value (%)", "Negative Predicted Value (%)", 
        "False Positive Rate (%)", "False Negative Rate (%)", 
        "Correctly Clasified (%)")
    colnames(t2) <- "results"
    results <- list()
    results$confusion.table <- t1
    results$addition.info <- round(t2, 1)
    return(results)
}
Browse[2]> 
debug: if (!("confusionTable" %in% class(object))) {
    stop("You can only run 'confusion()' if 'return.all = TRUE' in 'fastLink()'.")
}
Browse[2]> 
debug: D <- sum(object$posterior * ifelse(object$posterior >= threshold, 
    1, 0))
Browse[2]> 
debug: B <- sum((1 - object$posterior) * ifelse(object$posterior >= 
    threshold, 1, 0))
Browse[2]> 
debug: C <- sum(object$posterior * ifelse(object$posterior < threshold, 
    1, 0)) + (min(object$nobs.a, object$nobs.b) - A) * 0.001
Browse[2]> 
Error in confusion(out) : object 'A' not found
> Q
Error: object 'Q' not found
> q
function (save = "default", status = 0, runLast = TRUE) 
.Internal(quit(save, status, runLast))


>

aalexandersson commented 6 years ago

It is better to use debugonce() instead of debug(), because it puts the function in debug mode for a single call only.

There are two other small issues with the confusion table in version 0.3.0 that might become a problem later, when I want to extract one dataset with the matches after clusterMatch(): 1) Rounding: for example, "False negative rate" was 0.30000000000000 rather than Stata's 0.27. 2) dfA and dfB had minor differences according to all.equal(dfA, dfB), but nothing that Stata's cf command detected. It is probably safer to remove all such differences before running clusterMatch().
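
On the first point, the gap between 0.3 and 0.27 is consistent with rounding to one decimal place rather than two. The sketch below is an illustration under that assumption, not a diagnosis of the package; it uses the integer cells passed to classtabi in the first Stata run (C = 9, D = 3313).

```r
# False negative rate from the cells used in the first classtabi run.
fnr <- 100 * 9 / (9 + 3313)

one_digit  <- round(fnr, 1)   # precision seen in $addition.info
two_digits <- round(fnr, 2)   # precision reported by classtabi

# sprintf() yields fixed-decimal text, avoiding long floating-point tails
# when the value is displayed.
print(sprintf("%.1f", fnr))
print(sprintf("%.2f", fnr))
```

Rounding once, at display time, and to the same number of decimals as the reference software keeps the two outputs directly comparable.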

tedenamorado commented 6 years ago

Hi Anders,

Thanks for such a detailed comment! We are fixing the bug right now. We will push a new version for the confusion table function soon (before the end of the day). You will be able to install it through install_github.

If anything else, please keep us posted.

Ted

aalexandersson commented 6 years ago

Thank you Ted,

Using the same example, this is now the output for the confusion table and additional info:

      
> ct
$confusion.table
                     'True' Matches 'True' Non-Matches
Declared Matches              49.79               0.21
Declared Non-Matches           0.27             299.73

$addition.info
                                           results
Max Number of Obs to be Matched 350.00000000000000
Sensitivity (%)                  99.50000000000000
Specificity (%)                  99.90000000000001
Positive Predicted Value (%)     99.59999999999999
Negative Predicted Value (%)     99.90000000000001
False Positive Rate (%)           0.10000000000000
False Negative Rate (%)           0.50000000000000
Correctly Clasified (%)          99.90000000000001

The issue has been fixed in that the additional info matches Stata:

. classtabi 29973 21 27 4979

           |          col
       row |         0          1 |     Total
-----------+----------------------+----------
         0 |    29,973         21 |    29,994 
         1 |        27      4,979 |     5,006 
-----------+----------------------+----------
     Total |    30,000      5,000 |    35,000 

-------------------------------------------------
Sensitivity                     D/(C+D)   99.46%      
Specificity                     A/(A+B)   99.93%      
Positive predictive value       D/(B+D)   99.58%      
Negative predictive value       A/(A+C)   99.91%      
-------------------------------------------------
False positive rate             B/(A+B)    0.07%      
False negative rate             C/(C+D)    0.54%      
-------------------------------------------------
Correctly classified      A+D/(A+B+C+D)   99.86%      
-------------------------------------------------
Effect strength for sensitivity           99.39%      
-------------------------------------------------
ROC area                                  0.9970      
-------------------------------------------------

.

The rounding issue remains but I am fine with closing #22. Thanks again.

tedenamorado commented 6 years ago

Hi Anders,

We just pushed a new version of the function. Again, thanks a lot for your great feedback.

If anything, please let us know.

Ted

aalexandersson commented 6 years ago

Hi Ted,

The new confusion() function looks good to me. A possible feature enhancement for $addition.info is to add F1 Score = (2 * PPV * Sensitivity) / (PPV + Sensitivity).

Two justifications: 1) Wikipedia: https://en.wikipedia.org/wiki/F1_score 2) Academic reference: Hand, D. and Christen, P. 2017 "A note on using the F-measure for evaluating record linkage algorithms". Statistics and Computing, (April) 1-9. (Accessed at https://link.springer.com/content/pdf/10.1007%2Fs11222-017-9746-6.pdf)
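
For illustration, the proposed statistic can be computed from the confusion-table cells as follows. This is a sketch only: `f1_score` is a hypothetical helper, not the function later added to fastLink, and the cell values are taken from the corrected confusion table shown earlier in this thread.

```r
# Hypothetical helper (not the fastLink implementation): F1 score in percent,
# the harmonic mean of positive predictive value and sensitivity.
f1_score <- function(tp, fp, fn) {
  ppv  <- tp / (tp + fp)
  sens <- tp / (tp + fn)
  100 * 2 * ppv * sens / (ppv + sens)
}

# TP, FP and FN from the corrected confusion table.
f1 <- f1_score(tp = 49.79, fp = 0.21, fn = 0.27)
print(round(f1, 2))
```

Note that true negatives do not enter the formula, which is one reason the F-measure is discussed for record linkage, where non-matches vastly outnumber matches.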

tedenamorado commented 6 years ago

Hi Anders,

We totally agree! We just added the F1-score to the confusion table function.

If anything, just let us know.

Ted

tedenamorado commented 6 years ago

Hi Anders,

Just to let you know that my co-author Ben Fifield just pushed a new function called getPatterns. That function recovers the agreement patterns for each matched observation.

Ted

aalexandersson commented 6 years ago

Hi Ted,

Thank you. I just recovered from a bad cold but I expect to test it next week. Does it work after clusterMatch()? Recall I need to use clusterMatch() since our patient dataset has over 3 million observations.

Anders

tedenamorado commented 6 years ago

Hi Anders,

I am glad you are doing better now! Yes, the new function is compatible with clusterMatch(). Once you get to it, please let us know if we can be of any help.

I hope you get well soon!

Ted

aalexandersson commented 6 years ago

Hi Ted,

I now use the latest development version 0.3.2 (on Windows 10, 64-bit) but when I run

library(devtools)
install_github("kosukeimai/fastLink", dependencies = TRUE)
library(fastLink)
data(samplematch)
out <- fastLink(
  dfA = dfA, dfB = dfB,
  varnames = c("firstname", "middlename", "lastname"),
  stringdist.match = c("firstname", "middlename", "lastname"),
  return.all = TRUE)

I now get this error from fastLink():

Error in `[.data.frame`(dfA, matches$inds.a) : undefined columns selected

I did not have this error in version 0.3.1. Please advise.

Anders

aalexandersson commented 6 years ago

I have Windows 7 (not Windows 10):

sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

aalexandersson commented 6 years ago

Even the confusion table in CRAN version 0.3.1 no longer works. Now this code that worked before

install.packages("fastLink")
library(fastLink)
data(samplematch)
out <- fastLink(
  dfA = dfA, dfB = dfB,
  varnames = c("firstname", "middlename", "lastname"),
  stringdist.match = c("firstname", "middlename", "lastname"),
  return.all = TRUE)
ct <- confusion(out)

results in the error message

Error in confusion(out) : object 'A' not found

aalexandersson commented 6 years ago

The recurring error with the CRAN version 0.3.1 is the same as from Feb 5:

debug(confusion)
ct <- confusion(out)

C <- sum(object$posterior * ifelse(object$posterior < threshold, 
        1, 0)) + (min(object$nobs.a, object$nobs.b) - A) * 0.001

tedenamorado commented 6 years ago

Hi Anders,

We will take a look at the issue right now. I will be back with a proper answer shortly.

Ted

tedenamorado commented 6 years ago

Hi Anders,

We have fixed the issue for the development branch. It was a minor typo in the code. Please install fastLink again using devtools.

Keep us posted!

Ted

aalexandersson commented 6 years ago

Thanks Ted,

Now I get a new error message in the development branch.

out <- fastLink(
  dfA = dfA, dfB = dfB,
  varnames = c("firstname", "middlename", "lastname"),
  stringdist.match = c("firstname", "middlename", "lastname"),
  return.all = TRUE)

results in this error:

Error in m_func_par(temp = temp, ptemp = ptemp, natemp = natemp, limit1 = limit.1, : object '_fastLink_m_func_par' not found

Anders

tedenamorado commented 6 years ago

Hi Anders,

I have installed the package using a machine with similar specification as yours and I do not get the error. Do you get any error message during installation?

Ted

aalexandersson commented 6 years ago

Hi Ted,

Thanks for the help, and I am sorry for having caused you extra work because of my user error. The problem has been resolved. No, I did not get any error message during installation. However, I had done several installations of both various development versions and CRAN versions without a clean installation which seems to have caused the problem. Adding this command prior to installation fixed the problem:

remove.packages('fastLink')

(Is that the best way to get a clean installation?) Then, I could get the summary statistics and the confusion table again with these commands:

library(devtools)
install_github("kosukeimai/fastLink",dependencies=TRUE)
library(fastLink)
data(samplematch) 
out <- fastLink(
  dfA = dfA, dfB = dfB, 
  varnames = c("firstname", "middlename", "lastname"),
  stringdist.match = c("firstname", "middlename", "lastname"),
  return.all = TRUE)
ct <- confusion(out)
summary(out)
ct

Since the problem has been resolved, I will close the issue.

Now, I finally am at the stage where I can try to test the new getPatterns() function. How do I use the new getPatterns() function given the commands I typed? It is not obvious. I will open a new issue if I cannot figure it out.

Thanks again! Anders

bfifield commented 6 years ago

Hi Anders -

There should be a new object in out called patterns, which contains the match patterns. Using your example:

library(fastLink)
data(samplematch) 
out <- fastLink(
  dfA = dfA, dfB = dfB, 
  varnames = c("firstname", "middlename", "lastname"),
  stringdist.match = c("firstname", "middlename", "lastname"),
  return.all = TRUE)
out$patterns

bfifield commented 6 years ago

And Anders, I've just pushed a change so that the match patterns correspond exactly with the de-duped matches. Previously, it returned the un-deduped match patterns.

aalexandersson commented 6 years ago

Hi Ben,

Thank you. In the example, the two dataframes out$matches and out$patterns now have the same number of observations, 82, which makes sense. However, in practice I want only a subset of the matches.

How can I get a subset of the matches such as for the default 85% threshold?

I can get the original data for the 82 matched observations like this:

matchesA <- dfA[out$matches$inds.a,]
matchesB <- dfB[out$matches$inds.b,]

However, in the example I want only the 50 matches that correspond to the default 85% threshold (not all 82 matches).

> summary(out)
                  95%     85%     75%   Exact
1 Match Count      50      50      50      43
2  Match Rate 14.225% 14.225% 14.225% 12.286%
3         FDR  0.426%  0.426%  0.426%        
4         FNR  1.381%  1.381%  1.381%

Anders

bfifield commented 6 years ago

You can subset down further using the posterior entry in the out object, which contains the estimated posterior match probability for each matched pair:

library(fastLink)
data(samplematch) 
out <- fastLink(
  dfA = dfA, dfB = dfB, 
  varnames = c("firstname", "middlename", "lastname"),
  stringdist.match = c("firstname", "middlename", "lastname"),
  return.all = TRUE)

## Get all matches
matchesA <- dfA[out$matches$inds.a,]
matchesB <- dfB[out$matches$inds.b,]

## Get all matches above threshold
dim(matchesA[out$posterior >= .85,])
dim(matchesB[out$posterior >= .85,])

head(matchesA[out$posterior >= .85,])
head(matchesB[out$posterior >= .85,])

## Look at patterns above threshold
out$patterns[out$posterior >= .85,]
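
To hand the result to a data requestor, the two subsets above can be combined into a single data frame with one row per retained pair. The following is a sketch under stated assumptions, not part of fastLink: `linked_pairs` is a hypothetical helper, the toy `dfA`, `dfB`, and `out` objects only mimic the structure used above (`out$matches$inds.a`, `out$matches$inds.b`, `out$posterior`) so that the code runs without the package, and it assumes `out$posterior` is aligned row by row with `out$matches`.

```r
# Toy stand-ins so the sketch runs without fastLink; with real data, dfA, dfB
# and out would come from the fastLink() call above.
dfA <- data.frame(firstname = c("ann", "bob", "cy"),
                  lastname  = c("lee", "ray", "orr"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(firstname = c("bob", "anne", "cy"),
                  lastname  = c("ray", "lee", "or"),
                  stringsAsFactors = FALSE)
out <- list(
  matches   = data.frame(inds.a = c(1, 2, 3), inds.b = c(2, 1, 3)),
  posterior = c(0.97, 0.99, 0.40)
)

# Hypothetical helper: one row per pair at or above the threshold, with
# column names suffixed by their source dataset.
linked_pairs <- function(dfA, dfB, out, threshold = 0.85) {
  keep <- out$posterior >= threshold
  a <- dfA[out$matches$inds.a[keep], , drop = FALSE]
  b <- dfB[out$matches$inds.b[keep], , drop = FALSE]
  names(a) <- paste0(names(a), ".A")
  names(b) <- paste0(names(b), ".B")
  data.frame(a, b, posterior = out$posterior[keep], row.names = NULL)
}

linked <- linked_pairs(dfA, dfB, out)
print(linked)
```

The resulting data frame can then be exported with base R, for example via write.csv(linked, "linked.csv", row.names = FALSE), where the file name is a placeholder.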

aalexandersson commented 6 years ago

Ben,

Thank you. That's exactly what I need for most data requests. :1st_place_medal:

Anders

bfifield commented 6 years ago

No problem! Glad to be able to help.

aalexandersson commented 6 years ago

How are FDR and FNR calculated in summary()? I just realized that FNR differs from "False Negative Rate" in confusion(). Similarly, FDR and "False Positive Rate" differ. Are they four different concepts? Example code of linkage:

library(fastLink)
data(samplematch) 
out <- fastLink(
  dfA = dfA, dfB = dfB, 
  varnames = c("firstname", "middlename", "lastname"),
  stringdist.match = c("firstname", "middlename", "lastname"),
  return.all = TRUE)

Example code and output:

> summary(out)
                  95%     85%     75%   Exact
1 Match Count      50      50      50      43
2  Match Rate 14.225% 14.225% 14.225% 12.286%
3         FDR  0.426%  0.426%  0.426%        
4         FNR  1.381%  1.381%  1.381%        
> confusion(out)
$confusion.table
                     'True' Matches 'True' Non-Matches
Declared Matches              49.79               0.21
Declared Non-Matches           0.27             299.73

$addition.info
                                results
Max Number of Obs to be Matched  350.00
Sensitivity (%)                   99.46
Specificity (%)                   99.93
Positive Predicted Value (%)      99.57
Negative Predicted Value (%)      99.91
False Positive Rate (%)            0.07
False Negative Rate (%)            0.54
Correctly Clasified (%)           99.86
F1 Score (%)                      99.52

Stata gives same results as $addition.info:

. classtabi 29973 21 27 4979

(irrelevant output omitted here) 

False positive rate             B/(A+B)    0.07%      
False negative rate             C/(C+D)    0.54% 

(irrelevant output omitted here)

tedenamorado commented 6 years ago

Hi Anders,

The concepts are the same but the data we use to calculate FDR and FNR are different. In the function confusion() the numbers you see are relative to the size of the datasets after enforcing a 1 to 1 matching status i.e., every observation in dataset A (the smaller one) can be matched with at most one observation in dataset B (the larger one). Note that this does not mean that every observation in A can be located in B (that is why we have the probability of being a match to weight results), but that in the case where there is a 100% overlap between datasets, we could end up matching one observation in A to one observation in B.

In the summary function, we use the total number of pairs between the two datasets to construct the same rates. We are working on adjusting the numbers from both functions so that they are consistent and match the confusion table numbers, as that is the approach we argue in favor of in our paper.

If anything remains unclear, please do not hesitate to let us know.

All the best,

Ted

aalexandersson commented 6 years ago

Hi Ted,

Yes, this makes sense. Thanks again and best wishes.

Anders

kosukeimai / fastLink

Confusing $addition.info from new function confusion() #22
