kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Col::subvec() error with some data #47

Open muranyia opened 3 years ago

muranyia commented 3 years ago

I can run fastLink() with stringdist.match on two variables of my datasets with up to 20k rows, but with more rows I get the following error (and a crash in RStudio) during the "Getting counts for parameter estimation" phase:

error: Col::subvec(): indices out of bounds or incorrectly used
terminate called after throwing an instance of 'std::logic_error'
  what():  Col::subvec(): indices out of bounds or incorrectly used
aalexandersson commented 3 years ago

I am only a regular user, but std::logic_error seems like a C++ error to me. It would help if you provided more details, such as the operating system (e.g., Windows or Linux), the versions of R and fastLink, and the fastLink syntax you used. Are the datasets confidential, or could you share them with the developers if needed to reproduce the error?

muranyia commented 3 years ago

@aalexandersson, you are right, I simply forgot to add these:

fastLink(data.table.1,
         data.table.2,
         varnames=c("FullName", "EMail"),
         stringdist.match=c("FullName", "EMail"),
         cut.a=0.985,
         dedupe.matches=TRUE,
         verbose=TRUE,
         return.df=TRUE,
         n.cores=4)
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          0.3                         
year           2020                        
month          10                          
day            10                          
svn rev        79318                       
language       R                           
version.string R version 4.0.3 (2020-10-10)
nickname       Bunny-Wunnies Freak Out   
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.utf8         LC_COLLATE=en_US.UTF-8     LC_MONETARY=hu_HU.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=hu_HU.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=hu_HU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] fastLink_0.6.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5           compiler_4.0.3       pillar_1.4.6         iterators_1.0.13     prettyunits_1.1.1    FactoClass_1.2.7     tools_4.0.3          progress_1.2.2      
 [9] lifecycle_0.2.0      tibble_3.0.4         gtable_0.3.0         lattice_0.20-41      pkgconfig_2.0.3      rlang_0.4.8          foreach_1.5.1        Matrix_1.2-18       
[17] rstudioapi_0.11      parallel_4.0.3       ggrepel_0.8.2        xfun_0.19            stringr_1.4.0        dplyr_1.0.2          gtools_3.8.2         generics_0.1.0      
[25] vctrs_0.3.4          hms_0.5.3            ade4_1.7-16          grid_4.0.3           tidyselect_1.1.0     scatterplot3d_0.3-41 glue_1.4.2           data.table_1.13.2   
[33] R6_2.5.0             plotrix_3.7-8        adagio_0.7.1         ggplot2_3.3.2        purrr_0.3.4          magrittr_1.5         codetools_0.2-16     scales_1.1.1        
[41] ellipsis_0.3.1       MASS_7.3-53          stringdist_0.9.6.3   colorspace_1.4-1     xtable_1.8-4         tinytex_0.27         KernSmooth_2.23-17   stringi_1.5.3       
[49] munsell_0.5.0        doParallel_1.0.16    crayon_1.3.4        

The datasets are confidential, but I'm trying to narrow down the problem to a reproducible subset. What I've already realized is that it's not about the amount of data, but rather about failing on some cases and not on others. The offending cases just happened to occur after the first 20k rows.

I'll come back with more details, and I'd appreciate and will follow any advice on how to diagnose this.

aalexandersson commented 3 years ago

Yes, it would be great to have a reproducible example. Do you need both variables to reproduce the problem?

muranyia commented 3 years ago

Here's the minimal example:

library(fastLink)

# This works:
dt1 <- data.frame(id=1:200, EMail=c(rep(NA, 199), "foo")) # Not all cases are NA
dt2 <- data.frame(id=1:200, EMail=stringi::stri_rand_strings(200, 10))
flout <- fastLink(dt1,
                  dt2,
                  varnames=c("EMail"),
                  verbose=TRUE)

# This triggers an error (and crashes RStudio)
dt1 <- data.frame(id=1:200, EMail=rep(NA, 200)) # All cases are NA
dt2 <- data.frame(id=1:200, EMail=stringi::stri_rand_strings(200, 10))
flout <- fastLink(dt1,
                  dt2,
                  varnames=c("EMail"),
                  verbose=TRUE)

Please note that in my original real-life situation, NOT all cases were NA, and I was working with a lot more cases.

aalexandersson commented 3 years ago

Do you have one or more rows of data where all linkage variables (i.e., FullName and EMail) are NA? I recommend dropping those rows before using fastLink.

I once experienced a similar fatal error using fastLink when I mistakenly blocked on a variable with NA.

tedenamorado commented 3 years ago

Thanks to both of you!

As @aalexandersson mentions, the problem is that you have a variable with no variation. We warn when there is no variation in a variable for which we observe at least one value, but we do not warn when there is no variation due to NAs.

We will take a close look at this issue in the coming days and let you know when we have pushed a fix. In the meantime, @aalexandersson's suggestion is the way to go.
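A minimal pre-flight check along these lines (a sketch; dfA, dfB, and vars are placeholders for your two data frames and linkage variables) would catch the all-NA case that currently produces no warning:

# Flag linkage variables with no observed values in either dataset.
vars <- c("FullName", "EMail")
for (v in vars) {
  if (all(is.na(dfA[[v]])) || all(is.na(dfB[[v]]))) {
    warning("Variable '", v, "' is entirely NA in one dataset; drop it before linking.")
  }
}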

All my best,

Ted

muranyia commented 3 years ago

I believe I have dropped the cases where both variables are NA (please correct me if not):

> nrow(data.table.1)
[1] 138401
> data.table.1 <- data.table.1[!(is.na(data.table.1$FullName) & is.na(data.table.1$EMail)),]
> nrow(data.table.1)
[1] 138401

> nrow(data.table.2)
[1] 23310
> data.table.2 <- data.table.2[!(is.na(data.table.2$FullName) & is.na(data.table.2$EMail)),]
> nrow(data.table.2)
[1] 15417

I'm still getting the same error.

I need to keep the cases where one of the variables, but not both, is NA.

@tedenamorado Thank you for looking into this. In my full dataset, the variation is definitely there for both variables, but they do contain NAs. My minimal example above used a variable with no variation due to NAs, which doesn't apply to my full dataset.

@aalexandersson I really appreciate your help!

tedenamorado commented 3 years ago

Hi @muranyia,

No problem!

Quick question: what happens if you set cut.a to a lower value? E.g.,

fastLink(data.table.1, data.table.2,
         varnames=c("FullName", "EMail"),
         stringdist.match=c("FullName", "EMail"),
         cut.a=0.90,
         dedupe.matches=TRUE,
         verbose=TRUE,
         return.df=TRUE,
         n.cores=4)

Thanks!

Ted

kosukeimai commented 3 years ago

@muranyia One more possibility: maybe you forgot to remove the @ from the email addresses?

tedenamorado commented 3 years ago

@muranyia, following on @kosukeimai's suggestion, I would try to divide the emails into components, e.g., username, email service, and domain. I would expect little to no typographical error in the last two components, but usernames (similar to the case of names) are more prone to errors.
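For example, a quick way to do the split (a sketch; it assumes well-formed addresses, the new column names are illustrative, and the same applies to data.table.2):

# Split "user@domain.tld" into username and domain; NAs pass through.
parts <- strsplit(data.table.1$EMail, "@", fixed = TRUE)
data.table.1$email_user   <- vapply(parts, function(p) if (length(p) >= 1) p[1] else NA_character_, character(1))
data.table.1$email_domain <- vapply(parts, function(p) if (length(p) >= 2) p[2] else NA_character_, character(1))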

muranyia commented 3 years ago

I still get an error with cut.a=0.90. BTW, most posterior probabilities on my first 20k cases were way higher than 0.99, some of them being 1 (due to precision, I guess) even when the matches were not exact.

I did not remove the '@'s. Is that a requirement?

tedenamorado commented 3 years ago

Thanks for getting back to us! Parsing the emails as discussed above might help.

Quick question: when you say 20K cases, that refers to the larger dataset, right?

Let's try to debug the issue by matching one column at a time:

fastLink(data.table.1, data.table.2,
         varnames=c("FullName"),
         stringdist.match=c("FullName"),
         cut.a=0.90,
         dedupe.matches=TRUE,
         verbose=TRUE,
         return.df=TRUE,
         n.cores=4)

Then run this:

fastLink(data.table.1, data.table.2,
         varnames=c("EMail"),
         stringdist.match=c("EMail"),
         cut.a=0.90,
         dedupe.matches=TRUE,
         verbose=TRUE,
         return.df=TRUE,
         n.cores=4)

Keep us posted!

Ted

muranyia commented 3 years ago

Yes, the 20K+ cases are my full dataset.

Trying to match one column at a time, I got the same error for EMail and a different one for FullName. I couldn't copy and paste it because my system collapsed (possibly due to a full /temp ramdisk?); it was something about emlinkMARmov -> sort() not being able to open a file.

tedenamorado commented 3 years ago

Hi @muranyia ,

It looks like the error you are getting comes from EMail. To check whether the comparisons are possible at all, we can try the following code:

g1 <- gammaCK2par(data.table.1$FullName, data.table.2$FullName, cut.a = 0.90, n.cores = 4)
g2 <- gammaCK2par(data.table.1$EMail, data.table.2$EMail, cut.a = 0.90, n.cores = 4)

Sorry if it is taking a few iterations, but I am sure we will figure out what the problem is.

Ted

aalexandersson commented 3 years ago

@muranyia A reason why @ could be problematic in string comparisons is that it is a special (non-alphanumeric) character. It is also a good idea to restart R if you are running low on resources or get strange results. I am not sure what the best way to do it is, but I usually do it from the RStudio menus: "Session" -> "Restart R".
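The same restart can also be triggered from code; a one-line sketch, assuming an RStudio session:

# Restart the current R session programmatically (RStudio only).
rstudioapi::restartSession()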

I fully agree with Ted.

Anders

muranyia commented 3 years ago

@tedenamorado both commands finished, but then my system collapsed. I guess memory/ramdisk exhaustion is more likely than a memory leak. Are you interested in a rerun while I monitor memory/disk usage?

@aalexandersson I always restarted R, and I abandoned RStudio for these tests and used the CLI instead. As for the @: note that I had no problems whatsoever with the first 20k cases; in fact, the algorithm's results on the email addresses made a lot of sense.

tedenamorado commented 3 years ago

Hi @muranyia,

It looks like parsing is much needed here. Does the variable FullName contain first and last name together? If so, I would separate those pieces of information, e.g., first, middle, and last name. The R package humaniformat can be quite useful for this. Similarly, for emails, I would do what I mentioned above and break the email into components (username, email service, domain). Then you can try fastLink using all the parsed components of the names and email addresses.
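For instance (a sketch; parse_names() assumes names come in roughly "First [Middle] Last" order, which, as the reply below notes, will not hold for every European format):

# Parse full names into components with the humaniformat package.
library(humaniformat)
parsed <- parse_names(data.table.1$FullName)
data.table.1$FirstName <- parsed$first_name
data.table.1$LastName  <- parsed$last_name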

The problem now is that running the algorithm on long strings requires more memory. Parsing the information in those two fields would solve this.

Keep us posted!

Ted

muranyia commented 3 years ago

Dear @tedenamorado,

Thank you for the advice. For the emails, I have no problem separating the parts; in fact, I should not need string-distance matching on them at all. With the names, the challenge is that I'm working with data from a bunch of European countries, where there are many different ways to record the same name (Lastname Firstname, Firstname Lastname, HusbandsLastname Lastname Firstname, HusbandsLastname Firstname, etc., and we're not even at abbreviations). fastLink helps me a lot in finding possible matches in an intelligent, probabilistic way. I'm afraid that by separating the parts I'd lose that advantage. Having said that, I'm OK with experimenting, and that's exactly what my task at hand is.

What I'd need is for the function not to crash. It's OK for it to run out of memory as long as the error is handled; other packages do that too.
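One way to contain such crashes in the meantime (a sketch; it assumes the callr package is installed, and the call mirrors the one from earlier in this thread) is to run fastLink in a child R process:

# Run fastLink in a separate R process via callr; if the child aborts
# (e.g., from a std::logic_error in compiled code), the parent session
# survives and callr raises a regular, catchable R error instead.
library(callr)
res <- tryCatch(
  r(function(d1, d2) {
    fastLink::fastLink(d1, d2,
                       varnames = c("FullName", "EMail"),
                       stringdist.match = c("FullName", "EMail"))
  }, args = list(data.table.1, data.table.2)),
  error = function(e) {
    message("fastLink failed in the child process: ", conditionMessage(e))
    NULL
  }
)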

I'll be more than happy to assist you with "bulletproofing" and to run the tests that you need. For starters, I will rerun the commands and monitor the resources; just give me a few days, as I cannot afford to bring down the whole system right now.

Thank you so much for the awesome package and for your help!

muranyia commented 3 years ago

I was curious so I did the rerun already.

My observations:

I really hope this is not because of faulty hardware, as I wouldn't want to take up your time with something specific to my system. Since it only ever occurs when running this specific command, I trust it isn't.

aalexandersson commented 3 years ago

@muranyia String-distance matching on long strings such as a full name or a full address is not likely to be feasible unless you use manageable parts. With the dataset sizes above (138,401 × 15,417 rows), each string-distance variable implies roughly 2.1 billion pairwise comparisons. Even a more consistent string such as a 9-digit Social Security Number will likely cause similar memory crashes unless you work on small datasets.

muranyia commented 3 years ago

Updates from more experiments:

Notably, I cannot reproduce the problem by synthetically recreating my datasets (with regard to the number of columns, rows, NAs, etc.).

What could be the culprit?

tedenamorado commented 3 years ago

Hi @muranyia,

Thanks for your detailed feedback! If you have tried the code on more than one computer, then one possible avenue is to run the code in plain R, not RStudio. RStudio has crashed for me a few times for reasons I do not understand, but the same does not happen with R.

Ted

aalexandersson commented 3 years ago

@muranyia What is the smallest reproducible example of the error so far, in terms of number of observations for the Email variable?

You wrote:

And started to work when I filtered out cases with NAs.

This FAQ might help: faq-how-to-do-a-minimal-reproducible-example-reprex-for-beginners

Anders

muranyia commented 3 years ago

@aalexandersson The only reliable example so far is the one in my older comment. When not all cases are NA, I haven't been able to craft a reliable one, although the synthesized versions of my real-life datasets induce the crash somewhat reliably.

I did notice that the computation time increases "exponentially" with the amount of NAs.

aalexandersson commented 3 years ago

@muranyia Please summarize the missingness (NAs) in each linkage variable. I use and recommend naniar::miss_var_summary() for that purpose.
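For example (a minimal sketch with the linkage variables used above):

# Per-variable missingness: counts and percentages of NA per column.
library(naniar)
miss_var_summary(data.table.1[, c("FullName", "EMail")])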

My own experience is that fastLink usually performs well up to about 30% NAs. Thereafter, as you experienced, the computation time increases "very quickly" (maybe exponentially) with more NAs. Therefore, I tend not to include linkage variables with over 30% NAs. In those difficult situations with over 30% NAs, I tend to either drop some rows with NAs or drop the problematic variable from the linkage as a workaround.