Open muranyia opened 3 years ago
I am only a regular user but std::logic_error seems like a C++ error to me. I assume it would be helpful if you provide more details such as operating system (e.g., Windows or Linux), versions of R and fastLink, and the fastLink syntax that you used. Are the datasets confidential, or could you share them with the developers if needed to reproduce the error?
@aalexandersson, you are right, I simply forgot to add these:
fastLink(data.table.1,
         data.table.2,
         varnames = c("FullName", "EMail"),
         stringdist.match = c("FullName", "EMail"),
         cut.a = 0.985,
         dedupe.matches = TRUE,
         verbose = TRUE,
         return.df = TRUE,
         n.cores = 4)
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 4
minor 0.3
year 2020
month 10
day 10
svn rev 79318
language R
version.string R version 4.0.3 (2020-10-10)
nickname Bunny-Wunnies Freak Out
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.utf8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=hu_HU.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=hu_HU.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=hu_HU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] fastLink_0.6.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 compiler_4.0.3 pillar_1.4.6 iterators_1.0.13 prettyunits_1.1.1 FactoClass_1.2.7 tools_4.0.3 progress_1.2.2
[9] lifecycle_0.2.0 tibble_3.0.4 gtable_0.3.0 lattice_0.20-41 pkgconfig_2.0.3 rlang_0.4.8 foreach_1.5.1 Matrix_1.2-18
[17] rstudioapi_0.11 parallel_4.0.3 ggrepel_0.8.2 xfun_0.19 stringr_1.4.0 dplyr_1.0.2 gtools_3.8.2 generics_0.1.0
[25] vctrs_0.3.4 hms_0.5.3 ade4_1.7-16 grid_4.0.3 tidyselect_1.1.0 scatterplot3d_0.3-41 glue_1.4.2 data.table_1.13.2
[33] R6_2.5.0 plotrix_3.7-8 adagio_0.7.1 ggplot2_3.3.2 purrr_0.3.4 magrittr_1.5 codetools_0.2-16 scales_1.1.1
[41] ellipsis_0.3.1 MASS_7.3-53 stringdist_0.9.6.3 colorspace_1.4-1 xtable_1.8-4 tinytex_0.27 KernSmooth_2.23-17 stringi_1.5.3
[49] munsell_0.5.0 doParallel_1.0.16 crayon_1.3.4
The datasets are confidential, but I'm trying to narrow down the problem to a reproducible subset. What I've already realized is that it's not about the amount of data, but rather about failing on some cases and not on others. The offending cases just happened to occur after the first 20k rows.
I'll come back with more details, and I'll appreciate and follow advice in order to diagnose.
Yes, it would be great to have a reproducible problem. Do you need both variables to reproduce the problem?
Here's the minimal example:
library(fastLink)

# This works (not all cases are NA):
dt1 <- data.frame(id = 1:200, EMail = c(rep(NA, 199), "foo"))
dt2 <- data.frame(id = 1:200, EMail = stringi::stri_rand_strings(200, 10))
flout <- fastLink(dt1,
                  dt2,
                  varnames = c("EMail"),
                  verbose = TRUE)

# This triggers an error (and crashes RStudio): all cases are NA
dt1 <- data.frame(id = 1:200, EMail = rep(NA, 200))
dt2 <- data.frame(id = 1:200, EMail = stringi::stri_rand_strings(200, 10))
flout <- fastLink(dt1,
                  dt2,
                  varnames = c("EMail"),
                  verbose = TRUE)
Please note that in my original real-life situation, NOT all cases were NA, and I was working with a lot more cases.
Do you have one or more rows of data where all linkage variables (i.e., FullName and EMail) are NA? I recommend dropping those rows before using fastLink.
I once experienced a similar fatal error using fastLink when I mistakenly blocked on a variable with NA.
Thanks to both of you!
As @aalexandersson mentions, the problem is that you have a variable with no variation. We have warnings when there is no variation in a variable for which we observe at least one value, but we do not have warnings when there is no variation due to NAs.
We will take a close look at this issue in the coming days and let you know when we have pushed a fix. In the meantime, @aalexandersson's suggestion is the way to go.
All my best,
Ted
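The failure mode Ted describes can be screened for before calling fastLink(). Here is a minimal, dependency-free sketch; `check_variation` is a hypothetical helper name, not part of fastLink:

```r
# Hypothetical helper: flag linkage variables that are entirely NA or
# constant, since fastLink cannot estimate agreement patterns for them.
check_variation <- function(df, varnames) {
  sapply(varnames, function(v) {
    x <- df[[v]]
    if (all(is.na(x))) return("all NA")
    if (length(unique(x[!is.na(x)])) < 2) return("no variation")
    "ok"
  })
}

# The all-NA column from the minimal example above is flagged:
dt1 <- data.frame(id = 1:200, EMail = rep(NA_character_, 200))
check_variation(dt1, "EMail")
```

Dropping (or simply not passing) any flagged variable should avoid the crash reported here.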
I believe I have dropped the cases where both variables are NA (please correct me if not):
> nrow(data.table.1)
[1] 138401
> data.table.1 <- data.table.1[!(is.na(data.table.1$FullName) & is.na(data.table.1$EMail)),]
> nrow(data.table.1)
[1] 138401
> nrow(data.table.2)
[1] 23310
> data.table.2 <- data.table.2[!(is.na(data.table.2$FullName) & is.na(data.table.2$EMail)),]
> nrow(data.table.2)
[1] 15417
I'm still getting the same error.
I need to keep the cases where either variable, but not both, is NA.
@tedenamorado Thank You for looking into this. On my full dataset, the variation is definitely there for both variables, but they indeed contain NAs. My minimal example above used a variable with no variation due to NAs, which doesn't apply to my full dataset.
@aalexandersson I really appreciate your help!
Hi @muranyia,
No problem!
Quick question: what happens if you set cut.a to a lower value? For example:
fastLink(data.table.1, data.table.2,
         varnames = c("FullName", "EMail"),
         stringdist.match = c("FullName", "EMail"),
         cut.a = 0.90,
         dedupe.matches = TRUE,
         verbose = TRUE,
         return.df = TRUE,
         n.cores = 4)
Thanks!
Ted
@muranyia One more possibility: maybe, you forgot to remove @ from the email addresses?
@muranyia, following up on @kosukeimai's suggestion, I would try to divide the emails into components, e.g., username, email service, and domain. I would expect little to no typographical errors in the last two components, but usernames (similar to the case of names) are more prone to errors.
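To make the suggested parsing concrete, here is a base-R sketch; `split_email` is a hypothetical helper, it assumes simple user@domain addresses and does not further separate the email service from the domain:

```r
# Split an email address at the first "@" so that fuzzy matching can be
# restricted to the error-prone username part.
split_email <- function(x) {
  username <- sub("@.*$", "", x)
  domain   <- ifelse(grepl("@", x), sub("^[^@]*@", "", x), NA_character_)
  data.frame(username = username, domain = domain, stringsAsFactors = FALSE)
}

split_email(c("jane.doe@example.com", "j.doe@mail.example.org"))
```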
I still get an error with cut.a=0.90.
BTW, I had most posterior probabilities way higher than 0.99 on my first 20k cases, some of them being 1 (I guess, due to precision) even when the matches were not exact.
The '@'s I did not remove. Is that a requirement?
Thanks for getting back to us! Parsing the emails as discussed above might help.
Quick question: When you say 20K cases, that is referring to the larger dataset, right?
Let's try to debug the issue by matching one column at a time:
fastLink(data.table.1, data.table.2,
         varnames = c("FullName"),
         stringdist.match = c("FullName"),
         cut.a = 0.90,
         dedupe.matches = TRUE,
         verbose = TRUE,
         return.df = TRUE,
         n.cores = 4)
Then run this:
fastLink(data.table.1, data.table.2,
         varnames = c("EMail"),
         stringdist.match = c("EMail"),
         cut.a = 0.90,
         dedupe.matches = TRUE,
         verbose = TRUE,
         return.df = TRUE,
         n.cores = 4)
Keep us posted!
Ted
Yes, the 20K+ cases are my full dataset.
Trying to match one column at a time, I got the same error for EMail and a different one for FullName. I couldn't copy and paste the message because my system collapsed (possibly due to a full /temp ramdisk?); it was something about emlinkMARmov -> sort() not being able to open a file.
Hi @muranyia,
It looks like the error you are getting comes from EMail. To check if comparisons are possible, we can try the following code:
g1 <- gammaCK2par(data.table.1$FullName, data.table.2$FullName, cut.a = 0.90, n.cores = 4)
g2 <- gammaCK2par(data.table.1$EMail, data.table.2$EMail, cut.a = 0.90, n.cores = 4)
Sorry if it is taking a few iterations, but I am sure we will figure out what the problem is.
Ted
@muranyia It is also a good idea to restart R if you are running low on resources or get strange results. I am not sure what the best way to do it is, but I usually do it from the RStudio menus: "Session" -> "Restart R". A reason why @ could be problematic in string comparisons is that it is a non-alphanumeric character.
I fully agree with Ted.
Anders
@tedenamorado both commands finished but then my system collapsed. I guess memory/ramdisk exhaustion is more likely than a memory leak. Are you interested in a rerun while monitoring memory/disks?
@aalexandersson I always restarted R, and I abandoned RStudio for the tests and used the CLI instead. As for @, note that I had no problems whatsoever with the first 20k cases; in fact, the algorithm handled the email addresses sensibly.
Hi @muranyia,
It looks like parsing is much needed here. Does the variable FullName contain first and last name together? If so, I would separate those pieces of information, e.g., first, middle, and last name. The R package humaniformat can be quite useful for this. Similarly, for emails, I would do what I mentioned above: break the email into components (username, email service, domain). Then you can try fastLink using all the parsed components of the names and email addresses.
The problem now is that running the algorithm on long strings requires more memory. Parsing the information in those two fields would solve this.
Keep us posted!
Ted
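As a rough illustration of the name-parsing idea (humaniformat, mentioned above, is the more robust option), here is a dependency-free sketch; `split_name` is a hypothetical helper that assumes "Firstname ... Lastname" order, an assumption that does not hold for all of the European conventions discussed in this thread:

```r
# Naive name splitter: first token as first name, last token as last name.
# Middle tokens (middle names, initials) are ignored here.
split_name <- function(x) {
  parts <- strsplit(trimws(x), "\\s+")
  data.frame(
    first = vapply(parts, function(p) p[1], character(1)),
    last  = vapply(parts, function(p) p[length(p)], character(1)),
    stringsAsFactors = FALSE
  )
}

split_name(c("Jane Doe", "John Q. Public"))
```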
Dear @tedenamorado,
Thank you for the advice. For the emails, I have no problem separating the parts; in fact, I should not need string distance matching on them at all. With the names, the challenge is that I'm working with data from a bunch of European countries, where there are many different ways to record the same name (Lastname Firstname, Firstname Lastname, HusbandsLastname Lastname Firstname, HusbandsLastname Firstname, etc., and that's before we even get to abbreviations). fastLink helps me a lot in finding possible matches in an intelligent, probabilistic way. I'm afraid that by separating the parts I'd lose that advantage. Having said that, I'm OK with experimenting, and that's exactly what my task at hand is.
What I'd need is for the function not to crash. It's OK to run out of memory as long as the error is handled; other packages do that too.
I'll be more than happy to assist you with "bulletproofing" and to do the tests that you need. For starters, I will rerun the commands and monitor the resources; just give me a few days, as I cannot afford to bring down the whole system right now.
Thank you so much for the awesome package and for your help!
I was curious so I did the rerun already.
My observations:
I really hope this is not because of faulty hardware, as I wouldn't want to take up your time with something specific to my system. Since it only ever occurs when running this specific command, I trust it isn't.
@muranyia String distance matching on long strings such as a full name or full address is unlikely to be feasible unless you use manageable parts. Even a more consistent string, such as a 9-digit Social Security Number, will likely cause similar memory crashes unless you work with small datasets.
Updates from more experiments:
Notably, I cannot reproduce the problem by trying to synthetically recreate my datasets (with regard to the number of columns, rows, NAs, etc.).
What could be the culprit?
Hi @muranyia,
Thanks for your detailed feedback! If you have tried the code on more than one computer, then one possible avenue is to run the code in plain R, not RStudio. A few times RStudio has crashed for me for reasons I do not understand, but the same does not happen with R.
Ted
@muranyia What is the smallest reproducible example of the error so far, in terms of number of observations for the Email variable?
You wrote:
And started to work when I filtered out cases with NAs.
This FAQ might help: faq-how-to-do-a-minimal-reproducible-example-reprex-for-beginners
Anders
@aalexandersson The only reliable example so far is the one in my older comment. When not all cases are NA, I haven't been able to craft a reliable one, although synthesized versions of my real-life datasets induce the crash somewhat reliably.
I did notice that the computation time increases "exponentially" with the number of NAs.
@muranyia Please summarize the missingness (NAs) in each linkage variable. I use and recommend naniar::miss_var_summary() for that purpose.
My own experience is that fastLink usually performs well up to about 30% NAs. Thereafter, as you experienced, the computation time increases "very quickly" (maybe exponentially) with more NAs. Therefore, I tend not to include linkage variables with over 30% NAs. In those difficult situations, I tend to either drop some rows with NAs or drop the problematic variable from the linkage as a workaround.
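If naniar is not available, the same per-variable summary can be sketched in base R; `miss_summary` is a hypothetical stand-in for naniar::miss_var_summary():

```r
# Percent of missing (NA) values per variable, highest first.
miss_summary <- function(df) {
  pct <- vapply(df, function(x) 100 * mean(is.na(x)), numeric(1))
  out <- data.frame(variable = names(pct), pct_miss = unname(pct))
  out[order(-out$pct_miss), ]
}

df <- data.frame(FullName = c("a", NA, "c", NA), EMail = c(NA, NA, NA, "x"))
miss_summary(df)
# EMail 75, FullName 50 -> EMail exceeds the ~30% rule of thumb above
```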
I can run fastLink() stringdist.match on two variables on my datasets with up to 20k rows but with more rows, I get the following error (and a crash in RStudio) during the "Getting counts for parameter estimation" phase: