markvanderloo / stringdist

String distance functions for R

stack overflow warnings/errors when comparing large(ish?) vectors of integers #101

Open c-tho opened 1 year ago

c-tho commented 1 year ago

Running stringdist over two vectors of 100k integers produces stack imbalance warnings at best and aborts the R session at worst:

# Two vectors of 100k random integers 1-12
d1 <- sample(1:12, 100000, replace = TRUE)
d2 <- sample(1:12, 100000, replace = TRUE)
# Compare
v <- stringdist::stringdist(d1, d2, method = "dl")
> Warning: stack imbalance in '<-', 2 then 21342

Running three of these comparisons (one per date component) within a function aborts the session with:

Error: protect(): protection stack overflow
Error: no more error handlers available (recursive errors?); invoking 'abort' restart

This problem can be sidestepped by specifying nthread = 1. The default value of getOption("sd_num_thread") on my machine is 7.
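The workaround can be applied per call or for the whole session. The session-wide variant assumes that the default for nthread is read from the sd_num_thread option at call time. A minimal sketch, using smaller vectors and a seed that are not part of the original report:

```r
library(stringdist)

set.seed(42)  # seed added only to make this sketch repeatable
d1 <- sample(1:12, 1000, replace = TRUE)
d2 <- sample(1:12, 1000, replace = TRUE)

# Option 1: force single-threaded execution for this call only
v1 <- stringdist(d1, d2, method = "dl", nthread = 1)

# Option 2: change the session default that nthread falls back to
# (assumes the default is looked up via getOption("sd_num_thread"))
old <- options(sd_num_thread = 1)
v2 <- stringdist(d1, d2, method = "dl")
options(old)  # restore the previous default

# Both variants should yield the same distances
identical(v1, v2)
```

The per-call form is the safer choice inside package or function code, since it does not alter global state for other callers.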

sessionInfo:

R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8
[3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
[5] LC_TIME=English_Australia.utf8    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] data.table_1.14.8

loaded via a namespace (and not attached):
[1] compiler_4.2.0    cli_3.6.1         parallel_4.2.0    tools_4.2.0
[5] jsonlite_1.8.5    rlang_1.1.1       renv_0.17.3       stringdist_0.9.10

markvanderloo commented 1 year ago

Confirmed, that seems to be a bug. Seems to be independent of the chosen distance.

Edit: This is a bit confusing, because I have worked with stringdist on millions of records before.

Edit: The bug does not reproduce consistently. Running the following script with R -f multiple times sometimes gives a stack imbalance, sometimes not.

library(stringdist)

set.seed(1)
n <- 1000
x <- sample(0:9, size=n, replace=TRUE)
y <- sample(0:9, size=n, replace=TRUE)

out <- stringdist(x,y, method="osa", nthread=2)

It does not seem to occur with nthread=1.

Edit: As stated in the bug report, this only occurs when stringdist is given an integer vector. That is odd, because stringdist does nothing special there: it casts all input to character before any further processing. Even adding a single "a" to x and y in the above script prevents the warning.
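One way to probe the integer-input observation is to cast the vectors to character explicitly before the call and compare the result with a single-threaded reference. Whether the explicit cast actually avoids the stack imbalance is not established in this thread; treat the following as a diagnostic sketch under that assumption, not as a confirmed fix. The variable names xc, yc, out_chr and out_ref are introduced here for illustration only.

```r
library(stringdist)

set.seed(1)
n <- 1000
x <- sample(0:9, size = n, replace = TRUE)
y <- sample(0:9, size = n, replace = TRUE)

# Explicit cast: stringdist now receives character vectors,
# similar in spirit to the experiment of adding "a" to x and y
xc <- as.character(x)
yc <- as.character(y)

# Multi-threaded call on character input
out_chr <- stringdist(xc, yc, method = "osa", nthread = 2)

# Single-threaded call on the original integer input, used as reference
# (nthread = 1 is reported above not to trigger the warning)
out_ref <- stringdist(x, y, method = "osa", nthread = 1)

# The distances themselves should not depend on the input type
identical(out_chr, out_ref)
```

If the multi-threaded call on xc and yc never emits the warning across repeated runs while the same call on x and y does, that would point at the integer-to-character conversion path rather than at the distance computation.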