h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

R gc is choking during groupby 1e9 k=2 #110

Open jangorecki opened 4 years ago

jangorecki commented 4 years ago

The script gets stuck (and is eventually killed after exceeding the timeout) because R's gc takes too much time. Without the timeout, the script is killed by the OS after around 6 hours. Even if it could finish at some point, this behaviour is not acceptable. A package-agnostic reproducible example should be produced and submitted to R-devel so the behaviour can be investigated.
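For anyone reproducing this by hand, a wall-clock limit can be put around a run with coreutils `timeout`, which exits with status 124 when the limit is hit. This is only a minimal sketch: `run_limited` is a hypothetical helper name, and the benchmark's own timeout mechanism may work differently.

```shell
# Hypothetical wrapper (not part of the benchmark): run a command under a
# wall-clock limit using coreutils `timeout` and report how it ended.
run_limited() {
  limit="$1"; shift
  timeout "$limit" "$@"
  status=$?
  if [ "$status" -eq 124 ]; then
    # coreutils `timeout` returns 124 when the time limit was reached
    echo "timed out after $limit: $*"
  elif [ "$status" -ne 0 ]; then
    echo "exited with status $status: $*"
  else
    echo "finished: $*"
  fi
  return "$status"
}
```

It could be called as `run_limited 2h Rscript groupby-datatable.R`, where the two-hour limit and the script path are illustrative assumptions.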

This produces the 1e9-row, K=2 (unbalanced) dataset:

cd data
Rscript ../_data/groupby-datagen.R 1e9 2e0 0 0

Running the data.table and dplyr groupby scripts on this dataset, on a machine with 125GB of memory, then reproduces the issue. Note that recent dplyr fails even sooner due to #152, so an older version should be used instead.

jangorecki commented 3 years ago

It seems that this issue affects more than the 1e9_2e0_0_0 data case. The data case that starts right after the k=2e0 failure also failed, at the very beginning.

stdout

# groupby-datatable.R
loading dataset G1_1e9_1e2_0_1
System errno 22 unmapping file: Invalid argument

stderr

Error in fread(src_grp, showProgress = FALSE, stringsAsFactors = TRUE) : 
  Opened 47.09GB (50558868357 bytes) file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available.
Execution halted

Update: there is now a 15-second sleep between benchmark scripts. That seems to eliminate the undesired impact of the previous script on the next one.
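The pause between scripts can be sketched as a small shell loop. This is an illustration under assumptions, not the benchmark's actual launcher: `run_with_pause` is a hypothetical helper, and the commands passed to it are placeholders.

```shell
# Hypothetical runner (not the benchmark's launcher): execute each command in
# turn and sleep between consecutive commands, so memory released by one
# process can be reclaimed before the next one starts.
run_with_pause() {
  pause="$1"; shift
  first=1
  for cmd in "$@"; do
    # No sleep before the first command, only between commands
    if [ "$first" -eq 0 ]; then
      sleep "$pause"
    fi
    first=0
    echo "running: $cmd"
    # A failing command is reported but does not stop the remaining ones
    sh -c "$cmd" || echo "failed: $cmd (exit $?)"
  done
}
```

For example, `run_with_pause 15 "Rscript groupby-datatable.R" "Rscript groupby-dplyr.R"` would wait 15 seconds between the two runs; the script paths here are assumed for illustration.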

jangorecki commented 3 years ago

This problem has been described in https://bugs.r-project.org/bugzilla/show_bug.cgi?id=18003