h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

test data.table set to use only 50% of cpus #202

Closed jangorecki closed 3 years ago

jangorecki commented 3 years ago

Analyze the impact of using 50% of CPUs for data.table (its default) vs. the 100% currently set in the benchmark.
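
For context, a minimal sketch of the two configurations being compared, expressed with data.table's thread-control functions; the benchmark's own launcher may configure this differently, and the 40 logical CPUs below are assumed from the 40- vs 20-thread runs discussed in this issue.

library(data.table)

# data.table default: 50% of logical CPUs (20 threads on a 40-logical-CPU machine)
setDTthreads(percent = 50)

# setting the benchmark currently uses: 100% of logical CPUs (40 threads)
setDTthreads(percent = 100)

getDTthreads() # check how many threads data.table will actually use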

jangorecki commented 3 years ago

Output of the https://github.com/h2oai/db-benchmark/blob/63fb0fca9d7f4d9806c01e418000e42bb31778d7/_utils/compare-data.table.R script, comparing the last two runs of data.table (40 threads vs 20 threads).

source("_utils/time.R")
if (system("tail -1 time.csv | cut -d',' -f2", intern=TRUE)!="1621364165")
  stop("time.csv and logs.csv should be as of 1621364165 batch run, filter out newer rows in those files")

## groupby ----

# last two data.table groupby runs: 2021-05-17 used 40 threads, 2021-05-18 used 20 threads
d = tail.time("data.table", "groupby", i=c(1L, 2L))
setnames(d, c("20210517_2f2f62d","20210518_2f2f62d"), c("th_40","th_20"))
if (nrow(d[(is.na(th_40) & !is.na(th_20)) | (!is.na(th_40) & is.na(th_20))])) {
  stop("number of threads had an impact on completion of queries")
} else {
  d = d[!is.na(th_40)]
}
d[, th_40_20:=th_40/th_20] # ratio of 40-thread to 20-thread timing; <1 means 40 threads were faster

## improvement
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(in_rows)]
#   in_rows      mean    median
#1:     1e7 1.0242721 0.9609988
#2:     1e8 0.9378870 0.9455267
#3:     1e9 0.9506561 0.9569359
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(knasorted)]
#                                         knasorted      mean    median
#1:   1e2 cardinality factor, 0% NAs, unsorted data 1.0393667 0.9538973
#2:   1e1 cardinality factor, 0% NAs, unsorted data 0.9521915 0.9544223
#3:   2e0 cardinality factor, 0% NAs, unsorted data 0.9604950 0.9569359
#4: 1e2 cardinality factor, 0% NAs, pre-sorted data 0.9371154 0.9487804
#5:   1e2 cardinality factor, 5% NAs, unsorted data 0.9678192 0.9598999
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(question_group)]
#   question_group      mean    median
#1:          basic 0.9548596 0.9301310
#2:       advanced 0.9897345 0.9806791

## worst case by data
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(in_rows, knasorted)][which.max(mean)]
#   in_rows                                     knasorted     mean    median
#1:     1e7 1e2 cardinality factor, 0% NAs, unsorted data 1.239259 0.9620776
## best case by data
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(in_rows, knasorted)][which.min(mean)]
#   in_rows                                     knasorted      mean    median
#1:     1e8 1e2 cardinality factor, 0% NAs, unsorted data 0.9235102 0.9200373

## worst case for single question
d[which.max(th_40_20)]
#   in_rows                                     knasorted question_group          question th_40 th_20 th_40_20
#1:     1e7 1e2 cardinality factor, 0% NAs, unsorted data          basic sum v1 by id1:id2 0.413 0.118      3.5
## best case for single question
d[which.min(th_40_20)]
#   in_rows                                     knasorted question_group              question th_40  th_20  th_40_20
#1:     1e9 1e2 cardinality factor, 5% NAs, unsorted data          basic sum v1 mean v3 by id3 15.22 21.104 0.7211903

## join ----

d = tail.time("data.table", "join", i=c(1L, 2L))
setnames(d, c("20210517_2f2f62d","20210518_2f2f62d"), c("th_40","th_20"))
if (nrow(d[(is.na(th_40) & !is.na(th_20)) | (!is.na(th_40) & is.na(th_20))])) {
  stop("number of threads had an impact on completion of queries")
} else {
  d = d[!is.na(th_40)]
}
d[, th_40_20:=th_40/th_20]

## improvement
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(in_rows)]
#   in_rows      mean    median
#1:     1e7 1.0149302 1.0000000
#2:     1e8 0.9143243 0.9008573
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(knasorted)]
#                 knasorted      mean    median
#1:   0% NAs, unsorted data 0.9385902 0.9144130
#2:   5% NAs, unsorted data 0.9612286 0.9294773
#3: 0% NAs, pre-sorted data 0.9940629 0.9705720

## worst case by data
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(in_rows, knasorted)][which.max(mean)]
#   in_rows               knasorted     mean median
#1:     1e7 0% NAs, pre-sorted data 1.055906   1.05
## best case by data
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(in_rows, knasorted)][which.min(mean)]
#   in_rows             knasorted      mean    median
#1:     1e8 0% NAs, unsorted data 0.8983325 0.8773762

## worst case for single question
d[which.max(th_40_20)]
#   in_rows             knasorted               question th_40 th_20 th_40_20
#1:     1e7 5% NAs, unsorted data medium inner on factor 0.513 0.443 1.158014
## best case for single question
d[which.min(th_40_20)]
#   in_rows             knasorted            question th_40 th_20  th_40_20
#1:     1e8 0% NAs, unsorted data medium outer on int 8.143 9.558 0.8519565
jangorecki commented 3 years ago

groupby

We can see that on the smallest data size (0.5 GB), using 100% of CPUs instead of 50% can actually degrade performance: on average it is about 2.5% slower, and for balanced data (k=1e2) the slowdown reaches about 4%. The other data cases are generally faster with 100%, including balanced (k=1e2) pre-sorted data, where the speed-up is up to 7%. The question group also makes a difference: simple, easier-to-optimize questions benefit more from extra threads. The worst case by data is about a 24% average slowdown for 1e7 rows, balanced k=1e2, unsorted data; note that by the median it is still 4% faster to use 100%. The best case by data is an 8% average speed-up for 1e8 rows, balanced k=1e2, unsorted data. The worst case for a single question was 3.5x slower (0.413s vs 0.118s) on 1e7 rows, k=1e2, unsorted data, "sum v1 by id1:id2". The best case for a single question was a 28% speed-up (15.22s vs 21.104s) on 1e9 rows, k=1e2, 5% NAs, unsorted data, "sum v1 mean v3 by id3".

It seems that only the smallest size is actually sensitive to the number of threads. The 3.5x slowdown in a single question may look serious, but remember that this is not an average: its total timing is well under a second, so it is highly sensitive to any noise, even small background processes running on the machine. The next-worst single question slowed down by only 26%, so we can treat the 3.5x case as an outlier.
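
As a side note, a quick sketch of how the ranked per-question slowdowns could be listed to spot such outliers; it assumes the groupby d and the th_40_20 column built in the script above, before d is reassigned for the join section.

## top 3 per-question slowdowns in groupby; assumes the groupby d from the script above
d[order(-th_40_20)][1:3, .(in_rows, knasorted, question, th_40, th_20, th_40_20)]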

jangorecki commented 3 years ago

join

We can see that on the smallest data size (0.5 GB), using 100% vs 50% of CPUs has a marginal impact on performance, while on the 5 GB data the speed-up is clearly visible. Pre-sorted data benefits less from more threads. The worst case by data is just over a 5% average slowdown for 1e7 rows, pre-sorted. The best case by data is an 11% average speed-up for 1e8 rows, unsorted. The worst case for a single question was a 15% slowdown on 1e7 rows, 5% NAs, unsorted data, "medium inner on factor". The best case for a single question was a 15% speed-up on 1e8 rows, unsorted data, "medium outer on int".
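
A hedged sketch of how the spread of these ratios could be inspected directly; it assumes the join d and the th_40_20 column built in the script above.

## quartiles of the 40- vs 20-thread ratio by data size; assumes the join d from the script above
d[, as.list(quantile(th_40_20, probs = c(0, 0.25, 0.5, 0.75, 1))), by = in_rows]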

Join is less sensitive to the number of threads, but it is also less parallelized as of now. Once bmerge in data.table is well parallelized, this join analysis should be repeated.

jangorecki commented 3 years ago

The overall conclusion seems to be that the difference between 50% (the default) and 100% is not that significant: users can safely stay on the default and will not lose much. If users are after performance and working with larger data, then setting more threads is worthwhile. This of course assumes the process does not run in a shared environment. In a shared environment, or on a desktop computer where the user performs other activities, I would advise keeping the default 50%. fyi @mattdowle
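
For users who do want the extra performance on a dedicated machine, a short usage sketch with data.table's thread settings:

library(data.table)
getDTthreads(verbose = TRUE)  # inspect the current thread settings
setDTthreads(percent = 100)   # dedicated machine, larger data: use all logical CPUs
setDTthreads(percent = 50)    # shared environment or desktop: stay on the 50% default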

bkamins commented 3 years ago

And this is what I have assumed :). Thank you for running these tests!