Closed jangorecki closed 3 years ago
Output of the https://github.com/h2oai/db-benchmark/blob/63fb0fca9d7f4d9806c01e418000e42bb31778d7/_utils/compare-data.table.R script, comparing the last two runs of data.table (40 threads vs 20 threads).
source("_utils/time.R")
if (system("tail -1 time.csv | cut -d',' -f2", intern=TRUE)!="1621364165")
stop("time.csv and logs.csv should be as of 1621364165 batch run, filter out newer rows in those files")
## groupby ----
d = tail.time("data.table", "groupby", i=c(1L, 2L))
setnames(d, c("20210517_2f2f62d","20210518_2f2f62d"), c("th_40","th_20"))
if (nrow(d[(is.na(th_40) & !is.na(th_20)) | (!is.na(th_40) & is.na(th_20))])) {
stop("number of threads had an impact on completion of queries")
} else {
d = d[!is.na(th_40)]
}
d[, th_40_20:=th_40/th_20]
## improvement
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(in_rows)]
# in_rows mean median
#1: 1e7 1.0242721 0.9609988
#2: 1e8 0.9378870 0.9455267
#3: 1e9 0.9506561 0.9569359
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(knasorted)]
# knasorted mean median
#1: 1e2 cardinality factor, 0% NAs, unsorted data 1.0393667 0.9538973
#2: 1e1 cardinality factor, 0% NAs, unsorted data 0.9521915 0.9544223
#3: 2e0 cardinality factor, 0% NAs, unsorted data 0.9604950 0.9569359
#4: 1e2 cardinality factor, 0% NAs, pre-sorted data 0.9371154 0.9487804
#5: 1e2 cardinality factor, 5% NAs, unsorted data 0.9678192 0.9598999
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(question_group)]
# question_group mean median
#1: basic 0.9548596 0.9301310
#2: advanced 0.9897345 0.9806791
## worst case by data
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(in_rows, knasorted)][which.max(mean)]
# in_rows knasorted mean median
#1: 1e7 1e2 cardinality factor, 0% NAs, unsorted data 1.239259 0.9620776
## best case by data
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(in_rows, knasorted)][which.min(mean)]
# in_rows knasorted mean median
#1: 1e8 1e2 cardinality factor, 0% NAs, unsorted data 0.9235102 0.9200373
## worst case for single question
d[which.max(th_40_20)]
# in_rows knasorted question_group question th_40 th_20 th_40_20
#1: 1e7 1e2 cardinality factor, 0% NAs, unsorted data basic sum v1 by id1:id2 0.413 0.118 3.5
## best case for single question
d[which.min(th_40_20)]
# in_rows knasorted question_group question th_40 th_20 th_40_20
#1: 1e9 1e2 cardinality factor, 5% NAs, unsorted data basic sum v1 mean v3 by id3 15.22 21.104 0.7211903
## join ----
d = tail.time("data.table", "join", i=c(1L, 2L))
setnames(d, c("20210517_2f2f62d","20210518_2f2f62d"), c("th_40","th_20"))
if (nrow(d[(is.na(th_40) & !is.na(th_20)) | (!is.na(th_40) & is.na(th_20))])) {
stop("number of threads had an impact on completion of queries")
} else {
d = d[!is.na(th_40)]
}
d[, th_40_20:=th_40/th_20]
## improvement
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(in_rows)]
# in_rows mean median
#1: 1e7 1.0149302 1.0000000
#2: 1e8 0.9143243 0.9008573
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(knasorted)]
# knasorted mean median
#1: 0% NAs, unsorted data 0.9385902 0.9144130
#2: 5% NAs, unsorted data 0.9612286 0.9294773
#3: 0% NAs, pre-sorted data 0.9940629 0.9705720
## worst case by data
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(in_rows, knasorted)][which.max(mean)]
# in_rows knasorted mean median
#1: 1e7 0% NAs, pre-sorted data 1.055906 1.05
## best case by data
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(in_rows, knasorted)][which.min(mean)]
# in_rows knasorted mean median
#1: 1e8 0% NAs, unsorted data 0.8983325 0.8773762
## worst case for single question
d[which.max(th_40_20)]
# in_rows knasorted question th_40 th_20 th_40_20
#1: 1e7 5% NAs, unsorted data medium inner on factor 0.513 0.443 1.158014
## best case for single question
d[which.min(th_40_20)]
# in_rows knasorted question th_40 th_20 th_40_20
#1: 1e8 0% NAs, unsorted data medium outer on int 8.143 9.558 0.8519565
We can see that on the smallest data size (0.5GB), using 100% of CPUs vs 50% can actually degrade performance: on average it is slower by about 2.5%, and on balanced data (k=1e2) the slowdown reaches 4%. The other data cases are generally faster at 100%, including balanced (k=1e2) pre-sorted data, where the speed-up is up to 7%. The question group also makes a difference: simple, easier-to-optimize questions benefit more from additional threads. The worst case by data is about a 24% average slowdown for 1e7 rows, balanced k=1e2, unsorted; note that by the median it is still 4% faster to use 100%. The best case by data is an 8% average speed-up for 1e8 rows, balanced k=1e2, unsorted. The worst case for a single question is a 3.5x slowdown (0.413s vs 0.118s) on 1e7 rows, k=1e2, unsorted data, "sum v1 by id1:id2". The best case for a single question is a 28% speed-up (15.22s vs 21.104s) on 1e9 rows, k=1e2, 5% NAs, unsorted data, "sum v1 mean v3 by id3".
It seems that only the smallest size is really sensitive to the number of threads. A 3.5x slowdown on a single question may look serious, but remember that this is not an average: its total timing is well under a second, so it is highly sensitive to any noise, even small background processes running on the machine. The next-worst single question was only a 26% slowdown, so we can consider the 3.5x case an outlier.
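For readers translating the ratio column into percentages, a small sketch of the arithmetic (the `ratio_to_pct` helper is hypothetical, for illustration only, not part of the benchmark script):

```r
# th_40_20 = th_40 / th_20: elapsed time at 40 threads divided by time at 20 threads.
# Values below 1 mean 40 threads was faster; values above 1 mean it was slower.
ratio_to_pct = function(r) (r - 1) * 100  # hypothetical helper

ratio_to_pct(0.9378870)  # ~ -6.2: roughly 6% faster with 40 threads (1e8 rows, mean)
ratio_to_pct(3.5)        # 250: the 0.413s vs 0.118s query took 3.5x as long, i.e. 250% slower
```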
We can see that on the smallest data size (0.5GB), using 100% vs 50% has a marginal impact on performance, while on 5GB data the speed-up is clearly visible. Pre-sorted data benefits less from more threads. The worst case by data is just a 5% average slowdown for 1e7 rows, pre-sorted. The best case by data is an 11% average speed-up for 1e8 rows, unsorted. The worst case for a single question is a 15% slowdown on 1e7 rows, 5% NAs, unsorted data, "medium inner on factor". The best case for a single question is a 15% speed-up on 1e8 rows, unsorted data, "medium outer on int".
Join is less sensitive to the number of threads, but it is also less parallelized as of now. Once bmerge in data.table is well parallelized, this join analysis should be repeated.
The overall conclusion seems to be that the difference between 50% (the default) and 100% is not that significant: users can safely stay on the default and will not lose much. If users are after performance and are working with larger data, then setting more threads is worthwhile. This of course assumes the process does not run in a shared environment. In a shared environment, or on a desktop computer where the user performs other activities, I would advise keeping the default 50%. fyi @mattdowle
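For context, the two configurations compared here map to data.table's setDTthreads(); a minimal sketch of how a user would switch between them:

```r
library(data.table)

# Since v1.12.2, data.table defaults to using 50% of logical CPUs.
getDTthreads(verbose = TRUE)  # report the current thread configuration

# The "100%" configuration benchmarked above:
setDTthreads(percent = 100)

# Back to the recommended default, e.g. on a shared machine or a desktop:
setDTthreads(percent = 50)
```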
And this is what I have assumed :). Thank you for running these tests!
Analyze impact of using 50% of cpus for data.table (default) vs. currently set to use 100%