Timeout has to be increased much further. Comparing data.table and Spark on groupby (1e7, k=100) vs. join, the latter seems to be taking 2-8x longer. This is likely caused by data loading: groupby requires loading 45 GB once, while join requires loading 55 GB twice. At those sizes groupby can still be computed in memory, but join needs on-disk data storage, which adds even more to the computation time. To reduce the total time the benchmark spends on the join task, we can make the timeout parameter granular by data size. So 1e7 could have 30 minutes, 1e8 could have 2 h (both should fit into memory), and 1e9 8 h (on disk). Then at least we won't wait 8 h for some slow solution trying to solve the 1e7 size.
Originally posted by @jangorecki in https://github.com/h2oai/db-benchmark/issues/126#issuecomment-561716389
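A minimal sketch of the per-size timeout idea, assuming a hypothetical lookup table and runner; the names (`TIMEOUT_S`, `run_task`) and the use of Python are illustrative assumptions, not the benchmark's actual implementation:

```python
import subprocess

# Hypothetical per-size timeouts (seconds), following the values proposed above:
# 1e7 -> 30 min, 1e8 -> 2 h (both in-memory), 1e9 -> 8 h (on disk).
TIMEOUT_S = {
    "1e7": 30 * 60,
    "1e8": 2 * 3600,
    "1e9": 8 * 3600,
}

def run_task(cmd, size):
    """Run a benchmark task with a timeout chosen by data size.

    Returns the CompletedProcess, or None if the task timed out.
    """
    try:
        return subprocess.run(cmd, timeout=TIMEOUT_S[size], check=False)
    except subprocess.TimeoutExpired:
        return None
```

With this, a slow solution on the 1e7 size is cut off after 30 minutes rather than occupying the full 8-hour budget reserved for the 1e9 on-disk case.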