h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0
322 stars 85 forks

Julia: fix missing ( and ) bug #147

Closed bkamins closed 4 years ago

bkamins commented 4 years ago

Do you want to reintroduce skipmissing now, or should it be done when you start allowing NAs in the source data frames?

jangorecki commented 4 years ago

Thank you for the fix. I will add skipmissing when the data cases with NAs are ready to use. The status of that is tracked in #40

bkamins commented 4 years ago

Great - thank you for doing these benchmarks. They push the whole data science ecosystem forward greatly.

And we know that we now have work to do on joins 😄. In general, we were in the process of stabilizing the API for DataFrames.jl, and starting with release 0.21 the effort will shift towards performance in particular.

Here I have a small question (I have not checked other frameworks) - are all your tests single-threaded, or do you allow multi-threading, and if so, how many cores are made available?

jangorecki commented 4 years ago

Multithreading is allowed, and multi-GPU as well (though GPU is still not used, blocked by missing documentation and/or support in cudf #116). Everything a single-node machine offers is available; there are 40 cores. At the bottom of the report there is an "Environment configuration" section that explains it in more detail.

nalimilan commented 3 years ago

@jangorecki As we're working towards implementing some multithreading, could you help us define an acceptable strategy to choose the number of threads to use with DataFrames.jl? What do other implementations do: always use all available threads, or choose the optimal number depending on the operation? Would it be OK to specify explicitly the number of threads to use for each operation, or do you want a single global setting? Thanks!

jangorecki commented 3 years ago

@nalimilan Sure. In general it is risky to use all available threads by default because users might be working in a shared environment. R data.table by default uses 50% of logical threads. There have been multiple reports from users that using 50% rather than 100% yields faster performance. Spark by default uses all cores. Here in db-benchmark we set data.table to use 100% of logical threads (a setDTthreads(0L) call at the start of the script) because it is faster. The difference between 50% and 100% is not that big, far from twice as fast. When the interface is ready in Julia, I can check whether it is faster to use all threads or not. When benchmarks run on the db-benchmark production machine there are generally no other processes, so nothing else should be fighting for CPU power.
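The thread-count policies described above (data.table's 50% default vs. the benchmark's "use everything" override via setDTthreads(0L)) can be sketched language-agnostically. This is an illustrative Python sketch, not code from the benchmark; the function name is hypothetical:

```python
import os

def pick_threads(fraction, logical=None):
    """Pick a worker-thread count as a fraction of logical CPUs.

    Illustrates data.table's policy: its default is 50% of logical
    threads, while fraction=1.0 corresponds to what setDTthreads(0L)
    does in R, i.e. use all logical threads, as db-benchmark does.
    """
    if logical is None:
        logical = os.cpu_count() or 1
    # Never return fewer than one thread, even for tiny fractions.
    return max(1, int(logical * fraction))

# On the 40-core benchmark machine:
print(pick_threads(0.5, logical=40))  # data.table default: 20
print(pick_threads(1.0, logical=40))  # db-benchmark setting: 40
```

Both settings are easy to compare empirically, which is exactly the experiment described above: run the same query suite at 50% and 100% and keep the faster configuration.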

bkamins commented 3 years ago

In general it is risky to use all available threads by default because users might be working in a shared environment.

This is exactly our understanding as well, and that is why @nalimilan asked the question. My personal experience is also that sometimes an operation is not CPU-bound but, e.g., memory-access-bound, and then adding more threads is counter-productive even on a machine that does not have a "noisy neighbor" problem.

In Julia, for now, we plan to allow the user to specify explicitly how many threads should be used "per operation", with no threading as the default (so using multiple threads is opt-in).

When the interface is ready in Julia, I can check whether it is faster to use all threads or not.

Thank you. So I understand that your approach here is to mimic a "production usage" scenario, where an administrator is allowed to tune the number of threads used (this is in line with the design we chose in DataFrames.jl, where the user is asked to explicitly specify the number of threads that should be used).

Again - thank you for such a great commitment to maintaining these benchmarks. It is a great thing to have for a project like DataFrames.jl, which has no resources of its own and is based solely on volunteer work.

nalimilan commented 3 years ago

Thanks. Indeed, in general adding many threads doesn't give a big speedup. Hopefully it shouldn't be slower (except in pathological cases where there are only 1-5 rows per group, but the benchmarks don't do that currently), so for benchmarking purposes it could be workable to just use as many threads as possible. For real uses, it matters to choose a tradeoff between speed and CPU consumption.

In practical terms, Julia is probably a bit different from, e.g., R, in that the user chooses the maximum number of threads at startup. So that's a first way of affecting the number of threads DataFrames.jl will use, and here we could set it to 40 if that's indeed the number of cores (I think some IDEs do that automatically), or 20 (that shouldn't make a big difference). Then, would it be OK to pass nthreads=40 (or equivalently nthreads=typemax(Int)) to all combine operations? Or would you prefer us to have a global setting instead?

jangorecki commented 3 years ago

@bkamins I think you are correct about memory-bound operations. Opt-in sounds good, especially when introducing a new feature. As for tuning CPU threads: setting "setDTthreads" differently before each question looks a little too heavy-handed. Ideally this option would be set once per script. Maybe tuning per data size will be sufficient, so 1e7, 1e8, and 1e9 can have different numbers of threads. We already do branching like this for memory storage (in-memory / on-disk) for pydatatable and spark based on data size (on-disk for the 1e9 join).
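The per-data-size branching described above can be sketched as a small lookup. This is an illustrative Python sketch only: the thread counts and thresholds are hypothetical placeholders, not the values db-benchmark actually uses (the storage branching for the 1e9 case is the one mentioned above):

```python
def tuning_for_size(n_rows):
    """Choose benchmark settings per data size.

    db-benchmark already branches like this for memory storage
    (in-memory vs. on-disk for pydatatable/spark at 1e9 rows);
    the thread counts here are purely illustrative.
    """
    if n_rows >= 1_000_000_000:    # 1e9: largest data case
        return {"threads": 40, "storage": "on-disk"}
    elif n_rows >= 100_000_000:    # 1e8
        return {"threads": 40, "storage": "in-memory"}
    else:                          # 1e7 and smaller
        return {"threads": 20, "storage": "in-memory"}

print(tuning_for_size(10**9))   # {'threads': 40, 'storage': 'on-disk'}
print(tuning_for_size(10**7))   # {'threads': 20, 'storage': 'in-memory'}
```

Set once at the top of each per-size script, this keeps the per-question code free of tuning noise, which is the point being made above.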

@nalimilan having combine(..., nthreads=typemax(Int)) for all questions in a script is perfectly fine as it basically do what the global option would do. Yet I think global option would be more user friendly. We do also put syntax on the benchmark plot, so having that "technical" parameters as a part of query will look less friendly. Similarly as .compute() calls at the end of each query in dask. I think the best is to provide both interfaces, a global option (default 1) and extra argument to combine if global option needs to be overwritten. R use single thread only, R packages brings multi-threading to R. In case of data.table number of threads can be changed at any time. The thing that you may find interesting is the mechanism we introduced to fallback to single threaded processing if size of input is small enough https://github.com/Rdatatable/data.table/pull/4484. As for db-benchmark queries for many groups with few rows, we have N=1e9, K=2 grouping data case G1_1e9_2e0_0_0, where q3 and q5 yield into 4e8 groups of ~2 rows. Also q10 for all data cases always result single row groups.