h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Julia queries should not include compilation time #69

Closed. jangorecki closed this issue 5 years ago

jangorecki commented 5 years ago

As a follow-up to the comments in #12, I'm putting this into its own issue. @nalimilan

Cheating would be looking at the data. We can use a dummy 0-row dataframe with the exact same structure, which is metadata, not the data itself, and run the queries on that to compile the commands. Will that work? IMO this would not be cheating, because we know upfront what questions we want to ask of the data, assuming the task is not meant to reflect an interactive data-query scenario but rather a predefined processing workflow scenario @mattdowle.

Yes, that would work. Anyway, the data shouldn't make any difference for Julia, only the column types do. So you could do dummydf = df[1:1, :] and run the queries on that first. But even running an arbitrary query would already eliminate a large part of the overhead, which is shared across operations.
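For illustration only, a minimal sketch of this warm-up idea, assuming the benchmark's id1/v1 groupby columns and the current DataFrames.jl groupby/combine API (not the actual benchmark script):

```julia
using DataFrames

# Stand-in for the benchmark table; in the real script df comes from the CSV reader.
df = DataFrame(id1 = ["id1", "id2", "id1"], v1 = [1.0, 2.0, 3.0])

dummydf = df[1:1, :]                          # 1-row slice: same schema, negligible data
combine(groupby(dummydf, :id1), :v1 => sum)   # first call triggers compilation

# A subsequent timed run on the full data no longer pays the compilation cost.
@time combine(groupby(df, :id1), :v1 => sum)
```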

jangorecki commented 5 years ago

@nalimilan assuming we don't want to touch the data - so instead of df[1:1, :], which already subsets the data, we have to create a 0-1 row dataframe "by hand". Which is fine, unless some column might end up with a different type depending on categoricals=0.05 in the CSV reader. What is the proper way to construct such a dummy dataframe just from the column types of df?

nalimilan commented 5 years ago

You can do df2 = similar(df, 0) to get an empty data frame with the same column types as df but zero rows. Then you can add rows with arbitrary data with e.g. push!(df2, (1, "a")). You can also do similar(df, 1) to get an uninitialized row, which you can then fill using the normal indexing syntax (numeric columns will be filled with uninitialized values, but other columns are undefined and need to be set manually).

Overall, taking the first row is simpler. Of course it uses real data, but it cannot possibly help an implementation to be faster on the next run.
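A small self-contained sketch of both suggestions, using placeholder column names and values rather than the benchmark schema:

```julia
using DataFrames

df = DataFrame(id = [1, 2, 3], val = ["a", "b", "c"])   # stand-in for the real table

# Zero-row frame: same column types as df, no data at all.
df2 = similar(df, 0)
push!(df2, (1, "a"))          # add one arbitrary row matching the column types

# One uninitialized row, filled by hand via normal indexing.
df3 = similar(df, 1)
df3[1, :id] = 1
df3[1, :val] = "a"
```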

st-pasha commented 5 years ago

I'm not sure I understand the "fairness" logic here. I feel it would be useful to be very specific and very explicit in this regard.

As for precompiling, aren't all packages doing it internally anyway? data.table has 1500+ lines of code converting R i-j-by expressions into a representation suitable for calling the internal C code. SQL engines have sophisticated query analyzers/optimizers. This is all compiling. Different solutions use different strategies for it. Some can compile once and then reuse the results in subsequent queries, others cannot. Some use high-level languages during compilation, others more optimized low-level languages. Some use internal data stats to decide how to run the query optimally; others use the types of the data that participate in the query. There are solutions that can adjust and switch strategies on the fly.

Obviously, the solutions also differ in how exactly they perform compilation. A Java-based library may create new classes on the fly or just rely on JIT compiling; a C/C++ solution may compile into an internal structure describing what needs to be done; an LLVM-based solution can compile machine-level code, though that usually takes much longer; Python can compile Python code into internal bytecode; etc., etc.

So, what exactly are the guidelines to ensure that all solutions are tested on equal grounds?

mattdowle commented 5 years ago

I added some thoughts here: https://github.com/h2oai/db-benchmark/issues/60#issuecomment-456198103.

jangorecki commented 5 years ago

Great to see so much feedback here.

@nalimilan any idea how Spark handles this problem? We might have to add grouping on a dummy/empty dataset for all solutions to be fair.

So, what exactly are the guidelines to ensure that all solutions are tested on equal grounds?

That is obviously not easy to define. I would say: using the API of the language and the solution, while not looking at the data but knowing the schema and the questions that are going to be asked. As mentioned below, to simulate a "data processing workflow". Using the API is a tricky thing, because for example in R you can write a C function from R and use it in the same session. I would say that the code has to be written in the most common way for a solution to achieve the results.

Quoting @mattdowle from the issue linked above:

What we're trying to mimic and represent fairly is the user experience. In my own experience I rarely needed to compute the same answer many times

This is a fundamental question. Do we want to cover a user's interactive data queries, or a data processing workflow? In the first case the questions we are going to ask of the data are unknown; in the latter only the data are unknown. I argue for focusing on the latter case. I believe that the total time spent on "data processing workflows" around the world is overwhelmingly larger than the total time spent on "user's interactive data queries", simply because workflows run regularly. Aside from that, "user's behavior" cannot be precisely defined, while a predefined workflow is unambiguous.

Especially in production there was a task that ran once...

This fits more into "a data processing workflow".

Just to say as well that we do want to do end-to-end benchmarks too

That seems to be a better place to include the cost of compilation, then.

mattdowle commented 5 years ago

To follow up and clarify some things I've been quoted on ...

There is a wide range of use-cases. There is a wide range of users. Many users have a wide range of use-cases simultaneously.

On db-bench we're not trying to pick. We shouldn't try to focus on one. The overriding goal is to be transparent and helpful to all the users with all the use-cases so they can decide for themselves. That's why we include the 1st and the 2nd run and present both. I've said in the past that the 1st run was most important to me most often, but that was to justify why the 1st run is included and reported. I was not saying that because that's what I think we should focus on, although I see how it may have come across that way. Other users will look at the 2nd run time and know that's more applicable to their use-case. [Aside: it's kind of assumed that runs 3-100 are very close to run 2 and that run 1 is the outlier. We want to build in a check on that, say by performing run 3 and maybe run 4 too. But total runtime is already 37 hours, so that's going to be interesting from a resource point of view.]

There were situations where I did a setkey() first (an up-front cost) to get a speed benefit later, when I was going to use the same key many times in different queries. In my view the goal of db-bench is to benchmark those trade-offs transparently so users can decide for themselves for their use-case. The user needs to know how long a setkey() takes (and the equivalent for each of the products) and whether it is worth it. This issue about Julia compilation seems similar to that.

If the advice to Julia users is to run dummydf=df[1:1, :] and run queries on that first as a way to trigger one-time compilation, then whatever those queries are could be placed in a function JuliaDF.compile() and chalked up to Julia startup time. Users could just put that in their startup profile script. We don't currently display how long Python or R takes to start, but if we do (and it is important in some use-cases where many processes are started) then the time for JuliaDF.compile() could be included there.
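JuliaDF.compile() is hypothetical and does not exist; as a rough sketch of the startup-profile idea, such a warm-up could live in Julia's ~/.julia/config/startup.jl and group a tiny synthetic frame, assuming the real data isn't needed:

```julia
# ~/.julia/config/startup.jl (sketch only; the warm-up query is a placeholder)
using DataFrames

let toy = DataFrame(id = [1], v = [1.0])
    combine(groupby(toy, :id), :v => sum)   # pay the grouping compilation cost at startup
end
```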

But note that the culprit being Julia compilation time was only a quick theory (from the Julia team), and it likely has nothing to do with compilation time after all. These are exactly the sort of inferences that db-bench is there to tease out transparently, and why it's so important that it reruns regularly using the latest dev versions.

st-pasha commented 5 years ago

I agree with most of the points that Matt is making.

Except for the dummydf=df[1:1, :] line in the Julia startup script. I don't think it's feasible to have the data available at that point. Even in a production environment there are usually multiple scripts operating on different data streams, so it wouldn't make sense to specialize the startup script in such a way.

This situation, btw, is very similar to the task of reading a CSV file. A flexible reader is expected to learn the schema from the file itself; a more limited reader will require the user to specify all types correctly upfront. Clearly the more flexible reader would be at a disadvantage if we gave the type information (metadata) to the limited reader without any penalty.
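As a hedged illustration of the two modes using CSV.jl (the file name and column types are placeholders, not the benchmark setup):

```julia
using CSV, DataFrames

# Flexible mode: the reader infers the schema from the file itself.
df_inferred = CSV.File("data.csv") |> DataFrame

# Restricted mode: the user hands the reader the schema (metadata) upfront.
df_typed = CSV.File("data.csv"; types = Dict(:id1 => String, :v1 => Float64)) |> DataFrame
```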


Reply from Matt as edit: I didn't realize that's what they were suggesting by dummydf=df[1:1, :], but I see now it might have been. I was talking about a general JuliaDF.compile() that didn't depend on the data. It seemed as if they had to have some dummy data as a workaround to trigger the compile; I didn't think that dummy data needed to come from the actual data that would be loaded later in the benchmark. In that case, yes, I agree with you.

nalimilan commented 5 years ago

@nalimilan any idea how Spark handles this problem? We might have to add grouping on a dummy/empty dataset for all solutions to be fair.

@jangorecki Sorry, I have no idea how Spark works. But indeed the same kind of optimization techniques should probably be applied everywhere or nowhere.

If the advice to Julia users is to run dummydf=df[1:1, :] and run queries on that first as a way to trigger one-time compilation, then whatever those queries are could be placed in a function JuliaDF.compile() and chalked up to Julia startup time. Users could just put that in their startup profile script. We don't currently display how long Python or R takes to start, but if we do (and it is important in some use-cases where many processes are started) then the time for JuliaDF.compile() could be included there.

@mattdowle I'm not sure how useful it would be to pay the price of compilation on load rather than when actually running the function. What we're looking forward to is Julia being able to save the compiled code so that the cost is paid only when installing the package. But if it needs to be paid again in each session, compiling all functions for all possible column types would be counter-productive since the user will typically only use a subset of them.

So I don't have a strong opinion as to whether compilation times should be included or not. Once Julia supports storing compiled code, most of it will go away in the benchmarks without changing anything. Until then (it will probably take some time), the question is whether the benchmark is supposed to cover sessions in which only one grouping operation is run or longer sessions where multiple operations are typically performed.

But note that the culprit being Julia compilation time was only a quick theory (from the Julia team), and it likely has nothing to do with compilation time after all. These are exactly the sort of inferences that db-bench is there to tease out transparently, and why it's so important that it reruns regularly using the latest dev versions.

@mattdowle Actually after some investigation I can confirm that compilation gets slower after loading a large data set (https://github.com/JuliaLang/julia/issues/30800). That's probably due to the garbage collector, and that should be improved by pooling strings, which we want to do anyway but isn't supported by CSV.jl yet. I'll report when we've made some progress. But that's orthogonal to the question of whether it's appropriate to include compilation times.

Except for the dummydf=df[1:1, :] line in the Julia startup script. I don't think it's feasible to have the data available at that point. Even in a production environment there are usually multiple scripts operating on different data streams, so it wouldn't make sense to specialize the startup script in such a way.

@st-pasha As noted above, we don't actually need access to the data; df[1:1, :] is just a simple way to create a toy dataset with the same schema as the actual one.

mattdowle commented 5 years ago

with the same schema as the actual one

I hadn't grasped that part before. Then it's something the user experiences, needs to know about, and has to run after the data is loaded, so it shouldn't be excluded from db-bench. There could maybe be an extra manual warm-up task 0 reported separately, but only Julia would need that.

@nalimilan If there's a way to compile all types up front in each session, before knowing anything about the data, then I'd say it's totally fair to do that (that's what I had in mind with JuliaDF.compile()). Unless that really takes a long time or a lot of memory, but I can't imagine that being the case.

But if it needs to be paid again in each session, compiling all functions for all possible column types would be counter-productive since the user will typically only use a subset of them.

I don't see why that's a problem. Unless it takes a long time or a lot of memory.

the question is whether the benchmark is supposed to cover sessions in which only one grouping operation is run or longer sessions where multiple operations are typically performed.

More the latter: multiple operations. That's why the main db-bench barplot is not just 5 separate grouping tasks but also includes the total of the 10 runs (2 consecutive runs of each of the 5), reported at the top in the legend. The legend is sorted by total time. The idea is to roughly mimic what a user might do in practice: a few different groupings in a single session. So it works out fairly for Julia: the total time for Julia includes the compilation time only once, in the first run of the first task. The other 9 runs don't include that time. Even if we split out the up-front compilation time into a new task 0, that wouldn't change the total time in the legend (unless it was a general-purpose JuliaDF.compile() that didn't need to see the data first, in which case it could be taken out).

nalimilan commented 5 years ago

OK, let me discuss this with others. We could certainly provide a function or script which would trigger compilation of functions for all common types. It would take a relatively long time if done thoroughly, so I wouldn't want to run it by default, but reporting the time it takes separately would definitely make sense.
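One possible shape for such a script, sketched under the assumption that grouping a tiny synthetic frame per key type is enough to trigger the relevant compilation; the set of "common types" below is purely illustrative:

```julia
using DataFrames, Statistics

# Warm up grouping/aggregation for a few common grouping-key types,
# independently of any particular dataset.
function warmup_common_types()
    for key in (1, Int32(1), 1.0, "a")          # Int64, Int32, Float64, String keys
        toy = DataFrame(k = [key], v = [1.0])
        combine(groupby(toy, :k), :v => sum, :v => mean)
    end
    nothing
end

warmup_common_types()
```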

jangorecki commented 5 years ago

cuDF and ClickHouse have recently been added to the benchmark. They suffer from the issue discussed here even more badly than juliadf does. IMO we should not skip compilation time, but we should still allow Julia/others to optimise it internally. So basically what Matt suggested:

If there's a way to compile all types up front in each session, before knowing anything about the data, then I'd say it's totally fair

Closing this issue for now, as there is nothing we can do about it here. If such a feature gets added to Julia DF, please let us know, or file a PR directly to activate it.