h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Julia: make CSV.jl read with default settings #194

Closed bkamins closed 3 years ago

bkamins commented 3 years ago

After the 1.0 release of DataFrames.jl we have cleaned up the API and the integration with CSV.jl, so it should be fine to just use the default settings of the reader.

bkamins commented 3 years ago

@jangorecki - I would like to see the next benchmark results with the default settings of DataFrames.jl and CSV.jl after the 1.0 release of DataFrames.jl. This would help us with tuning things if there is a need. Are you OK with this? (The API also got cleaned up a bit, as you can see in the changed code.)

CC @nalimilan @quinnj

bkamins commented 3 years ago

I added some changes to how GC.gc is triggered, following the recommendation from the Julia devs on how to set GC to its normal state before running the benchmark.

jangorecki commented 3 years ago

What is the point of 6ba25aa? The "ver" scripts are only used for dumping the package version, so the number of threads does not make any difference. BTW, data.table's default is 50% of cores, but in each script we set it to use 100% (setDTthreads(0L)).

I am running 1.0.1 now, once it is completed I will merge this PR and run again on 1.0.1.

bkamins commented 3 years ago

Ah - so this should be fixed. I will revert.

In what file/line do you start Julia process that does the actual computation? (I guess now it is single threaded - maybe we can leave it for the current benchmarks - to see single threaded performance and then in the next round enable threading to see the impact - this would be very useful for us)

bkamins commented 3 years ago

OK - fixed.

jangorecki commented 3 years ago

> In what file/line do you start Julia process that does the actual computation?

https://github.com/h2oai/db-benchmark/blob/ff6975310f8f818462f7357551079bf0d8f9fc51/_launcher/solution.R#L162 It is run with a ./script.jl kind of command, so the shebang header of the benchmark script file (#!/usr/bin/env julia) hands the script to the julia process.

> then in the next round enable threading to see the impact

so ideally will be to have this in a separate PR then

bkamins commented 3 years ago

> so ideally will be to have this in a separate PR then

Yes. I will open a separate PR later next month. Thank you!

jangorecki commented 3 years ago

I posted timings of 1.0.1 vs 0.22.7 in https://github.com/h2oai/db-benchmark/issues/195#issuecomment-827398822 and there is a big speedup in the majority of cases.

bkamins commented 3 years ago

Thank you for posting this. When is the next re-run of the benchmarks planned? (In the meantime we have found some cases that we are currently fixing, as 1.0 was a major rewrite.)

bkamins commented 3 years ago

I have an additional question: https://h2oai.github.io/db-benchmark/history.html does not seem to be updated. Do I see it correctly?

jangorecki commented 3 years ago

The run including this PR should be finished in about 2 hours.

The history report looks fine to me; try another browser or clear the browser cache. It has happened to me multiple times that the browser was showing cached images.

bkamins commented 3 years ago

Ah - thank you!

bkamins commented 3 years ago

@quinnj - the benchmarks after merging this PR are out. Actually we get a significant regression and for groupby tests 50GB tests run out of memory (in the previous settings they worked). We need to investigate the reason for the regression.

@jangorecki - thank you very much for providing such tests.

nalimilan commented 3 years ago

> @quinnj - the benchmarks after merging this PR are out. Actually we get a significant regression and for groupby tests 50GB tests run out of memory (in the previous settings they worked). We need to investigate the reason for the regression.

Maybe we should reintroduce the types argument? As noted above I think we added it because it uses less memory.

jangorecki commented 3 years ago

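Timings diff for this PR (in seconds; new2old is the 20210427 timing divided by the 20210426 timing, so values above 1 mean slower). It looks to be much slower.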

groupby

|in_rows |knasorted                                       |question_group |question                    | 20210426_78028b8| 20210427_78028b8|    new2old|
|:-------|:-----------------------------------------------|:--------------|:---------------------------|----------------:|----------------:|----------:|
|1e7     |1e2 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 by id1               |            0.304|            0.696|  2.2894737|
|1e7     |1e2 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 by id1:id2           |            0.084|            1.117| 13.2976190|
|1e7     |1e2 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 mean v3 by id3       |            0.456|            1.428|  3.1315789|
|1e7     |1e2 cardinality factor, 0% NAs, unsorted data   |basic          |mean v1:v3 by id4           |            0.185|            0.182|  0.9837838|
|1e7     |1e2 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1:v3 by id6            |            0.346|            0.348|  1.0057803|
|1e7     |1e2 cardinality factor, 0% NAs, unsorted data   |advanced       |median v3 sd v3 by id4 id5  |            1.861|            2.011|  1.0806018|
|1e7     |1e2 cardinality factor, 0% NAs, unsorted data   |advanced       |max v1 - min v2 by id3      |            1.400|            2.359|  1.6850000|
|1e7     |1e2 cardinality factor, 0% NAs, unsorted data   |advanced       |largest two v3 by id6       |            2.096|            2.410|  1.1498092|
|1e7     |1e2 cardinality factor, 0% NAs, unsorted data   |advanced       |regression v1 v2 by id2 id4 |            1.266|            2.046|  1.6161137|
|1e7     |1e2 cardinality factor, 0% NAs, unsorted data   |advanced       |sum v3 count by id1:id6     |            2.484|            2.378|  0.9573269|
|1e7     |1e1 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 by id1               |            0.322|            0.314|  0.9751553|
|1e7     |1e1 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 by id1:id2           |            0.073|            0.074|  1.0136986|
|1e7     |1e1 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 mean v3 by id3       |            0.610|            1.858|  3.0459016|
|1e7     |1e1 cardinality factor, 0% NAs, unsorted data   |basic          |mean v1:v3 by id4           |            0.206|            0.188|  0.9126214|
|1e7     |1e1 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1:v3 by id6            |            0.627|            0.577|  0.9202552|
|1e7     |1e1 cardinality factor, 0% NAs, unsorted data   |advanced       |median v3 sd v3 by id4 id5  |            1.390|            1.430|  1.0287770|
|1e7     |1e1 cardinality factor, 0% NAs, unsorted data   |advanced       |max v1 - min v2 by id3      |            2.263|            3.534|  1.5616438|
|1e7     |1e1 cardinality factor, 0% NAs, unsorted data   |advanced       |largest two v3 by id6       |            4.423|            4.389|  0.9923129|
|1e7     |1e1 cardinality factor, 0% NAs, unsorted data   |advanced       |regression v1 v2 by id2 id4 |            0.930|            0.912|  0.9806452|
|1e7     |1e1 cardinality factor, 0% NAs, unsorted data   |advanced       |sum v3 count by id1:id6     |            2.485|            2.206|  0.8877264|
|1e7     |2e0 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 by id1               |            0.305|            0.299|  0.9803279|
|1e7     |2e0 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 by id1:id2           |            0.072|            0.070|  0.9722222|
|1e7     |2e0 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 mean v3 by id3       |            0.755|            1.879|  2.4887417|
|1e7     |2e0 cardinality factor, 0% NAs, unsorted data   |basic          |mean v1:v3 by id4           |            0.191|            0.188|  0.9842932|
|1e7     |2e0 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1:v3 by id6            |            1.142|            1.140|  0.9982487|
|1e7     |2e0 cardinality factor, 0% NAs, unsorted data   |advanced       |median v3 sd v3 by id4 id5  |            1.126|            1.138|  1.0106572|
|1e7     |2e0 cardinality factor, 0% NAs, unsorted data   |advanced       |max v1 - min v2 by id3      |            3.156|            4.243|  1.3444233|
|1e7     |2e0 cardinality factor, 0% NAs, unsorted data   |advanced       |largest two v3 by id6       |            8.847|            8.438|  0.9537696|
|1e7     |2e0 cardinality factor, 0% NAs, unsorted data   |advanced       |regression v1 v2 by id2 id4 |            0.809|            0.819|  1.0123609|
|1e7     |2e0 cardinality factor, 0% NAs, unsorted data   |advanced       |sum v3 count by id1:id6     |            2.479|            2.289|  0.9233562|
|1e7     |1e2 cardinality factor, 0% NAs, pre-sorted data |basic          |sum v1 by id1               |            0.328|            0.318|  0.9695122|
|1e7     |1e2 cardinality factor, 0% NAs, pre-sorted data |basic          |sum v1 by id1:id2           |            0.090|            0.090|  1.0000000|
|1e7     |1e2 cardinality factor, 0% NAs, pre-sorted data |basic          |sum v1 mean v3 by id3       |            0.457|            1.156|  2.5295405|
|1e7     |1e2 cardinality factor, 0% NAs, pre-sorted data |basic          |mean v1:v3 by id4           |            0.182|            0.181|  0.9945055|
|1e7     |1e2 cardinality factor, 0% NAs, pre-sorted data |basic          |sum v1:v3 by id6            |            0.383|            0.365|  0.9530026|
|1e7     |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced       |median v3 sd v3 by id4 id5  |            1.898|            1.883|  0.9920969|
|1e7     |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced       |max v1 - min v2 by id3      |            1.334|            2.024|  1.5172414|
|1e7     |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced       |largest two v3 by id6       |            1.997|            2.087|  1.0450676|
|1e7     |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced       |regression v1 v2 by id2 id4 |            0.857|            0.895|  1.0443407|
|1e7     |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced       |sum v3 count by id1:id6     |            2.460|            2.176|  0.8845528|
|1e7     |1e2 cardinality factor, 5% NAs, unsorted data   |basic          |sum v1 by id1               |            0.343|            0.981|  2.8600583|
|1e7     |1e2 cardinality factor, 5% NAs, unsorted data   |basic          |sum v1 by id1:id2           |            0.084|            1.348| 16.0476190|
|1e7     |1e2 cardinality factor, 5% NAs, unsorted data   |basic          |sum v1 mean v3 by id3       |            0.486|            1.544|  3.1769547|
|1e7     |1e2 cardinality factor, 5% NAs, unsorted data   |basic          |mean v1:v3 by id4           |            0.325|            1.016|  3.1261538|
|1e7     |1e2 cardinality factor, 5% NAs, unsorted data   |basic          |sum v1:v3 by id6            |            0.400|            0.390|  0.9750000|
|1e7     |1e2 cardinality factor, 5% NAs, unsorted data   |advanced       |median v3 sd v3 by id4 id5  |            2.042|            2.389|  1.1699314|
|1e7     |1e2 cardinality factor, 5% NAs, unsorted data   |advanced       |max v1 - min v2 by id3      |            1.562|            2.441|  1.5627401|
|1e7     |1e2 cardinality factor, 5% NAs, unsorted data   |advanced       |largest two v3 by id6       |            2.338|            2.709|  1.1586826|
|1e7     |1e2 cardinality factor, 5% NAs, unsorted data   |advanced       |regression v1 v2 by id2 id4 |            2.575|            2.875|  1.1165049|
|1e7     |1e2 cardinality factor, 5% NAs, unsorted data   |advanced       |sum v3 count by id1:id6     |            2.776|            2.910|  1.0482709|
|1e8     |1e2 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 by id1               |            0.942|            5.113|  5.4278132|
|1e8     |1e2 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 by id1:id2           |            0.771|           11.613| 15.0622568|
|1e8     |1e2 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 mean v3 by id3       |            3.755|           20.063|  5.3430093|
|1e8     |1e2 cardinality factor, 0% NAs, unsorted data   |basic          |mean v1:v3 by id4           |            1.301|            1.613|  1.2398155|
|1e8     |1e2 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1:v3 by id6            |            3.928|            4.024|  1.0244399|
|1e8     |1e2 cardinality factor, 0% NAs, unsorted data   |advanced       |median v3 sd v3 by id4 id5  |           13.667|           23.466|  1.7169825|
|1e8     |1e2 cardinality factor, 0% NAs, unsorted data   |advanced       |max v1 - min v2 by id3      |           16.306|           58.582|  3.5926653|
|1e8     |1e2 cardinality factor, 0% NAs, unsorted data   |advanced       |largest two v3 by id6       |           25.756|           98.894|  3.8396490|
|1e8     |1e2 cardinality factor, 0% NAs, unsorted data   |advanced       |regression v1 v2 by id2 id4 |           10.469|           37.493|  3.5813354|
|1e8     |1e2 cardinality factor, 0% NAs, unsorted data   |advanced       |sum v3 count by id1:id6     |           18.766|           67.441|  3.5937866|
|1e8     |1e1 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 by id1               |            1.000|            0.820|  0.8200000|
|1e8     |1e1 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 by id1:id2           |            0.710|            0.767|  1.0802817|
|1e8     |1e1 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 mean v3 by id3       |            9.344|           27.268|  2.9182363|
|1e8     |1e1 cardinality factor, 0% NAs, unsorted data   |basic          |mean v1:v3 by id4           |            1.319|            1.604|  1.2160728|
|1e8     |1e1 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1:v3 by id6            |           12.834|           15.836|  1.2339099|
|1e8     |1e1 cardinality factor, 0% NAs, unsorted data   |advanced       |median v3 sd v3 by id4 id5  |            9.662|           11.922|  1.2339060|
|1e8     |1e1 cardinality factor, 0% NAs, unsorted data   |advanced       |max v1 - min v2 by id3      |           25.419|           51.539|  2.0275778|
|1e8     |1e1 cardinality factor, 0% NAs, unsorted data   |advanced       |largest two v3 by id6       |           56.707|           85.863|  1.5141517|
|1e8     |1e1 cardinality factor, 0% NAs, unsorted data   |advanced       |regression v1 v2 by id2 id4 |            8.683|           13.469|  1.5511920|
|1e8     |1e1 cardinality factor, 0% NAs, unsorted data   |advanced       |sum v3 count by id1:id6     |           21.002|           32.440|  1.5446148|
|1e8     |2e0 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 by id1               |            1.156|            0.880|  0.7612457|
|1e8     |2e0 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 by id1:id2           |            0.698|            0.682|  0.9770774|
|1e8     |2e0 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 mean v3 by id3       |           12.803|           26.985|  2.1077091|
|1e8     |2e0 cardinality factor, 0% NAs, unsorted data   |basic          |mean v1:v3 by id4           |            1.678|            1.205|  0.7181168|
|1e8     |2e0 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1:v3 by id6            |           25.847|           25.933|  1.0033273|
|1e8     |2e0 cardinality factor, 0% NAs, unsorted data   |advanced       |median v3 sd v3 by id4 id5  |            9.075|            9.421|  1.0381267|
|1e8     |2e0 cardinality factor, 0% NAs, unsorted data   |advanced       |max v1 - min v2 by id3      |           40.396|           54.720|  1.3545896|
|1e8     |2e0 cardinality factor, 0% NAs, unsorted data   |advanced       |largest two v3 by id6       |          126.466|          136.319|  1.0779103|
|1e8     |2e0 cardinality factor, 0% NAs, unsorted data   |advanced       |regression v1 v2 by id2 id4 |            7.443|            7.577|  1.0180035|
|1e8     |2e0 cardinality factor, 0% NAs, unsorted data   |advanced       |sum v3 count by id1:id6     |           26.716|           31.831|  1.1914583|
|1e8     |1e2 cardinality factor, 0% NAs, pre-sorted data |basic          |sum v1 by id1               |            1.157|            1.185|  1.0242005|
|1e8     |1e2 cardinality factor, 0% NAs, pre-sorted data |basic          |sum v1 by id1:id2           |            0.886|            0.894|  1.0090293|
|1e8     |1e2 cardinality factor, 0% NAs, pre-sorted data |basic          |sum v1 mean v3 by id3       |            3.597|           17.820|  4.9541284|
|1e8     |1e2 cardinality factor, 0% NAs, pre-sorted data |basic          |mean v1:v3 by id4           |            1.198|            1.551|  1.2946578|
|1e8     |1e2 cardinality factor, 0% NAs, pre-sorted data |basic          |sum v1:v3 by id6            |            3.997|            4.065|  1.0170128|
|1e8     |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced       |median v3 sd v3 by id4 id5  |           14.362|           17.028|  1.1856287|
|1e8     |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced       |max v1 - min v2 by id3      |           16.873|           38.902|  2.3055770|
|1e8     |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced       |largest two v3 by id6       |           26.762|           44.560|  1.6650475|
|1e8     |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced       |regression v1 v2 by id2 id4 |            5.496|            8.364|  1.5218341|
|1e8     |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced       |sum v3 count by id1:id6     |           18.907|           32.453|  1.7164542|
|1e8     |1e2 cardinality factor, 5% NAs, unsorted data   |basic          |sum v1 by id1               |            1.000|            5.539|  5.5390000|
|1e8     |1e2 cardinality factor, 5% NAs, unsorted data   |basic          |sum v1 by id1:id2           |            0.809|           13.368| 16.5241038|
|1e8     |1e2 cardinality factor, 5% NAs, unsorted data   |basic          |sum v1 mean v3 by id3       |            3.727|           24.760|  6.6434129|
|1e8     |1e2 cardinality factor, 5% NAs, unsorted data   |basic          |mean v1:v3 by id4           |            1.733|            2.706|  1.5614541|
|1e8     |1e2 cardinality factor, 5% NAs, unsorted data   |basic          |sum v1:v3 by id6            |            4.255|            4.770|  1.1210341|
|1e8     |1e2 cardinality factor, 5% NAs, unsorted data   |advanced       |median v3 sd v3 by id4 id5  |           14.674|           32.064|  2.1850893|
|1e8     |1e2 cardinality factor, 5% NAs, unsorted data   |advanced       |max v1 - min v2 by id3      |           17.914|           52.269|  2.9177738|
|1e8     |1e2 cardinality factor, 5% NAs, unsorted data   |advanced       |largest two v3 by id6       |           26.448|          107.006|  4.0459014|
|1e8     |1e2 cardinality factor, 5% NAs, unsorted data   |advanced       |regression v1 v2 by id2 id4 |           15.541|           38.860|  2.5004826|
|1e8     |1e2 cardinality factor, 5% NAs, unsorted data   |advanced       |sum v3 count by id1:id6     |           19.549|           67.361|  3.4457517|
|1e9     |1e2 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 by id1               |           15.705|               NA|         NA|
|1e9     |1e2 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 by id1:id2           |            9.075|               NA|         NA|
|1e9     |1e2 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 mean v3 by id3       |           89.388|               NA|         NA|
|1e9     |1e2 cardinality factor, 0% NAs, unsorted data   |basic          |mean v1:v3 by id4           |           23.274|               NA|         NA|
|1e9     |1e2 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1:v3 by id6            |          120.389|               NA|         NA|
|1e9     |1e2 cardinality factor, 0% NAs, unsorted data   |advanced       |median v3 sd v3 by id4 id5  |          195.448|               NA|         NA|
|1e9     |1e2 cardinality factor, 0% NAs, unsorted data   |advanced       |max v1 - min v2 by id3      |          357.317|               NA|         NA|
|1e9     |1e2 cardinality factor, 0% NAs, unsorted data   |advanced       |largest two v3 by id6       |               NA|               NA|         NA|
|1e9     |1e2 cardinality factor, 0% NAs, unsorted data   |advanced       |regression v1 v2 by id2 id4 |               NA|               NA|         NA|
|1e9     |1e2 cardinality factor, 0% NAs, unsorted data   |advanced       |sum v3 count by id1:id6     |               NA|               NA|         NA|
|1e9     |1e1 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 by id1               |               NA|           11.811|         NA|
|1e9     |1e1 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 by id1:id2           |               NA|           11.783|         NA|
|1e9     |1e1 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 mean v3 by id3       |               NA|               NA|         NA|
|1e9     |1e1 cardinality factor, 0% NAs, unsorted data   |basic          |mean v1:v3 by id4           |               NA|               NA|         NA|
|1e9     |1e1 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1:v3 by id6            |               NA|               NA|         NA|
|1e9     |1e1 cardinality factor, 0% NAs, unsorted data   |advanced       |median v3 sd v3 by id4 id5  |               NA|               NA|         NA|
|1e9     |1e1 cardinality factor, 0% NAs, unsorted data   |advanced       |max v1 - min v2 by id3      |               NA|               NA|         NA|
|1e9     |1e1 cardinality factor, 0% NAs, unsorted data   |advanced       |largest two v3 by id6       |               NA|               NA|         NA|
|1e9     |1e1 cardinality factor, 0% NAs, unsorted data   |advanced       |regression v1 v2 by id2 id4 |               NA|               NA|         NA|
|1e9     |1e1 cardinality factor, 0% NAs, unsorted data   |advanced       |sum v3 count by id1:id6     |               NA|               NA|         NA|
|1e9     |2e0 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 by id1               |               NA|            9.736|         NA|
|1e9     |2e0 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 by id1:id2           |               NA|           12.841|         NA|
|1e9     |2e0 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1 mean v3 by id3       |               NA|               NA|         NA|
|1e9     |2e0 cardinality factor, 0% NAs, unsorted data   |basic          |mean v1:v3 by id4           |               NA|               NA|         NA|
|1e9     |2e0 cardinality factor, 0% NAs, unsorted data   |basic          |sum v1:v3 by id6            |               NA|               NA|         NA|
|1e9     |2e0 cardinality factor, 0% NAs, unsorted data   |advanced       |median v3 sd v3 by id4 id5  |               NA|               NA|         NA|
|1e9     |2e0 cardinality factor, 0% NAs, unsorted data   |advanced       |max v1 - min v2 by id3      |               NA|               NA|         NA|
|1e9     |2e0 cardinality factor, 0% NAs, unsorted data   |advanced       |largest two v3 by id6       |               NA|               NA|         NA|
|1e9     |2e0 cardinality factor, 0% NAs, unsorted data   |advanced       |regression v1 v2 by id2 id4 |               NA|               NA|         NA|
|1e9     |2e0 cardinality factor, 0% NAs, unsorted data   |advanced       |sum v3 count by id1:id6     |               NA|               NA|         NA|
|1e9     |1e2 cardinality factor, 0% NAs, pre-sorted data |basic          |sum v1 by id1               |            9.445|           12.693|  1.3438857|
|1e9     |1e2 cardinality factor, 0% NAs, pre-sorted data |basic          |sum v1 by id1:id2           |            8.576|           16.792|  1.9580224|
|1e9     |1e2 cardinality factor, 0% NAs, pre-sorted data |basic          |sum v1 mean v3 by id3       |           54.936|               NA|         NA|
|1e9     |1e2 cardinality factor, 0% NAs, pre-sorted data |basic          |mean v1:v3 by id4           |           11.435|               NA|         NA|
|1e9     |1e2 cardinality factor, 0% NAs, pre-sorted data |basic          |sum v1:v3 by id6            |           79.478|               NA|         NA|
|1e9     |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced       |median v3 sd v3 by id4 id5  |          193.551|               NA|         NA|
|1e9     |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced       |max v1 - min v2 by id3      |          244.512|               NA|         NA|
|1e9     |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced       |largest two v3 by id6       |               NA|               NA|         NA|
|1e9     |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced       |regression v1 v2 by id2 id4 |               NA|               NA|         NA|
|1e9     |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced       |sum v3 count by id1:id6     |               NA|               NA|         NA|
|1e9     |1e2 cardinality factor, 5% NAs, unsorted data   |basic          |sum v1 by id1               |               NA|               NA|         NA|
|1e9     |1e2 cardinality factor, 5% NAs, unsorted data   |basic          |sum v1 by id1:id2           |               NA|               NA|         NA|
|1e9     |1e2 cardinality factor, 5% NAs, unsorted data   |basic          |sum v1 mean v3 by id3       |               NA|               NA|         NA|
|1e9     |1e2 cardinality factor, 5% NAs, unsorted data   |basic          |mean v1:v3 by id4           |               NA|               NA|         NA|
|1e9     |1e2 cardinality factor, 5% NAs, unsorted data   |basic          |sum v1:v3 by id6            |               NA|               NA|         NA|
|1e9     |1e2 cardinality factor, 5% NAs, unsorted data   |advanced       |median v3 sd v3 by id4 id5  |               NA|               NA|         NA|
|1e9     |1e2 cardinality factor, 5% NAs, unsorted data   |advanced       |max v1 - min v2 by id3      |               NA|               NA|         NA|
|1e9     |1e2 cardinality factor, 5% NAs, unsorted data   |advanced       |largest two v3 by id6       |               NA|               NA|         NA|
|1e9     |1e2 cardinality factor, 5% NAs, unsorted data   |advanced       |regression v1 v2 by id2 id4 |               NA|               NA|         NA|
|1e9     |1e2 cardinality factor, 5% NAs, unsorted data   |advanced       |sum v3 count by id1:id6     |               NA|               NA|         NA|
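
A note on reading the table: the new2old column appears to be the newer timing divided by the older one. Below is a minimal Python sketch that recomputes it for three of the rows above. The `rows` list and the `new2old` helper are illustrative names only, not part of the benchmark code.

```python
# Timings in seconds copied from three groupby rows of the table above
# (1e7 rows, 1e2 cardinality factor, 0% NAs, unsorted data).
# Each tuple is (question, old timing 20210426, new timing 20210427).
rows = [
    ("sum v1 by id1", 0.304, 0.696),
    ("sum v1 by id1:id2", 0.084, 1.117),
    ("mean v1:v3 by id4", 0.185, 0.182),
]


def new2old(old, new):
    """Ratio of the new timing to the old one; above 1 means slower."""
    return new / old


for question, old, new in rows:
    ratio = new2old(old, new)
    label = "slower" if ratio > 1 else "not slower"
    print(f"{question:<20} {ratio:10.7f}  {label}")
```

Rows where either timing is NA (for example the 1e9 runs that did not finish) have no ratio, which is why new2old is NA there.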

join

|in_rows |knasorted               |question               | 20210426_78028b8| 20210427_78028b8|   new2old|
|:-------|:-----------------------|:----------------------|----------------:|----------------:|---------:|
|1e7     |0% NAs, unsorted data   |small inner on int     |            0.890|            0.932| 1.0471910|
|1e7     |0% NAs, unsorted data   |medium inner on int    |            0.808|            0.854| 1.0569307|
|1e7     |0% NAs, unsorted data   |medium outer on int    |            3.030|            2.416| 0.7973597|
|1e7     |0% NAs, unsorted data   |medium inner on factor |            0.959|            1.612| 1.6809176|
|1e7     |0% NAs, unsorted data   |big inner on int       |            2.387|            2.534| 1.0615836|
|1e7     |5% NAs, unsorted data   |small inner on int     |            0.999|            1.270| 1.2712713|
|1e7     |5% NAs, unsorted data   |medium inner on int    |            0.863|            1.332| 1.5434531|
|1e7     |5% NAs, unsorted data   |medium outer on int    |            3.154|            3.024| 0.9587825|
|1e7     |5% NAs, unsorted data   |medium inner on factor |            1.084|            1.979| 1.8256458|
|1e7     |5% NAs, unsorted data   |big inner on int       |            3.804|            4.654| 1.2234490|
|1e7     |0% NAs, pre-sorted data |small inner on int     |            0.720|            0.854| 1.1861111|
|1e7     |0% NAs, pre-sorted data |medium inner on int    |            0.742|            0.802| 1.0808625|
|1e7     |0% NAs, pre-sorted data |medium outer on int    |            2.449|            2.295| 0.9371172|
|1e7     |0% NAs, pre-sorted data |medium inner on factor |            0.840|            0.888| 1.0571429|
|1e7     |0% NAs, pre-sorted data |big inner on int       |            1.452|            1.529| 1.0530303|
|1e8     |0% NAs, unsorted data   |small inner on int     |           82.456|           91.359| 1.1079727|
|1e8     |0% NAs, unsorted data   |medium inner on int    |           94.706|          177.157| 1.8705995|
|1e8     |0% NAs, unsorted data   |medium outer on int    |          112.529|          187.385| 1.6652152|
|1e8     |0% NAs, unsorted data   |medium inner on factor |           96.223|          187.460| 1.9481829|
|1e8     |0% NAs, unsorted data   |big inner on int       |           91.470|          198.955| 2.1750847|
|1e8     |5% NAs, unsorted data   |small inner on int     |           92.100|           83.704| 0.9088382|
|1e8     |5% NAs, unsorted data   |medium inner on int    |           93.817|          155.296| 1.6553077|
|1e8     |5% NAs, unsorted data   |medium outer on int    |          110.364|          182.278| 1.6516074|
|1e8     |5% NAs, unsorted data   |medium inner on factor |           97.051|          191.718| 1.9754356|
|1e8     |5% NAs, unsorted data   |big inner on int       |          130.767|          247.625| 1.8936352|
|1e8     |0% NAs, pre-sorted data |small inner on int     |          100.430|           49.841| 0.4962760|
|1e8     |0% NAs, pre-sorted data |medium inner on int    |           90.533|           94.588| 1.0447903|
|1e8     |0% NAs, pre-sorted data |medium outer on int    |          104.135|           98.195| 0.9429587|
|1e8     |0% NAs, pre-sorted data |medium inner on factor |           93.672|          103.375| 1.1035848|
|1e8     |0% NAs, pre-sorted data |big inner on int       |           83.728|          103.645| 1.2378774|
|1e9     |0% NAs, unsorted data   |small inner on int     |               NA|               NA|        NA|
|1e9     |0% NAs, unsorted data   |medium inner on int    |               NA|               NA|        NA|
|1e9     |0% NAs, unsorted data   |medium outer on int    |               NA|               NA|        NA|
|1e9     |0% NAs, unsorted data   |medium inner on factor |               NA|               NA|        NA|
|1e9     |0% NAs, unsorted data   |big inner on int       |               NA|               NA|        NA|
|1e9     |5% NAs, unsorted data   |small inner on int     |               NA|               NA|        NA|
|1e9     |5% NAs, unsorted data   |medium inner on int    |               NA|               NA|        NA|
|1e9     |5% NAs, unsorted data   |medium outer on int    |               NA|               NA|        NA|
|1e9     |5% NAs, unsorted data   |medium inner on factor |               NA|               NA|        NA|
|1e9     |5% NAs, unsorted data   |big inner on int       |               NA|               NA|        NA|
|1e9     |0% NAs, pre-sorted data |small inner on int     |               NA|               NA|        NA|
|1e9     |0% NAs, pre-sorted data |medium inner on int    |               NA|               NA|        NA|
|1e9     |0% NAs, pre-sorted data |medium outer on int    |               NA|               NA|        NA|
|1e9     |0% NAs, pre-sorted data |medium inner on factor |               NA|               NA|        NA|
|1e9     |0% NAs, pre-sorted data |big inner on int       |               NA|               NA|        NA|

bkamins commented 3 years ago

> As noted above I think we added it because it uses less memory.

If we cannot resolve it otherwise we should re-introduce it, but I would hope that @quinnj will be able to make CSV.jl work well under default settings (these tests are great in highlighting the areas that are problematic).

bkamins commented 3 years ago

Just to highlight the issue: all the regressions are caused by string columns (and we are aware we need to improve this area, as Julia has a significantly different way of handling them than e.g. R). For reference, in the join tests we regress because of non-key columns (so the regression is essentially due to a basic getindex/copyto! slowdown).

bkamins commented 3 years ago

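Compare e.g. these lines:

|in_rows |knasorted               |question               | 20210426_78028b8| 20210427_78028b8|   new2old|
|:-------|:-----------------------|:----------------------|----------------:|----------------:|---------:|
|1e8     |0% NAs, unsorted data   |big inner on int       |           91.470|          198.955| 2.1750847|
|1e8     |0% NAs, pre-sorted data |big inner on int       |           83.728|          103.645| 1.2378774|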

In the previous benchmark we were faster on pre-sorted data, but only by a little (and the difference was expected). Now we have a huge gap in performance. The original difference of about 8 seconds is due to joining (we have a faster join for pre-sorted data), so the remaining gap of roughly 90 seconds is due to slower getindex/copyto! of non-key columns under the default CSV reading setup.
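
For anyone who wants to check the numbers, here is a minimal Python sketch of the arithmetic behind that estimate, using only the four timings from the two rows quoted above. The variable names are illustrative, and attributing the whole old gap to the join itself is the assumption stated in the comment, not something the sketch verifies.

```python
# Timings in seconds for "big inner on int" at 1e8 rows, from the join table.
old_unsorted, new_unsorted = 91.470, 198.955
old_sorted, new_sorted = 83.728, 103.645

# Gap between unsorted and pre-sorted data in each benchmark run.
old_gap = old_unsorted - old_sorted  # assumed to come from the join itself
new_gap = new_unsorted - new_sorted

# Part of the new gap that the faster pre-sorted join does not explain.
extra_gap = new_gap - old_gap

print(f"old gap:   {old_gap:.3f} s")
print(f"new gap:   {new_gap:.3f} s")
print(f"extra gap: {extra_gap:.3f} s")
```

The old gap comes out a bit under 8 seconds and the unexplained part a bit under 90 seconds, which matches the rounded figures in the comment.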