Closed by bkamins 3 years ago
@jangorecki - I would like to see the next benchmark results with the default settings of DataFrames.jl and CSV.jl after the 1.0 release of DataFrames.jl. This would help us tune things if there is a need. Are you OK with this? (Also, the API got cleaned up a bit, as you can see in the changed code.)
CC @nalimilan @quinnj
I added some changes to how GC.gc is triggered, following the recommendation from Julia devs on how to set GC to its normal state before running the benchmark.
What is the point of 6ba25aa? The "ver" scripts are only used for dumping the package version, so the number of threads does not make any difference.
BTW, data.table's default is 50% of cores, but in each script we set it to use 100% (`setDTthreads(0L)`).
I am running 1.0.1 now, once it is completed I will merge this PR and run again on 1.0.1.
Ah - so this should be fixed. I will revert.
In what file/line do you start Julia process that does the actual computation? (I guess now it is single threaded - maybe we can leave it for the current benchmarks - to see single threaded performance and then in the next round enable threading to see the impact - this would be very useful for us)
OK - fixed.
> In what file/line do you start Julia process that does the actual computation?

https://github.com/h2oai/db-benchmark/blob/ff6975310f8f818462f7357551079bf0d8f9fc51/_launcher/solution.R#L162
It is being run using a `./script.jl` kind of command, so the header of the benchmark script file (`#!/usr/bin/env julia`) directs it to the julia process.
> then in the next round enable threading to see the impact

so ideally this would go in a separate PR then

> so ideally this would go in a separate PR then

Yes. I would make a separate PR later next month. Thank you!
I posted timings of 1.0.1 vs 0.22.7 in https://github.com/h2oai/db-benchmark/issues/195#issuecomment-827398822. There is a big speed-up in the majority of cases.
Thank you for posting this. When is the next re-run of benchmarks planned? (In the meantime we have found some cases that we are currently fixing, as 1.0 was a major rewrite.)
An additional question: https://h2oai.github.io/db-benchmark/history.html does not seem to be updated. Do I see it correctly?
The run including this PR should be finished in about 2 hours.
The history report looks fine to me; try another browser or clear the browser cache. It has happened to me multiple times that the browser was showing cached images.
Ah - thank you!
@quinnj - the benchmarks after merging this PR are out. Actually we get a significant regression, and for the `groupby` tests the 50GB cases run out of memory (in the previous settings they worked). We need to investigate the reason for the regression.
@jangorecki - thank you very much for providing such tests.
> @quinnj - the benchmarks after merging this PR are out. Actually we get a significant regression, and for the `groupby` tests the 50GB cases run out of memory (in the previous settings they worked). We need to investigate the reason for the regression.

Maybe we should reintroduce the `types` argument? As noted above I think we added it because it uses less memory.
Timings diff for this PR; it looks to be much slower:

|in_rows |knasorted |question_group |question | 20210426_78028b8| 20210427_78028b8| new2old|
|:-------|:-----------------------------------------------|:--------------|:---------------------------|----------------:|----------------:|----------:|
|1e7 |1e2 cardinality factor, 0% NAs, unsorted data |basic |sum v1 by id1 | 0.304| 0.696| 2.2894737|
|1e7 |1e2 cardinality factor, 0% NAs, unsorted data |basic |sum v1 by id1:id2 | 0.084| 1.117| 13.2976190|
|1e7 |1e2 cardinality factor, 0% NAs, unsorted data |basic |sum v1 mean v3 by id3 | 0.456| 1.428| 3.1315789|
|1e7 |1e2 cardinality factor, 0% NAs, unsorted data |basic |mean v1:v3 by id4 | 0.185| 0.182| 0.9837838|
|1e7 |1e2 cardinality factor, 0% NAs, unsorted data |basic |sum v1:v3 by id6 | 0.346| 0.348| 1.0057803|
|1e7 |1e2 cardinality factor, 0% NAs, unsorted data |advanced |median v3 sd v3 by id4 id5 | 1.861| 2.011| 1.0806018|
|1e7 |1e2 cardinality factor, 0% NAs, unsorted data |advanced |max v1 - min v2 by id3 | 1.400| 2.359| 1.6850000|
|1e7 |1e2 cardinality factor, 0% NAs, unsorted data |advanced |largest two v3 by id6 | 2.096| 2.410| 1.1498092|
|1e7 |1e2 cardinality factor, 0% NAs, unsorted data |advanced |regression v1 v2 by id2 id4 | 1.266| 2.046| 1.6161137|
|1e7 |1e2 cardinality factor, 0% NAs, unsorted data |advanced |sum v3 count by id1:id6 | 2.484| 2.378| 0.9573269|
|1e7 |1e1 cardinality factor, 0% NAs, unsorted data |basic |sum v1 by id1 | 0.322| 0.314| 0.9751553|
|1e7 |1e1 cardinality factor, 0% NAs, unsorted data |basic |sum v1 by id1:id2 | 0.073| 0.074| 1.0136986|
|1e7 |1e1 cardinality factor, 0% NAs, unsorted data |basic |sum v1 mean v3 by id3 | 0.610| 1.858| 3.0459016|
|1e7 |1e1 cardinality factor, 0% NAs, unsorted data |basic |mean v1:v3 by id4 | 0.206| 0.188| 0.9126214|
|1e7 |1e1 cardinality factor, 0% NAs, unsorted data |basic |sum v1:v3 by id6 | 0.627| 0.577| 0.9202552|
|1e7 |1e1 cardinality factor, 0% NAs, unsorted data |advanced |median v3 sd v3 by id4 id5 | 1.390| 1.430| 1.0287770|
|1e7 |1e1 cardinality factor, 0% NAs, unsorted data |advanced |max v1 - min v2 by id3 | 2.263| 3.534| 1.5616438|
|1e7 |1e1 cardinality factor, 0% NAs, unsorted data |advanced |largest two v3 by id6 | 4.423| 4.389| 0.9923129|
|1e7 |1e1 cardinality factor, 0% NAs, unsorted data |advanced |regression v1 v2 by id2 id4 | 0.930| 0.912| 0.9806452|
|1e7 |1e1 cardinality factor, 0% NAs, unsorted data |advanced |sum v3 count by id1:id6 | 2.485| 2.206| 0.8877264|
|1e7 |2e0 cardinality factor, 0% NAs, unsorted data |basic |sum v1 by id1 | 0.305| 0.299| 0.9803279|
|1e7 |2e0 cardinality factor, 0% NAs, unsorted data |basic |sum v1 by id1:id2 | 0.072| 0.070| 0.9722222|
|1e7 |2e0 cardinality factor, 0% NAs, unsorted data |basic |sum v1 mean v3 by id3 | 0.755| 1.879| 2.4887417|
|1e7 |2e0 cardinality factor, 0% NAs, unsorted data |basic |mean v1:v3 by id4 | 0.191| 0.188| 0.9842932|
|1e7 |2e0 cardinality factor, 0% NAs, unsorted data |basic |sum v1:v3 by id6 | 1.142| 1.140| 0.9982487|
|1e7 |2e0 cardinality factor, 0% NAs, unsorted data |advanced |median v3 sd v3 by id4 id5 | 1.126| 1.138| 1.0106572|
|1e7 |2e0 cardinality factor, 0% NAs, unsorted data |advanced |max v1 - min v2 by id3 | 3.156| 4.243| 1.3444233|
|1e7 |2e0 cardinality factor, 0% NAs, unsorted data |advanced |largest two v3 by id6 | 8.847| 8.438| 0.9537696|
|1e7 |2e0 cardinality factor, 0% NAs, unsorted data |advanced |regression v1 v2 by id2 id4 | 0.809| 0.819| 1.0123609|
|1e7 |2e0 cardinality factor, 0% NAs, unsorted data |advanced |sum v3 count by id1:id6 | 2.479| 2.289| 0.9233562|
|1e7 |1e2 cardinality factor, 0% NAs, pre-sorted data |basic |sum v1 by id1 | 0.328| 0.318| 0.9695122|
|1e7 |1e2 cardinality factor, 0% NAs, pre-sorted data |basic |sum v1 by id1:id2 | 0.090| 0.090| 1.0000000|
|1e7 |1e2 cardinality factor, 0% NAs, pre-sorted data |basic |sum v1 mean v3 by id3 | 0.457| 1.156| 2.5295405|
|1e7 |1e2 cardinality factor, 0% NAs, pre-sorted data |basic |mean v1:v3 by id4 | 0.182| 0.181| 0.9945055|
|1e7 |1e2 cardinality factor, 0% NAs, pre-sorted data |basic |sum v1:v3 by id6 | 0.383| 0.365| 0.9530026|
|1e7 |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced |median v3 sd v3 by id4 id5 | 1.898| 1.883| 0.9920969|
|1e7 |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced |max v1 - min v2 by id3 | 1.334| 2.024| 1.5172414|
|1e7 |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced |largest two v3 by id6 | 1.997| 2.087| 1.0450676|
|1e7 |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced |regression v1 v2 by id2 id4 | 0.857| 0.895| 1.0443407|
|1e7 |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced |sum v3 count by id1:id6 | 2.460| 2.176| 0.8845528|
|1e7 |1e2 cardinality factor, 5% NAs, unsorted data |basic |sum v1 by id1 | 0.343| 0.981| 2.8600583|
|1e7 |1e2 cardinality factor, 5% NAs, unsorted data |basic |sum v1 by id1:id2 | 0.084| 1.348| 16.0476190|
|1e7 |1e2 cardinality factor, 5% NAs, unsorted data |basic |sum v1 mean v3 by id3 | 0.486| 1.544| 3.1769547|
|1e7 |1e2 cardinality factor, 5% NAs, unsorted data |basic |mean v1:v3 by id4 | 0.325| 1.016| 3.1261538|
|1e7 |1e2 cardinality factor, 5% NAs, unsorted data |basic |sum v1:v3 by id6 | 0.400| 0.390| 0.9750000|
|1e7 |1e2 cardinality factor, 5% NAs, unsorted data |advanced |median v3 sd v3 by id4 id5 | 2.042| 2.389| 1.1699314|
|1e7 |1e2 cardinality factor, 5% NAs, unsorted data |advanced |max v1 - min v2 by id3 | 1.562| 2.441| 1.5627401|
|1e7 |1e2 cardinality factor, 5% NAs, unsorted data |advanced |largest two v3 by id6 | 2.338| 2.709| 1.1586826|
|1e7 |1e2 cardinality factor, 5% NAs, unsorted data |advanced |regression v1 v2 by id2 id4 | 2.575| 2.875| 1.1165049|
|1e7 |1e2 cardinality factor, 5% NAs, unsorted data |advanced |sum v3 count by id1:id6 | 2.776| 2.910| 1.0482709|
|1e8 |1e2 cardinality factor, 0% NAs, unsorted data |basic |sum v1 by id1 | 0.942| 5.113| 5.4278132|
|1e8 |1e2 cardinality factor, 0% NAs, unsorted data |basic |sum v1 by id1:id2 | 0.771| 11.613| 15.0622568|
|1e8 |1e2 cardinality factor, 0% NAs, unsorted data |basic |sum v1 mean v3 by id3 | 3.755| 20.063| 5.3430093|
|1e8 |1e2 cardinality factor, 0% NAs, unsorted data |basic |mean v1:v3 by id4 | 1.301| 1.613| 1.2398155|
|1e8 |1e2 cardinality factor, 0% NAs, unsorted data |basic |sum v1:v3 by id6 | 3.928| 4.024| 1.0244399|
|1e8 |1e2 cardinality factor, 0% NAs, unsorted data |advanced |median v3 sd v3 by id4 id5 | 13.667| 23.466| 1.7169825|
|1e8 |1e2 cardinality factor, 0% NAs, unsorted data |advanced |max v1 - min v2 by id3 | 16.306| 58.582| 3.5926653|
|1e8 |1e2 cardinality factor, 0% NAs, unsorted data |advanced |largest two v3 by id6 | 25.756| 98.894| 3.8396490|
|1e8 |1e2 cardinality factor, 0% NAs, unsorted data |advanced |regression v1 v2 by id2 id4 | 10.469| 37.493| 3.5813354|
|1e8 |1e2 cardinality factor, 0% NAs, unsorted data |advanced |sum v3 count by id1:id6 | 18.766| 67.441| 3.5937866|
|1e8 |1e1 cardinality factor, 0% NAs, unsorted data |basic |sum v1 by id1 | 1.000| 0.820| 0.8200000|
|1e8 |1e1 cardinality factor, 0% NAs, unsorted data |basic |sum v1 by id1:id2 | 0.710| 0.767| 1.0802817|
|1e8 |1e1 cardinality factor, 0% NAs, unsorted data |basic |sum v1 mean v3 by id3 | 9.344| 27.268| 2.9182363|
|1e8 |1e1 cardinality factor, 0% NAs, unsorted data |basic |mean v1:v3 by id4 | 1.319| 1.604| 1.2160728|
|1e8 |1e1 cardinality factor, 0% NAs, unsorted data |basic |sum v1:v3 by id6 | 12.834| 15.836| 1.2339099|
|1e8 |1e1 cardinality factor, 0% NAs, unsorted data |advanced |median v3 sd v3 by id4 id5 | 9.662| 11.922| 1.2339060|
|1e8 |1e1 cardinality factor, 0% NAs, unsorted data |advanced |max v1 - min v2 by id3 | 25.419| 51.539| 2.0275778|
|1e8 |1e1 cardinality factor, 0% NAs, unsorted data |advanced |largest two v3 by id6 | 56.707| 85.863| 1.5141517|
|1e8 |1e1 cardinality factor, 0% NAs, unsorted data |advanced |regression v1 v2 by id2 id4 | 8.683| 13.469| 1.5511920|
|1e8 |1e1 cardinality factor, 0% NAs, unsorted data |advanced |sum v3 count by id1:id6 | 21.002| 32.440| 1.5446148|
|1e8 |2e0 cardinality factor, 0% NAs, unsorted data |basic |sum v1 by id1 | 1.156| 0.880| 0.7612457|
|1e8 |2e0 cardinality factor, 0% NAs, unsorted data |basic |sum v1 by id1:id2 | 0.698| 0.682| 0.9770774|
|1e8 |2e0 cardinality factor, 0% NAs, unsorted data |basic |sum v1 mean v3 by id3 | 12.803| 26.985| 2.1077091|
|1e8 |2e0 cardinality factor, 0% NAs, unsorted data |basic |mean v1:v3 by id4 | 1.678| 1.205| 0.7181168|
|1e8 |2e0 cardinality factor, 0% NAs, unsorted data |basic |sum v1:v3 by id6 | 25.847| 25.933| 1.0033273|
|1e8 |2e0 cardinality factor, 0% NAs, unsorted data |advanced |median v3 sd v3 by id4 id5 | 9.075| 9.421| 1.0381267|
|1e8 |2e0 cardinality factor, 0% NAs, unsorted data |advanced |max v1 - min v2 by id3 | 40.396| 54.720| 1.3545896|
|1e8 |2e0 cardinality factor, 0% NAs, unsorted data |advanced |largest two v3 by id6 | 126.466| 136.319| 1.0779103|
|1e8 |2e0 cardinality factor, 0% NAs, unsorted data |advanced |regression v1 v2 by id2 id4 | 7.443| 7.577| 1.0180035|
|1e8 |2e0 cardinality factor, 0% NAs, unsorted data |advanced |sum v3 count by id1:id6 | 26.716| 31.831| 1.1914583|
|1e8 |1e2 cardinality factor, 0% NAs, pre-sorted data |basic |sum v1 by id1 | 1.157| 1.185| 1.0242005|
|1e8 |1e2 cardinality factor, 0% NAs, pre-sorted data |basic |sum v1 by id1:id2 | 0.886| 0.894| 1.0090293|
|1e8 |1e2 cardinality factor, 0% NAs, pre-sorted data |basic |sum v1 mean v3 by id3 | 3.597| 17.820| 4.9541284|
|1e8 |1e2 cardinality factor, 0% NAs, pre-sorted data |basic |mean v1:v3 by id4 | 1.198| 1.551| 1.2946578|
|1e8 |1e2 cardinality factor, 0% NAs, pre-sorted data |basic |sum v1:v3 by id6 | 3.997| 4.065| 1.0170128|
|1e8 |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced |median v3 sd v3 by id4 id5 | 14.362| 17.028| 1.1856287|
|1e8 |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced |max v1 - min v2 by id3 | 16.873| 38.902| 2.3055770|
|1e8 |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced |largest two v3 by id6 | 26.762| 44.560| 1.6650475|
|1e8 |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced |regression v1 v2 by id2 id4 | 5.496| 8.364| 1.5218341|
|1e8 |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced |sum v3 count by id1:id6 | 18.907| 32.453| 1.7164542|
|1e8 |1e2 cardinality factor, 5% NAs, unsorted data |basic |sum v1 by id1 | 1.000| 5.539| 5.5390000|
|1e8 |1e2 cardinality factor, 5% NAs, unsorted data |basic |sum v1 by id1:id2 | 0.809| 13.368| 16.5241038|
|1e8 |1e2 cardinality factor, 5% NAs, unsorted data |basic |sum v1 mean v3 by id3 | 3.727| 24.760| 6.6434129|
|1e8 |1e2 cardinality factor, 5% NAs, unsorted data |basic |mean v1:v3 by id4 | 1.733| 2.706| 1.5614541|
|1e8 |1e2 cardinality factor, 5% NAs, unsorted data |basic |sum v1:v3 by id6 | 4.255| 4.770| 1.1210341|
|1e8 |1e2 cardinality factor, 5% NAs, unsorted data |advanced |median v3 sd v3 by id4 id5 | 14.674| 32.064| 2.1850893|
|1e8 |1e2 cardinality factor, 5% NAs, unsorted data |advanced |max v1 - min v2 by id3 | 17.914| 52.269| 2.9177738|
|1e8 |1e2 cardinality factor, 5% NAs, unsorted data |advanced |largest two v3 by id6 | 26.448| 107.006| 4.0459014|
|1e8 |1e2 cardinality factor, 5% NAs, unsorted data |advanced |regression v1 v2 by id2 id4 | 15.541| 38.860| 2.5004826|
|1e8 |1e2 cardinality factor, 5% NAs, unsorted data |advanced |sum v3 count by id1:id6 | 19.549| 67.361| 3.4457517|
|1e9 |1e2 cardinality factor, 0% NAs, unsorted data |basic |sum v1 by id1 | 15.705| NA| NA|
|1e9 |1e2 cardinality factor, 0% NAs, unsorted data |basic |sum v1 by id1:id2 | 9.075| NA| NA|
|1e9 |1e2 cardinality factor, 0% NAs, unsorted data |basic |sum v1 mean v3 by id3 | 89.388| NA| NA|
|1e9 |1e2 cardinality factor, 0% NAs, unsorted data |basic |mean v1:v3 by id4 | 23.274| NA| NA|
|1e9 |1e2 cardinality factor, 0% NAs, unsorted data |basic |sum v1:v3 by id6 | 120.389| NA| NA|
|1e9 |1e2 cardinality factor, 0% NAs, unsorted data |advanced |median v3 sd v3 by id4 id5 | 195.448| NA| NA|
|1e9 |1e2 cardinality factor, 0% NAs, unsorted data |advanced |max v1 - min v2 by id3 | 357.317| NA| NA|
|1e9 |1e2 cardinality factor, 0% NAs, unsorted data |advanced |largest two v3 by id6 | NA| NA| NA|
|1e9 |1e2 cardinality factor, 0% NAs, unsorted data |advanced |regression v1 v2 by id2 id4 | NA| NA| NA|
|1e9 |1e2 cardinality factor, 0% NAs, unsorted data |advanced |sum v3 count by id1:id6 | NA| NA| NA|
|1e9 |1e1 cardinality factor, 0% NAs, unsorted data |basic |sum v1 by id1 | NA| 11.811| NA|
|1e9 |1e1 cardinality factor, 0% NAs, unsorted data |basic |sum v1 by id1:id2 | NA| 11.783| NA|
|1e9 |1e1 cardinality factor, 0% NAs, unsorted data |basic |sum v1 mean v3 by id3 | NA| NA| NA|
|1e9 |1e1 cardinality factor, 0% NAs, unsorted data |basic |mean v1:v3 by id4 | NA| NA| NA|
|1e9 |1e1 cardinality factor, 0% NAs, unsorted data |basic |sum v1:v3 by id6 | NA| NA| NA|
|1e9 |1e1 cardinality factor, 0% NAs, unsorted data |advanced |median v3 sd v3 by id4 id5 | NA| NA| NA|
|1e9 |1e1 cardinality factor, 0% NAs, unsorted data |advanced |max v1 - min v2 by id3 | NA| NA| NA|
|1e9 |1e1 cardinality factor, 0% NAs, unsorted data |advanced |largest two v3 by id6 | NA| NA| NA|
|1e9 |1e1 cardinality factor, 0% NAs, unsorted data |advanced |regression v1 v2 by id2 id4 | NA| NA| NA|
|1e9 |1e1 cardinality factor, 0% NAs, unsorted data |advanced |sum v3 count by id1:id6 | NA| NA| NA|
|1e9 |2e0 cardinality factor, 0% NAs, unsorted data |basic |sum v1 by id1 | NA| 9.736| NA|
|1e9 |2e0 cardinality factor, 0% NAs, unsorted data |basic |sum v1 by id1:id2 | NA| 12.841| NA|
|1e9 |2e0 cardinality factor, 0% NAs, unsorted data |basic |sum v1 mean v3 by id3 | NA| NA| NA|
|1e9 |2e0 cardinality factor, 0% NAs, unsorted data |basic |mean v1:v3 by id4 | NA| NA| NA|
|1e9 |2e0 cardinality factor, 0% NAs, unsorted data |basic |sum v1:v3 by id6 | NA| NA| NA|
|1e9 |2e0 cardinality factor, 0% NAs, unsorted data |advanced |median v3 sd v3 by id4 id5 | NA| NA| NA|
|1e9 |2e0 cardinality factor, 0% NAs, unsorted data |advanced |max v1 - min v2 by id3 | NA| NA| NA|
|1e9 |2e0 cardinality factor, 0% NAs, unsorted data |advanced |largest two v3 by id6 | NA| NA| NA|
|1e9 |2e0 cardinality factor, 0% NAs, unsorted data |advanced |regression v1 v2 by id2 id4 | NA| NA| NA|
|1e9 |2e0 cardinality factor, 0% NAs, unsorted data |advanced |sum v3 count by id1:id6 | NA| NA| NA|
|1e9 |1e2 cardinality factor, 0% NAs, pre-sorted data |basic |sum v1 by id1 | 9.445| 12.693| 1.3438857|
|1e9 |1e2 cardinality factor, 0% NAs, pre-sorted data |basic |sum v1 by id1:id2 | 8.576| 16.792| 1.9580224|
|1e9 |1e2 cardinality factor, 0% NAs, pre-sorted data |basic |sum v1 mean v3 by id3 | 54.936| NA| NA|
|1e9 |1e2 cardinality factor, 0% NAs, pre-sorted data |basic |mean v1:v3 by id4 | 11.435| NA| NA|
|1e9 |1e2 cardinality factor, 0% NAs, pre-sorted data |basic |sum v1:v3 by id6 | 79.478| NA| NA|
|1e9 |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced |median v3 sd v3 by id4 id5 | 193.551| NA| NA|
|1e9 |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced |max v1 - min v2 by id3 | 244.512| NA| NA|
|1e9 |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced |largest two v3 by id6 | NA| NA| NA|
|1e9 |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced |regression v1 v2 by id2 id4 | NA| NA| NA|
|1e9 |1e2 cardinality factor, 0% NAs, pre-sorted data |advanced |sum v3 count by id1:id6 | NA| NA| NA|
|1e9 |1e2 cardinality factor, 5% NAs, unsorted data |basic |sum v1 by id1 | NA| NA| NA|
|1e9 |1e2 cardinality factor, 5% NAs, unsorted data |basic |sum v1 by id1:id2 | NA| NA| NA|
|1e9 |1e2 cardinality factor, 5% NAs, unsorted data |basic |sum v1 mean v3 by id3 | NA| NA| NA|
|1e9 |1e2 cardinality factor, 5% NAs, unsorted data |basic |mean v1:v3 by id4 | NA| NA| NA|
|1e9 |1e2 cardinality factor, 5% NAs, unsorted data |basic |sum v1:v3 by id6 | NA| NA| NA|
|1e9 |1e2 cardinality factor, 5% NAs, unsorted data |advanced |median v3 sd v3 by id4 id5 | NA| NA| NA|
|1e9 |1e2 cardinality factor, 5% NAs, unsorted data |advanced |max v1 - min v2 by id3 | NA| NA| NA|
|1e9 |1e2 cardinality factor, 5% NAs, unsorted data |advanced |largest two v3 by id6 | NA| NA| NA|
|1e9 |1e2 cardinality factor, 5% NAs, unsorted data |advanced |regression v1 v2 by id2 id4 | NA| NA| NA|
|1e9 |1e2 cardinality factor, 5% NAs, unsorted data |advanced |sum v3 count by id1:id6 | NA| NA| NA|

|in_rows |knasorted |question | 20210426_78028b8| 20210427_78028b8| new2old|
|:-------|:-----------------------|:----------------------|----------------:|----------------:|---------:|
|1e7 |0% NAs, unsorted data |small inner on int | 0.890| 0.932| 1.0471910|
|1e7 |0% NAs, unsorted data |medium inner on int | 0.808| 0.854| 1.0569307|
|1e7 |0% NAs, unsorted data |medium outer on int | 3.030| 2.416| 0.7973597|
|1e7 |0% NAs, unsorted data |medium inner on factor | 0.959| 1.612| 1.6809176|
|1e7 |0% NAs, unsorted data |big inner on int | 2.387| 2.534| 1.0615836|
|1e7 |5% NAs, unsorted data |small inner on int | 0.999| 1.270| 1.2712713|
|1e7 |5% NAs, unsorted data |medium inner on int | 0.863| 1.332| 1.5434531|
|1e7 |5% NAs, unsorted data |medium outer on int | 3.154| 3.024| 0.9587825|
|1e7 |5% NAs, unsorted data |medium inner on factor | 1.084| 1.979| 1.8256458|
|1e7 |5% NAs, unsorted data |big inner on int | 3.804| 4.654| 1.2234490|
|1e7 |0% NAs, pre-sorted data |small inner on int | 0.720| 0.854| 1.1861111|
|1e7 |0% NAs, pre-sorted data |medium inner on int | 0.742| 0.802| 1.0808625|
|1e7 |0% NAs, pre-sorted data |medium outer on int | 2.449| 2.295| 0.9371172|
|1e7 |0% NAs, pre-sorted data |medium inner on factor | 0.840| 0.888| 1.0571429|
|1e7 |0% NAs, pre-sorted data |big inner on int | 1.452| 1.529| 1.0530303|
|1e8 |0% NAs, unsorted data |small inner on int | 82.456| 91.359| 1.1079727|
|1e8 |0% NAs, unsorted data |medium inner on int | 94.706| 177.157| 1.8705995|
|1e8 |0% NAs, unsorted data |medium outer on int | 112.529| 187.385| 1.6652152|
|1e8 |0% NAs, unsorted data |medium inner on factor | 96.223| 187.460| 1.9481829|
|1e8 |0% NAs, unsorted data |big inner on int | 91.470| 198.955| 2.1750847|
|1e8 |5% NAs, unsorted data |small inner on int | 92.100| 83.704| 0.9088382|
|1e8 |5% NAs, unsorted data |medium inner on int | 93.817| 155.296| 1.6553077|
|1e8 |5% NAs, unsorted data |medium outer on int | 110.364| 182.278| 1.6516074|
|1e8 |5% NAs, unsorted data |medium inner on factor | 97.051| 191.718| 1.9754356|
|1e8 |5% NAs, unsorted data |big inner on int | 130.767| 247.625| 1.8936352|
|1e8 |0% NAs, pre-sorted data |small inner on int | 100.430| 49.841| 0.4962760|
|1e8 |0% NAs, pre-sorted data |medium inner on int | 90.533| 94.588| 1.0447903|
|1e8 |0% NAs, pre-sorted data |medium outer on int | 104.135| 98.195| 0.9429587|
|1e8 |0% NAs, pre-sorted data |medium inner on factor | 93.672| 103.375| 1.1035848|
|1e8 |0% NAs, pre-sorted data |big inner on int | 83.728| 103.645| 1.2378774|
|1e9 |0% NAs, unsorted data |small inner on int | NA| NA| NA|
|1e9 |0% NAs, unsorted data |medium inner on int | NA| NA| NA|
|1e9 |0% NAs, unsorted data |medium outer on int | NA| NA| NA|
|1e9 |0% NAs, unsorted data |medium inner on factor | NA| NA| NA|
|1e9 |0% NAs, unsorted data |big inner on int | NA| NA| NA|
|1e9 |5% NAs, unsorted data |small inner on int | NA| NA| NA|
|1e9 |5% NAs, unsorted data |medium inner on int | NA| NA| NA|
|1e9 |5% NAs, unsorted data |medium outer on int | NA| NA| NA|
|1e9 |5% NAs, unsorted data |medium inner on factor | NA| NA| NA|
|1e9 |5% NAs, unsorted data |big inner on int | NA| NA| NA|
|1e9 |0% NAs, pre-sorted data |small inner on int | NA| NA| NA|
|1e9 |0% NAs, pre-sorted data |medium inner on int | NA| NA| NA|
|1e9 |0% NAs, pre-sorted data |medium outer on int | NA| NA| NA|
|1e9 |0% NAs, pre-sorted data |medium inner on factor | NA| NA| NA|
|1e9 |0% NAs, pre-sorted data |big inner on int | NA| NA| NA|
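
For anyone re-checking the tables above: the `new2old` column is consistent with the newer timing divided by the older one, so values above 1 mean the newer run was slower. A minimal Python sketch of that arithmetic, using three rows copied from the groupby table plus one NA row; the helper name `new2old` and the row layout are illustrative assumptions, not the benchmark's actual reporting code:

```python
# Rows copied from the groupby table above: (question, old_seconds, new_seconds).
# None stands for an NA timing (for example a run that did not complete).
rows = [
    ("sum v1 by id1", 0.304, 0.696),
    ("sum v1 by id1:id2", 0.084, 1.117),
    ("mean v1:v3 by id4", 0.185, 0.182),
    ("largest two v3 by id6 (1e9)", None, None),
]


def new2old(old, new):
    """Ratio of new to old timing; None if either timing is missing."""
    if old is None or new is None:
        return None
    return new / old


ratios = {question: new2old(old, new) for question, old, new in rows}

for question, ratio in ratios.items():
    label = "NA" if ratio is None else f"{ratio:.7f}"
    print(f"{question:30s} {label}")
```

Missing timings propagate to an NA ratio, which is why the 1e9 rows cannot be compared directly.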
> As noted above I think we added it because it uses less memory.

If we cannot resolve it otherwise we should re-introduce it, but I would hope that @quinnj will be able to make CSV.jl work well under default settings (these tests are great in highlighting the areas that are problematic).
Just to highlight the issue: all the regressions are caused by string columns (and we are aware we need to improve this area, as Julia has a significantly different way of handling them than e.g. R). For reference, in the join tests we regress because of non-key columns (so essentially the regression is due to a basic `getindex`/`copyto!` slowdown).
Compare e.g. lines:
|1e8 |0% NAs, unsorted data |big inner on int | 91.470| 198.955| 2.1750847|
|1e8 |0% NAs, pre-sorted data |big inner on int | 83.728| 103.645| 1.2378774|
In the previous benchmark we were faster on sorted data, but only by a little (and the difference was expected). Now we have a huge gap in performance. The original difference of about 8 seconds is due to joining (we have a faster join for pre-sorted data), so the remaining gap of roughly 90 seconds is due to slower `getindex`/`copyto!` of non-key columns under the default CSV reading setup.
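
The arithmetic behind that estimate can be checked directly from the two rows quoted above. A small Python sketch, assuming (as the comment does) that the advantage of the pre-sorted join itself is the same in both runs; the variable names are illustrative:

```python
# Timings in seconds, copied from the two "big inner on int" rows at 1e8 quoted above.
old_unsorted, new_unsorted = 91.470, 198.955
old_sorted, new_sorted = 83.728, 103.645

# Gap between unsorted and pre-sorted data in each run.
old_gap = old_unsorted - old_sorted  # attributed to the faster pre-sorted join
new_gap = new_unsorted - new_sorted

# Part of the new gap that the join advantage alone does not explain.
unexplained = new_gap - old_gap

print(f"old gap:     {old_gap:.3f} s")
print(f"new gap:     {new_gap:.3f} s")
print(f"unexplained: {unexplained:.3f} s")
```

This gives an old gap of about 7.7 s and a new gap of about 95.3 s, leaving roughly 87.6 s unexplained by the join, in line with the "about 8" and "roughly 90" seconds mentioned above.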
After the 1.0 release of DataFrames.jl we have cleaned up the API and the integration with CSV.jl, and it should be good to just use the default settings of the reader.