Closed - bkamins closed this 1 year ago
@jangorecki - I opened this PR because we have not run the DataFrames.jl benchmarks for some time (mostly because CSV.jl has not had a 1.0 release yet) and people are asking for the benchmarks to be re-run. Thank you!
CC @nalimilan @quinnj @cmey
Thank you for the PR. Could you please revert point 1 of this PR? There were ideas to use similar techniques in other tools, but I would like to keep the scripts reproducible interactively, line by line, at least where possible (it is not possible for dask and clickhouse). This makes it easier to debug and also easier to match corresponding code between solutions or languages. Note that this project is now maintained by @mattdowle.
Done - thank you!
@mattdowle - if you have any additional questions or comments, please let me know. In probably 1-2 months we will have a Julia 1.7 release and a CSV.jl 1.0 release, at which point I will open another PR to update the code to reflect those changes.
@quinnj - can you please review whether the current setup of CSV reading is correct? I still have WeakRefStrings.jl as a dependency, since CSV.jl does not export the string types yet. Thank you!
Yeah, LGTM
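For readers following along, here is a minimal sketch of the kind of CSV-reading setup being discussed; the file name is a placeholder and the code is illustrative, not the actual benchmark script:

```julia
# Illustrative sketch only -- not the actual db-benchmark script.
# WeakRefStrings.jl is loaded explicitly because, as noted above,
# pre-1.0 CSV.jl does not yet export its string types.
using CSV, DataFrames, WeakRefStrings

# "data.csv" is a placeholder path, not a benchmark data file.
df = CSV.read("data.csv", DataFrame)

# Inspect which string representation each column came back with.
for (name, col) in pairs(eachcol(df))
    println(name, " => ", eltype(col))
end
```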
@mattdowle - if you plan to re-run the benchmark, it would be great if this PR were merged first. Thank you!
Is it premature to merge this and benchmark?
Julia 1.6.3 is out and 1.7 is close. I can't tell from here whether it would use CSV.jl 0.9.x, which I believe would provide a speedup. Julia 1.6.3 may not be faster than 1.6.2 (or 1.6.1), used here, but we might as well use the latest, so that people will not be in doubt. Do you know if 1.7-rc1 is faster? If so, maybe wait a bit longer for its release?
I think the objective of this PR is to reflect the current state of DataFrames.jl performance (the versions of the packages etc. are always a moving target; that is why in the past the benchmarks were regularly re-run).
> Julia 1.6.3 is out
Indeed this PR requires an update - I will make the changes.
> I can't see here if it would use CSV.jl 0.9.x
It would.
> Do you know if the 1.7-rc1 is faster?
Given the planned changes in 1.7, I do not see any significant reason for it to be faster. The compilation latency will probably be a bit lower, but that does not matter much for large benchmarks (which are the most interesting ones).
Now that Julia 1.7 and DataFrames 1.3 are out, I think it is time to update and merge this.
I do not think these benchmarks have been maintained since early summer 2021.
@jangorecki can you confirm? If so, would you be willing to add a maintainer to the repo so the benchmark can continue to be updated? This is a really good set of benchmarks, and it would be a shame to lose it.
Unfortunately I am no longer working at H2O, so I have access neither to the repository to set a new maintainer, nor to the machine used to run the benchmark. You may try reaching out to @srisatish about finding a new maintainer.
Closing, as it seems the benchmarks are not maintained anymore.
I think having the PR open is fine. If the project is unmaintained, as it seems to be, then it would be best to mention that at the top of the README file, as there are multiple developers from different projects awaiting a reply from H2O. I always suggested contacting H2O support, but I have no idea what the status is.
> I always suggested to contact h2o support but have no idea what's the status.
I have tried contacting H2O with no response.
This PR's objective is to finalize the changes in https://github.com/h2oai/db-benchmark/pull/230, and it also implements some code changes:

- `String` or `Symbol` (as I get repeated requests to add it manually before CSV.jl gets a 1.0 release, when it will be the default);
- at end of line (so that it is present only for top-level statements)
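To illustrate the `String`-vs-`Symbol` point above: DataFrames.jl accepts both forms as column selectors, so the benchmark scripts can use either interchangeably. A minimal, illustrative example (not taken from the actual scripts):

```julia
using DataFrames

# Small example frame; names are placeholders.
df = DataFrame(v1 = 1:3, v2 = 4:6)

# Column selection: String and Symbol forms are equivalent.
df[:, "v1"] == df[:, :v1]   # same column either way

# The same holds in the transformation minilanguage.
combine(df, "v1" => sum => "v1_sum")
combine(df, :v1 => sum => :v1_sum)
```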