Is the use of mean consistent across tools?

h2oai / db-benchmark

reproducible benchmark of database-like ops

https://h2oai.github.io/db-benchmark

Mozilla Public License 2.0

Is the use of mean consistent across tools? #157

Closed. MichaelChirico closed this issue 3 years ago

MichaelChirico commented 3 years ago

I am looking at the benchmark, and Spark's performance on a task involving only mean aggregation stands out. Is Spark by chance skipping the error-correcting second pass that's done in R and data.table?

If so, that would seem to give an (IMO) unfair advantage to tools that give numerically inferior results.

At the least this could be pointed out somewhere (I don't see it mentioned anywhere in the repo thus far).
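For readers unfamiliar with the "double pass" being discussed, here is a minimal Python sketch of the general idea: compute an ordinary mean, then add back the mean of the residuals to recover some of the rounding error. The function names and the input data are hypothetical, chosen only to make the rounding visible; this is not the actual source of R, data.table, or Spark.

```python
from fractions import Fraction


def naive_mean(xs):
    """Single pass: accumulate a running sum, then divide by the count."""
    total = 0.0
    for x in xs:
        total += x
    return total / len(xs)


def two_pass_mean(xs):
    """Two passes: an ordinary mean, then a correction from the residuals."""
    n = len(xs)
    m = naive_mean(xs)
    # In exact arithmetic the residuals (x - m) sum to zero, so whatever
    # they sum to in floating point estimates the rounding error left in m.
    residual = 0.0
    for x in xs:
        residual += x - m
    return m + residual / n


# Hypothetical input: the small values are swallowed by the large running sum.
data = [1e16, 1.0, 1.0, 1.0] * 1000

# Exact rational reference, used only to measure each method's error.
exact = sum(Fraction(x) for x in data) / len(data)
naive_err = abs(Fraction(naive_mean(data)) - exact)
two_pass_err = abs(Fraction(two_pass_mean(data)) - exact)
print(float(naive_err), float(two_pass_err))
```

On this input the two-pass error should come out smaller than the single-pass error, at the cost of reading the data twice. That extra read is the speed-versus-accuracy trade-off the question is about.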

jangorecki commented 3 years ago

AFAIR DT does not do a double pass; only R does. It makes sense to use an ordinary mean, and if one is not available, then IMO it should be made available.

MichaelChirico commented 3 years ago

I see. I think I was mistaken in thinking GForce does a double pass. But I found this open issue:

https://github.com/Rdatatable/data.table/issues/1970

jangorecki commented 3 years ago

You can do a two-pass algorithm using frollmean with n equal to the length of x and algo = "exact" :)
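To illustrate why that trick works, here is a rough Python analogue: a rolling mean that applies a correction pass to each window returns the corrected whole-vector mean when the window is as wide as the input. The helper `rolling_mean_exact` is hypothetical and only mimics the idea; it is not data.table's implementation, and `None` stands in for `NA`.

```python
def rolling_mean_exact(xs, n):
    """Hypothetical helper: rolling mean of width n with a correction pass per window."""
    out = [None] * len(xs)  # positions without a complete window stay None
    for end in range(n, len(xs) + 1):
        window = xs[end - n:end]
        # First pass: ordinary mean of the window.
        total = 0.0
        for v in window:
            total += v
        m = total / n
        # Second pass: add back the mean residual to reduce rounding error.
        residual = 0.0
        for v in window:
            residual += v - m
        out[end - 1] = m + residual / n
    return out


x = [1e16, 1.0, 1.0, 1.0]  # hypothetical input
result = rolling_mean_exact(x, len(x))
whole_vector_mean = result[-1]  # the only complete window covers all of x
print(result)
```

With the window equal to the vector length, every position except the last is left empty, so the last element is the value of interest.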